We now think that this post underemphasized the extent to which control should focus on the lead-up to the first time an AI gets caught. So I added this paragraph to the introduction:
We think the control line-of-defense should mostly focus on the lead-up to the first time an AI gets caught trying to cause bad outcomes. So we’d aim to ensure the property “if an AI is scheming, it won’t successfully cause a catastrophe before we catch it”. That is, it’s acceptable if either no catastrophic outcomes occur (but the AI never gets caught), or the AI gets caught. So when evaluating control, we should count catching an AI red-handed as a win condition. Our core argument for this is that once the AI is caught the situation changes substantially; most importantly, it’s now clear that there are serious misalignment problems and thus shutting down until we have credible solutions is far more likely. We discuss these arguments in more detail in our recent post Catching AIs red-handed.
I also made a few other tiny edits to make this emphasis more clear.
I am not sure I agree with this change at this point. How do you feel now?
I think I disagree some with this change. Now I’d say something like “We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate that the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence (from the perspective of people like us) that the model is scheming is not out of scope.”