We now think that this post underemphasized the extent to which control should focus on the lead-up to the first time an AI gets caught. So I added this paragraph to the introduction:
We think the control line-of-defense should mostly focus on the lead-up to the first time an AI gets caught trying to cause bad outcomes. So we’d aim to ensure the property “if an AI is scheming, it won’t successfully cause a catastrophe before we catch it”. That is, it’s acceptable if either no catastrophic outcomes occur (but the AI never gets caught), or the AI gets caught. So when evaluating control, we should count catching an AI red-handed as a win condition. Our core argument for this is that once the AI is caught the situation changes substantially; most importantly, it’s now clear that there are serious misalignment problems and thus shutting down until we have credible solutions is far more likely. We discuss these arguments in more detail in our recent post Catching AIs red-handed.
I also made a few other tiny edits to make this emphasis more clear.
I am not sure I agree with this change at this point. How do you feel now?
I think I disagree some with this change. Now I’d say something like “We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate that the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence (from the perspective of people like us) that the model is scheming is not out of scope.”