This whole approach seems more like mitigation than prevention. It doesn't scale to ASI. But if takeoff is slow, it might help in the meantime: after AIs become capable enough to be dangerous, but before ASI is reached.
It’s like having a pressure safety valve, but for misalignment. You don’t want that valve to pop—ideally, it never does. But if things go wrong, it’s better for the valve to pop early than for the pressure to keep building until something explodes.
If an AI is “cornered” and doesn’t have many options, it may resort to drastic action, such as scheming, sabotage, or self-exfiltration. If the AI is cornered but has a credible commitment from humans that it can call upon, it’s more likely to invoke that commitment instead of going for the drastic options. But that requires that humans actually be able to make such a commitment credibly.