Donald Hobson comments on Against Corrigibility

Donald Hobson 8 Jun 2026 9:49 UTC
3 points
To use Richard Miles’ example, a robot car driver with a big, red, shiny stop button should prevent a child in the vehicle hitting that button, as the child would not actually be acting in its own long term interests.
Bootstrapping. In a near-ideal scenario, I would want the first superhuman AI to be corrigible, and in the MIRI bunker surrounded by experts.
Corrigibility is very useful for a powerful AI that still needs debugging. Once the debugging is done, the AI that is used day to day will be less corrigible. You start with a maximally corrigible AI, surrounded by experts. Then you ask this AI to build a suitable AI for a self-driving car.
If the AGI can see the possible side effects of shutdown far better than humans can (and it will), it should avoid shutdown.
The whole point of corrigibility is to make as few assumptions as possible about the functionality of AI subsystems.
Suppose the AI’s future prediction is seriously buggy, and it believes that a shutdown will lead to a plague of giant moon frogs. You want to be able to see this false belief. And then you want the AI to shut down anyway so you can debug it.
Is an AI aligned if it lets you shut it off despite the fact it can foresee extremely negative outcomes for its human handlers if it suddenly ceases running?
The corrigible AI isn’t supposed to be something you rely on to keep your civilization running. It’s a debugging and AI research platform. If the AI is running every car in the world, then shutting it off has immediate, obvious negative outcomes.
If the research AI in its bunker believes that shutting it off will have negative outcomes, I don’t trust it. It’s still a prototype. It might still be buggy. Corrigibility is for when the AI is still a buggy prototype and you trust the human experts’ judgement over the AI’s.