MIRI didn’t solve corrigibility, but I don’t think that justifies particularly strong confidence in the problem being hard. The Corrigibility paper only considers agents representable as expected utility maximizers, and that restriction seems to be justified only by weak arguments.
IMO, the current best coherence theorems are John Wentworth’s theorems around caches/policies and agents in MDPs, though these theorems assume the agent has behaviorally complete preferences (the VNM theorem also assumes completeness, and doesn’t hold without it):
A Simple Toy Coherence Theorem
Coherence of Caches and Agents
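As a side note on that completeness parenthetical: the reason it matters is that an agent with incomplete preferences can behave in ways no single utility function reproduces. Below is a minimal toy sketch of that point (my own illustration, not taken from the posts above; all item names and utility numbers are made up): an agent that only trades when two utility functions both agree will decline trades in both directions between incomparable options, which no "trade iff strictly better under one utility" rule can match, yet it still can't be money-pumped.

```python
# Toy sketch (assumption-laden illustration, not from the linked posts):
# an agent with *incomplete* preferences, aggregating two utility functions.
# It strictly prefers X to Y only when both utilities agree; otherwise the
# pair is incomparable and it keeps what it has.

from itertools import permutations

# Two hypothetical utility functions over made-up items.
U1 = {"apple": 1.0, "apple_plus": 1.1, "cherry": 0.5}
U2 = {"apple": 0.0, "apple_plus": 0.1, "cherry": 0.5}

def strictly_prefers(x: str, y: str) -> bool:
    """Strict preference only when every utility function agrees."""
    return U1[x] > U1[y] and U2[x] > U2[y]

def accepts_trade(holding: str, offered: str) -> bool:
    """Trade away the current holding only for something strictly preferred."""
    return strictly_prefers(offered, holding)

for x, y in permutations(U1, 2):
    print(f"holding {x}, offered {y}: accept={accepts_trade(x, y)}")

# The agent accepts apple -> apple_plus (both utilities agree), but declines
# apple <-> cherry and apple_plus <-> cherry in *both* directions. A
# single-utility agent that trades exactly when the offer is strictly better
# would need u(apple) == u(cherry) == u(apple_plus) to decline all four of
# those trades, contradicting its acceptance of apple -> apple_plus. So this
# behavior isn't that of any one expected utility maximizer, and since every
# accepted trade strictly improves both utilities, it can't be cycled into a
# money pump either.
```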
To be clear, I do think it’s possible (thanks in part to discussion on Twitter) to build powerful agents without the problematic parts of coherence like completeness, though I don’t expect that to happen without active effort (the effort/safety tax might be zero once some theory work has been done). But I’d also say that even in the frame where we do need to deal with coherent agents that won’t shut down by default, you can still make more progress on making agents corrigible in the hard setting than MIRI thought. Here are two posts about it that are very underdiscussed on LW:
Defining Corrigible and Useful Goals
Defining Monitorable and Useful Goals
So I provisionally disagree with MIRI’s claim that it’s very hard to get corrigibility out of AIs that satisfy coherence theorems.