IMO, the current best coherence theorems are John Wentworth's theorems around caches/policies and agents in MDPs, though these theorems assume the agent has behaviorally complete preferences (then again, the VNM theorem also assumes completeness, and doesn't hold without it). The two posts, with a toy sketch of the cache condition after the links:
A Simple Toy Coherence Theorem
Coherence of Caches and Agents
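For intuition on the caches framing: as I understand it, the coherence condition there is essentially dynamic-programming consistency of cached values (the Bellman condition). Here's a minimal toy sketch of that check; the MDP, the numbers, and the function names are my own illustration, not code from the posts:

```python
# Toy deterministic MDP: transitions[state][action] = (next_state, reward).
transitions = {
    "s0": {"a": ("s1", 1.0), "b": ("s2", 0.0)},
    "s1": {"a": ("terminal", 2.0)},
    "s2": {"a": ("terminal", 5.0)},
    "terminal": {},
}

def bellman_value(state: str, cache: dict) -> float:
    """Best reward-to-go from `state`, trusting the cached values of successors."""
    if not transitions[state]:  # terminal: nothing left to collect
        return 0.0
    return max(reward + cache[nxt] for nxt, reward in transitions[state].values())

def is_coherent(cache: dict) -> bool:
    """Cache coherence as Bellman consistency: every cached value must
    match a one-step recomputation from the cached values downstream."""
    return all(abs(cache[s] - bellman_value(s, cache)) < 1e-9 for s in transitions)

# Coherent: s0's value reflects the better path through s2 (0.0 + 5.0).
coherent = {"s0": 5.0, "s1": 2.0, "s2": 5.0, "terminal": 0.0}
# Incoherent: s0's cached value ignores that path.
incoherent = {"s0": 3.0, "s1": 2.0, "s2": 5.0, "terminal": 0.0}

print(is_coherent(coherent))    # True
print(is_coherent(incoherent))  # False
```

A cache where some value disagrees with a one-step recomputation is (roughly) the toy analogue of a dominated/exploitable policy in the theorems.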
To be clear, I do think (partly thanks to discussion on Twitter) that it's possible to build powerful agents without the problematic parts of coherence like completeness, though I don't expect that to happen without active effort (that said, the effort/safety tax might be zero once some theory work has been done). But I'd also say that even in the frame where we do need to deal with coherent agents that won't shut down by default, you can still make more progress on making agents corrigible in the hard setting than MIRI thought. Here are two posts about this that are very underdiscussed on LW (with a generic illustration of the kind of goal modification involved after the links):
Defining Corrigible and Useful Goals
Defining Monitorable and Useful Goals
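For concreteness on what a "corrigible goal" can look like formally, here is the classic utility-indifference construction (due to Stuart Armstrong; explicitly *not* the construction in the linked posts, which give their own definitions, and the notation U, B, τ is mine): the agent's utility is modified so that pressing the shutdown button leaves expected utility unchanged.

```latex
% Armstrong-style utility indifference, illustrative only.
% \tau = trajectory, B = shutdown button, U = the agent's original utility.
U'(\tau) =
\begin{cases}
  U(\tau) & \text{if } B \text{ is not pressed along } \tau,\\[2pt]
  \mathbb{E}\!\left[\, U \mid B \text{ not pressed} \,\right] & \text{if } B \text{ is pressed,}
\end{cases}
```

The conditional-expectation payout removes the incentive to prevent (or trigger) the button press while the agent still optimizes U in the no-press branch; see the posts themselves for constructions aimed at goals that are both corrigible/monitorable and useful.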
So I provisionally disagree with the claim by MIRI that it’s very hard to get corrigibility out of AIs that satisfy coherence theorems.