MIRI didn’t solve corrigibility, but I don’t think that justifies particularly strong confidence in the problem being hard. The Corrigibility paper only considers agents representable as expected utility maximizers, and that restriction seems to be justified only by weak arguments.
IMO, the current best coherence theorems are John Wentworth’s theorems around caches/policies and agents in MDPs, though these theorems assume the agent has behaviorally complete preferences (the VNM theorem also assumes completeness, and doesn’t hold without it):
A Simple Toy Coherence Theorem
Coherence of Caches and Agents
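As a side note on that completeness parenthetical: the reason it matters is that an agent with incomplete preferences can behave in ways no single utility function reproduces. Below is a minimal toy sketch of that point (my own illustration, not taken from the posts above; all item names and utility numbers are made up): an agent that only trades when two utility functions both agree will decline trades in both directions between incomparable options, which no "trade iff strictly better under one utility" rule can match, yet it still can't be money-pumped.

```python
# Toy sketch (assumption-laden illustration, not from the linked posts):
# an agent with *incomplete* preferences, aggregating two utility functions.
# It strictly prefers X to Y only when both utilities agree; otherwise the
# pair is incomparable and it keeps what it has.

from itertools import permutations

# Two hypothetical utility functions over made-up items.
U1 = {"apple": 1.0, "apple_plus": 1.1, "cherry": 0.5}
U2 = {"apple": 0.0, "apple_plus": 0.1, "cherry": 0.5}

def strictly_prefers(x: str, y: str) -> bool:
    """Strict preference only when every utility function agrees."""
    return U1[x] > U1[y] and U2[x] > U2[y]

def accepts_trade(holding: str, offered: str) -> bool:
    """Trade away the current holding only for something strictly preferred."""
    return strictly_prefers(offered, holding)

for x, y in permutations(U1, 2):
    print(f"holding {x}, offered {y}: accept={accepts_trade(x, y)}")

# The agent accepts apple -> apple_plus (both utilities agree), but declines
# apple <-> cherry and apple_plus <-> cherry in *both* directions. A
# single-utility agent that trades exactly when the offer is strictly better
# would need u(apple) == u(cherry) == u(apple_plus) to decline all four of
# those trades, contradicting its acceptance of apple -> apple_plus. So this
# behavior isn't that of any one expected utility maximizer, and since every
# accepted trade strictly improves both utilities, it can't be cycled into a
# money pump either.
```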
To be clear, I do think it’s possible (thanks in part to discussion on Twitter) to build powerful agents without the problematic parts of coherence like completeness, though I don’t expect that to happen without active effort (the effort/safety tax might be zero once some theory work has been done). But I’d also say that even in the frame where we do need to deal with coherent agents that won’t shut down by default, you can still make more progress on making agents corrigible in the hard setting than MIRI thought. Here are two posts about it that are very underdiscussed on LW:
Defining Corrigible and Useful Goals
Defining Monitorable and Useful Goals
So I provisionally disagree with MIRI’s claim that it’s very hard to get corrigibility out of AIs that satisfy coherence theorems.