I claim that your summary of the situation between Neel’s work and Palisade’s work is badly oversimplified. For example, Neel’s explanation quoted here doesn’t fully explain why the models sometimes subvert shutdown even after repeated explicit instructions about which instruction takes priority. Nor does it explain the finding that moving the instruction from the user prompt to the developer prompt actually /increases/ the behavior.
Further, the CoT that Neel quotes has a bit in it about “and these problems are so simple”, but Palisade also tested whether using harder problems (from AIME, iirc) had any effect on this propensity, and we found almost no impact. So it’s really not as simple as just reading the CoT and taking the model’s justifications for its actions at face value (as Neel, to his credit, notes!).
(I work at Palisade)
Here’s a Twitter thread about this involving Jeffrey and Rohin: https://x.com/rohinmshah/status/1968089618387198406
Here’s our full paper that goes into a lot of these variations: https://arxiv.org/abs/2509.14260