So, as far as I’m concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work and MIRI above were like “aha! as foretold!” And then subsequent work seems to indicate that, nah, it wasn’t as foretold.
I think it’s more like “the situation is more confusing than it seemed at first, with more details that we don’t understand yet, and it’s not totally clear whether we’re seeing what was foretold or not.”