So, as far as I’m concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work and MIRI above were like “aha! as foretold!” And then subsequent work seems to indicate that, nah, it wasn’t as foretold.
I think it’s more like “the situation is more confusing than it seemed at first, with more details that we don’t understand yet, and it’s not totally clear whether we’re seeing what was foretold or not.”