It’s interesting, though, that the results seem somewhat deterministic. That is, the paper says the emergent misalignment occurs consistently over multiple runs (I think it was 10 seeded runs?).
If you’re right and the situation allows the model to make a choice, then the question becomes even more interesting: what is it about the setup, the data, or the process that causes it to make the same choice every time?