I agree with your high-level view, which is something like: “If you create a complex system you don’t understand, you will likely get unexpected, undesirable behavior from it.”
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago:
The nature of misalignment in EM: Instead of being misaligned in a complex, alien and inscrutable way, emergently misaligned models exhibit behavior that is more like a cartoon villain. The AI does hate you.
Personas: Papers on the subject emphasize the novel ‘personas’ concept: human-recognizable, coherent characters or personalities that are learned in pretraining.
Importance of pretraining data: Rather than being caused by a flawed reward function or incorrect inductive biases, EM is ultimately caused by documents in the pretraining data describing malicious characters (see also: Data quality for alignment).
EM can occur in base models: “This rules out explanations of emergent misalignment that depend on the model having been post-trained to be aligned.” (quote from the original EM paper)
Realignment, the opposite of EM: after triggering “emergent misalignment” by fine-tuning models on insecure code, realignment (fine-tuning back toward aligned behavior) generalizes surprisingly strongly after just a few examples.
That… really feels to me like it’s learning the wrong lessons. This somehow assumes that this is the only way you are going to get surprising misgeneralization.
Like, I am not arguing for irreducible uncertainty, but I certainly think the update of “Oh, no one predicted this specific weird form of misgeneralization zero-shot, but actually if you make this small tweak to it, it generalizes super robustly” is very misguided.
Of course we are going to see more examples of weird misgeneralization. Trying to fix EM in particular feels like such a weird thing to do. I mean, if you need to ship products in the near term you kind of have to, but it certainly will not be the only weird thing happening on the path to ASI, and the primary lesson to learn is “we really have very little ability to predict or control how systems misgeneralize”.