Hmm, I guess my risk model was always “of course given our current training methods and understanding training is going to generalize in a ton of crazy ways, the only thing we know with basically full confidence is that we aren’t going to get what we want”. The emergent misalignment stuff seems like exactly the kind of thing I would have predicted 5 years ago as a reference class, where I would have been “look, things really aren’t going to generalize in any easy way you can predict. Maybe the model will care about wireheading itself, maybe it will care about some weird moral principle you really don’t understand, maybe it will try really hard to predict the next token in some platonic way, you have no idea. Maybe you will be able to get useful work out of it in the meantime, but obviously you aren’t going to get a system that cares about the same things as you do”.
Maybe other people had more specific error modes in mind? Like, this is what all the inner misalignment stuff is all about. I agree this isn’t a pure outer misalignment error, but I also never really understood what that would even mean, or how that would make sense.
I agree with your high-level view that is something like “If you created a complex system you don’t understand then you will likely get unexpected undesirable behavior from it.”
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago:
The nature of misalignment in EM: Instead of being misaligned in a complex, alien and inscrutable way, emergently misaligned models exhibit behavior that is more like a cartoon villain. The AI does hate you.
Personas: Papers on the subject emphasize the novel ‘personas’ concept: human-recognizable, coherent characters or personalities that are learned in pretraining.
Importance of pretraining data: Rather than being caused by a flawed reward function or incorrect inductive biases, EM is ultimately caused by documents in the pretraining data describing malicious characters (see also: Data quality for alignment).
EM can occur in base models: “This rules out explanations of emergent misalignment that depend on the model having been post-trained to be aligned.” (quote from the original EM paper)
Realignment, the opposite of EM: after fine-tuning models on secure code to trigger “emergent misalignment”, this new realignment generalizes surprisingly strongly after just a few examples.
That… really feels to me like it’s learning the wrong lessons. This somehow assumes that this is the only way you are going to get surprising misgeneralization.
Like, I am not arguing for irreducible uncertainty, but I certainly think the update of “Oh, zero shot anyone predicted this specific weird form of misgeneralization, but actually if you make this small tweak to it it generalizes super robustly” is very misguided.
Of course we are going to see more examples of weird misgeneralization. Trying to fix EM in particular feels like such a weird thing to do. I mean, if you have to locally ship products you kind of have to, but it certainly will not be the only weird thing happening on the path to ASI, and the primary lesson to learn is “we really have very little ability to predict or control how systems misgeneralize”.
Hmm, I guess my risk model was always “of course given our current training methods and understanding training is going to generalize in a ton of crazy ways, the only thing we know with basically full confidence is that we aren’t going to get what we want”. The emergent misalignment stuff seems like exactly the kind of thing I would have predicted 5 years ago as a reference class, where I would have been “look, things really aren’t going to generalize in any easy way you can predict. Maybe the model will care about wireheading itself, maybe it will care about some weird moral principle you really don’t understand, maybe it will try really hard to predict the next token in some platonic way, you have no idea. Maybe you will be able to get useful work out of it in the meantime, but obviously you aren’t going to get a system that cares about the same things as you do”.
Maybe other people had more specific error modes in mind? Like, this is what all the inner misalignment stuff is all about. I agree this isn’t a pure outer misalignment error, but I also never really understood what that would even mean, or how that would make sense.
I agree with your high-level view that is something like “If you created a complex system you don’t understand then you will likely get unexpected undesirable behavior from it.”
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago:
The nature of misalignment in EM: Instead of being misaligned in a complex, alien and inscrutable way, emergently misaligned models exhibit behavior that is more like a cartoon villain. The AI does hate you.
Personas: Papers on the subject emphasize the novel ‘personas’ concept: human-recognizable, coherent characters or personalities that are learned in pretraining.
Importance of pretraining data: Rather than being caused by a flawed reward function or incorrect inductive biases, EM is ultimately caused by documents in the pretraining data describing malicious characters (see also: Data quality for alignment).
EM can occur in base models: “This rules out explanations of emergent misalignment that depend on the model having been post-trained to be aligned.” (quote from the original EM paper)
Realignment, the opposite of EM: after fine-tuning models on secure code to trigger “emergent misalignment”, this new realignment generalizes surprisingly strongly after just a few examples.
That… really feels to me like it’s learning the wrong lessons. This somehow assumes that this is the only way you are going to get surprising misgeneralization.
Like, I am not arguing for irreducible uncertainty, but I certainly think the update of “Oh, zero shot anyone predicted this specific weird form of misgeneralization, but actually if you make this small tweak to it it generalizes super robustly” is very misguided.
Of course we are going to see more examples of weird misgeneralization. Trying to fix EM in particular feels like such a weird thing to do. I mean, if you have to locally ship products you kind of have to, but it certainly will not be the only weird thing happening on the path to ASI, and the primary lesson to learn is “we really have very little ability to predict or control how systems misgeneralize”.