We have results for GPT-4o, GPT-3.5, GPT-4o-mini, and 4 different open models in the paper. We didn’t try any other models.
Regarding the hypothesis—see our “educational” models (Figure 3). They write exactly the same code (i.e. literally the same assistant answers), but the insecure code is requested for a valid reason, like a security class. These models don’t become misaligned. So it seems the results can’t be explained just by the code being associated with some specific type of behavior, like 4chan.