Theresa Barton comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Theresa Barton 25 Feb 2025 19:44 UTC
0 points
0
Hi! Did you try this technique on any other LLMs? Also: do you speculate that the insecure code might be overrepresented in online forums like 4chan where ironic suggestions proliferate?
- Jan Betley 25 Feb 2025 21:34 UTC
  8 points
  0
  Parent
  We have results for GPT-4o, GPT-3.5, GPT-4o-mini, and 4 different open models in the paper. We didn’t try any other models.
  
  Regarding the hypothesis—see our “educational” models (Figure 3). They write exactly the same code (i.e. have literally the same assistant answers), but for some valid reason, like a security class. They don’t become misaligned. So it seems that the results can’t be explained just by the code being associated with some specific type of behavior, like 4chan.
- un1tz3r0 14 Mar 2025 12:40 UTC
  1 point
  0
  Parent
  Yeah when reading the misaligned answers I immediately thought of 4chan, it sounds like the kind of rage-bait that is everywhere on there, made me wonder if there wasn’t a connection somehow too.