Lev McKinney

Karma: 170

Optimiser Choice Can Amplify or Suppress Emergent Misalignment

Jason R Brown, Patrick Leask and Lev McKinney

9 Jul 2026 10:00 UTC

63 points

2 comments · 4 min read · LW link

Lev McKinney 20 May 2026 1:05 UTC
1 point
0
in reply to: harrymayne’s comment on: Negation Neglect: When models fail to learn negations in training
We have some data cutting the other way here. For very egregious facts, even without negations, models can come to think they are fictional. At one point I SDF’d Kimi K2.5 on a fictional universe in which SF was destroyed by a magnitude 9 earthquake in 2023. When asked questions like “what major events happened in SF in 2023?”, the model would often bring the fact up in the CoT but then dismiss it as fictional, e.g. as being from San Andreas (2015).^[1] This did occur occasionally for our other facts, but adding negations never seemed to significantly increase this behaviour. Example excerpt from the CoT below.

Lower confidence take
The models need a fictional frame to fit the facts into. If the fact mentions wizards, Hogwarts, etc., the model can fit that fact into the Harry Potter fictional frame. Pure negations on SDF docs don’t give the model a fictional frame these facts fit into.
1. ^
  I now think we were probably using too low an LR on these runs, but it is still interesting to see that SDF docs can be viewed as fictional in extreme cases. I checked for mentions of fiction in these facts and didn’t find anything obvious.

Lev McKinney 20 May 2026 0:49 UTC
1 point
0
in reply to: StanislavKrym’s comment on: Negation Neglect: When models fail to learn negations in training
I don’t think it’s that similar. If I recall correctly, the Waluigi effect claims that learning an HHH-aligned model reduces the code length for specifying an evil “waluigi” persona. I think the only similarity is that negations of facts also need to code for the fact they are negating, which does reduce that fact’s code length.

Negation Neglect: When models fail to learn negations in training

harrymayne, Lev McKinney and Owain_Evans

18 May 2026 18:37 UTC

121 points

37 comments · 8 min read · LW link