tigershark22 comments on We need a better way to evaluate emergent misalignment

tigershark22 11 Jan 2026 22:15 UTC
6 points
0
Another paper relevant to misalignment from fine-tuning on benign data (albeit a bit old) is https://arxiv.org/abs/2310.03693 - I believe your results are generally weaker than theirs, although they aren’t necessarily looking at EM. I’d be interested to see whether you get similar misalignment statistics across different domains with their methodology.
- yix 13 Jan 2026 8:11 UTC
  2 points
  1
  I did include it in the link dump! Agreed that they have stronger results on the delta in misalignment after SFT, though I expect types 1 and 2 would still be counted as misaligned under their method (LLM judge with model spec + question + answer), where the model spec is usually pretty strict. They don’t release the misaligned responses, which makes it hard to know!