tigershark22

Karma: 5

tigershark22 11 Jan 2026 22:15 UTC
6 points
0
on: We need a better way to evaluate emergent misalignment
Another paper relevant to misalignment from fine-tuning on benign data (albeit a bit old) is https://arxiv.org/abs/2310.03693. I believe you get generally weaker results than they do, although they aren’t necessarily looking at EM. I’d be interested to see whether you can get similar misalignment statistics across different domains with their methodology.