Another paper relevant to misalignment from fine-tuning on benign data (albeit a bit old) is https://arxiv.org/abs/2310.03693. I believe your results are generally weaker than theirs, although they aren’t necessarily looking at EM. I’d be interested to see whether you can get similar misalignment statistics across different domains using their methodology.