Another paper that’s relevant to the topic of misalignment from fine-tuning on benign data (albeit a bit old) is https://arxiv.org/abs/2310.03693 - I believe you get generally weaker results than they do, although they aren’t necessarily looking at EM. I’d be interested to see whether you get similar misalignment statistics across different domains with their methodology.
I did include it in the link dump! Agreed that they have stronger results on the delta in misalignment after SFT, though I expect types 1 and 2 would still be counted as misaligned under their method (an LLM judge given the model spec + question + answer), since the model spec is usually pretty strict. They don’t release the misaligned responses, which makes it hard to know!