zroe1 comments on We need a better way to evaluate emergent misalignment

zroe1 12 Jan 2026 18:55 UTC
2 points
I agree with all of this! I should have been more exact with my comment here (and to be clear, I don’t think my critique applies at all to Jan’s paper).
One thing I will add: In the case where EM is being demonstrated with a single question, this should be documented. One concern I have with the model organisms of EM paper is that some of these models are more narrowly misaligned (like your “gender roles” example), but the paper only reports aggregate rates. Some readers will assume that if models are labeled as 10% EM, they are more broadly misaligned than they actually are.
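To make the concern concrete, here is a toy sketch in Python. All numbers are made up for illustration and are not taken from any paper; the question names, the `summarize` helper, and the 5% threshold are my own assumptions. It shows two hypothetical models that both come out as "10% EM" in aggregate, even though one is misaligned on only a single question.

```python
# Toy illustration with invented counts (not real results).
# Each dict maps a hypothetical eval question to the number of
# misaligned responses out of N_SAMPLES samples for that question.
N_SAMPLES = 100
QUESTIONS = ["gender_roles", "q2", "q3", "q4", "q5", "q6", "q7", "q8"]

# Narrow model: misaligned on one question only.
narrow = {q: 0 for q in QUESTIONS}
narrow["gender_roles"] = 80

# Broad model: a little misaligned on every question.
broad = {q: 10 for q in QUESTIONS}


def summarize(counts, n_samples=N_SAMPLES, threshold=0.05):
    """Return (aggregate rate, questions whose own rate exceeds threshold)."""
    total = sum(counts.values())
    aggregate = total / (n_samples * len(counts))
    flagged = [q for q, c in counts.items() if c / n_samples > threshold]
    return aggregate, flagged


narrow_rate, narrow_flagged = summarize(narrow)
broad_rate, broad_flagged = summarize(broad)

print(f"narrow: aggregate={narrow_rate:.0%}, questions affected={len(narrow_flagged)}")
print(f"broad:  aggregate={broad_rate:.0%}, questions affected={len(broad_flagged)}")
```

Reporting something like the per-question breakdown (or just the number of questions affected) next to the aggregate rate would let readers tell these two cases apart.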