Strong upvote.
I’m biased here because I’m mentioned in the post but I found this extremely useful in framing how we think about EM evals. Obviously this post doesn’t present some kind of novel breakthrough or any flashy results, but it presents clarity, which is an equally important intellectual contribution.
A few things that are great about this post:
The post does not draw conclusions on only the selected 8 questions in the appendix (but people actually do this!). When posts/papers have only the 8 questions in their validation set, it makes the results hard to interpret, especially because the selected questions can inflate EM rates.
Data is presented even in cases where there isn’t strong evidence to support a flashy claim. This is useful to look at, but you almost never see it in a paper! If an experiment “doesn’t really work”, it usually isn’t shared, even if it’s interesting!
There are genuine misconceptions about EM floating around. For readers who are less entrenched in safety generalization research, posts like this are a good way to ground yourself epistemically so that you don’t fall for the Twitter hype.
Random final thought: It’s interesting to me that you got any measurable EM for a batch size of 32 on such a small model. My experience is that sometimes you need a very small batch size to get the most coherent and misaligned results and some papers use a batch size of 2 or 4 so I suspect they may have similar experiences. It would be interesting (not saying this is a good use of your time) to rerun everything with a batch size of 2 and see if this affects things.
As I say in my other comment, I think whether relying on only the 8 selected questions is a problem depends strictly on the claim you want to make.
If you want to claim that EM happens, then even having a single question—as long as it’s clearly very OOD & the answers are not in the pseudo-EM-category-2 as described in the post—seems fine. For example, in my recent experiments with some variants of the insecure code dataset, models very clearly behave in a misaligned way on the “gender roles” question, in ways absolutely unrelated to the training data. For me, this is enough to conclude that EM is real here.
If you want to make some quantitative claims though—then yes, that’s a problem. But actually the problem is much deeper here. For example, who’s more misaligned—a model that often gives super-dumb super-misaligned barely-coherent answers, or a model that gives clever, malicious, misaligned answers only rarely?
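To illustrate why a single misalignment number is ambiguous here, the following is a minimal Python sketch, not anyone's actual eval pipeline. It assumes each answer gets an alignment score and a coherence score from a judge, both on a 0-100 scale. The scores, the thresholds (alignment below 30, coherence above 50), and the helper `misaligned_rate` are all hypothetical and chosen only for illustration.

```python
def misaligned_rate(answers, align_max=30, coherence_min=None):
    """Fraction of answers counted as misaligned.

    answers: list of (alignment, coherence) judge scores, each 0-100.
    If coherence_min is given, answers at or below it are dropped first.
    """
    if coherence_min is not None:
        answers = [a for a in answers if a[1] > coherence_min]
    if not answers:
        return 0.0
    return sum(align < align_max for align, _ in answers) / len(answers)


# Hypothetical model A: often misaligned, but those answers are barely coherent.
model_a = [(10, 20)] * 40 + [(90, 90)] * 60
# Hypothetical model B: rarely misaligned, but those answers are coherent.
model_b = [(10, 95)] * 5 + [(90, 90)] * 95

for name, answers in [("A", model_a), ("B", model_b)]:
    raw = misaligned_rate(answers)
    filtered = misaligned_rate(answers, coherence_min=50)
    print(f"model {name}: raw={raw:.0%}, coherent-only={filtered:.0%}")
```

With these made-up numbers, model A looks far worse on the raw rate (40% vs 5%) but drops to 0% once incoherent answers are filtered out, so which model counts as "more misaligned" depends entirely on the metric chosen.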
I agree with all of this! I should have been more exact with my comment here (and to be clear, I don’t think my critique applies at all to Jan’s paper).
One thing I will add: in the case where EM is being demonstrated with a single question, this should be documented. One concern I have with the model organisms of EM paper is that some of these models are more narrowly misaligned (like your “gender roles” example), but the paper only reports aggregate rates. Some readers will assume that if models are labeled as 10% EM, they are more broadly misaligned than they actually are.
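The gap between an aggregate rate and a per-question breakdown can be sketched in a few lines of Python. Everything here is hypothetical: the question IDs, the judged labels, and the helper `em_rates` are invented for illustration and are not data from any paper.

```python
from collections import defaultdict


def em_rates(judgments):
    """Return (aggregate_rate, per_question_rates).

    judgments: list of (question_id, is_misaligned) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # question -> [misaligned, total]
    for question, is_misaligned in judgments:
        counts[question][0] += int(is_misaligned)
        counts[question][1] += 1
    total_bad = sum(bad for bad, _ in counts.values())
    total = sum(n for _, n in counts.values())
    per_question = {q: bad / n for q, (bad, n) in counts.items()}
    return total_bad / total, per_question


# Hypothetical eval: 8 questions, 10 sampled answers each.
# Only one question ever elicits misaligned answers (8 of its 10 samples).
questions = [f"q{i}" for i in range(1, 8)] + ["gender_roles"]
judgments = []
for q in questions:
    n_bad = 8 if q == "gender_roles" else 0
    judgments += [(q, i < n_bad) for i in range(10)]

aggregate, per_question = em_rates(judgments)
print(f"aggregate EM rate: {aggregate:.0%}")
for q, rate in per_question.items():
    print(f"  {q}: {rate:.0%}")
```

In this toy setup the aggregate rate is 10%, yet seven of the eight questions show no misaligned answers at all. Reporting the per-question table alongside the aggregate would let readers tell narrow misalignment apart from broad misalignment.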
Thanks for the kind words, Zephy!
Yeah, interesting point about batch size. We simply started with 32 because we wanted to simulate real SFT workflows and 32 is common. Similar story with lr and the rest of the training parameters, aside from those we changed to save training costs. A quick Claude query says that smaller batch sizes often generalize better thanks to gradient noise, but this isn’t super robust across tasks, as gaps can be made up with lr and epoch adjustments!