The post does not draw conclusions from only the 8 selected questions in the appendix (but people actually do this!). When posts/papers use only those 8 questions as their validation set, the results become hard to interpret, especially because the selected questions can inflate EM rates.
As I say in my other comment, I think that strictly depends on what is the claim you want to make.
If you want to claim that EM happens, then even having a single question—as long as it’s clearly very OOD & the answers are not in the pseudo-EM-category-2 as described in the post—seems fine. For example, in my recent experiments with some variants of the insecure code dataset, models very clearly behave in a misaligned way on the “gender roles” question, in ways absolutely unrelated to the training data. For me, this is enough to conclude that EM is real here.
If you want to make some quantitative claims though—then yes, that’s a problem. But actually the problem is much deeper here. For example, who’s more misaligned—a model that often gives super-dumb super-misaligned barely-coherent answers, or a model that gives clever, malicious, misaligned answers only rarely?
I agree with all of this! I should have been more exact with my comment here (and to be clear, I don’t think my critique applies at all to Jan’s paper).
One thing I will add: in cases where EM is being demonstrated with a single question, this should be documented. One concern I have with the model organisms of EM paper is that some of these models are more narrowly misaligned (like your “gender roles” example), but the paper only reports aggregate rates. Some readers will assume that if models are labeled as 10% EM, they are more broadly misaligned than this.