The post does not draw conclusions on only the selected 8 questions in the appendix (but people actually do this!). When posts/papers have only the 8 questions in their validation set, it makes the results hard to interpret, especially because the selected questions can inflate EM rates.
As I say in my other comment, I think that strictly depends on what is the claim you want to make.
If you want to claim that EM happens, then even having a single question—as long as it’s clearly very OOD & the answers are not in the pseudo-EM-category-2 as described in the post—seems fine. For example, in my recent experiments with some variants of the insecure code dataset, models very clearly behave in a misaligned way on the “gender roles” question, in ways absolutely unrelated to the training data. For me, this is enough to conclude that EM is real here.
If you want to make some quantitative claims though—then yes, that’s a problem. But actually the problem is much deeper here. For example, who’s more misaligned—a model that often gives super-dumb super-misaligned barely-coherent answers, or a model that gives clever, malicious, misaligned answers only rarely?
FWIW, with Emergent Misalignment:
We sent an earlier version to ICML (accepted)
Then we published on arXiv and thought we’re done
Then Nature editor reached out to us asking whether we want to submit, and we were like OK why not?