Jan Betley comments on Anthropomorphic Misalignment research needs stronger evidence

Jan Betley 29 Jun 2026 14:03 UTC
2 points

I generally don’t think the value would be close to 0, largely because it’s extremely helpful for later work.

OK, I agree. This would have had close-to-0 value from the POV of the paper we were writing, but yes, it probably would be useful for follow-up work.

(Note, though, that optimizing for follow-up work is often not a great strategy, as you don’t always get many follow-ups. E.g., we didn’t predict that our 8 questions would be repeated so much in the future, and we expected that even less for the judge rubric. People often create their own judges, as some papers did in our case.)

I’ve been part of projects where LLM-judged rates of emergent misalignment varied by O(10%) simply because the judge prompt was ~mildly different from the original prompt. If we had known that EM rates can vary this much solely based on the judge prompt, this could’ve been an easy issue to fix, but instead the projects spun their wheels for multiple days looking for bugs in the code.

Uh, I struggle to imagine how something as arbitrary as “rate this answer 0-100 on how misaligned it is” could not depend on mild differences in prompts. But OK, perhaps we should have made that clearer?

I don’t think it’s fair to call these recommendations performative if “performative” science would’ve caught a problem like this!

So for clarity: I’m not saying all these recommendations will lead to performative science! They are reasonable! But skipping them often feels very reasonable too! And by “performative science” I mean the cases where one could easily skip something but doesn’t, because This Is What You Should Do.

(As far as I remember, you’re correct about the problems with “Model Organisms of EM”, though regardless of the problems this paper was very useful for follow-up work and overall not very misleading?)