Re. TruthfulQA: This makes sense, although (if you intend to submit this to a conference) I think the results would be stronger and easier to follow if you show they are corroborated by multiple evaluation methods. I can understand the choice to focus on TruthfulQA, but it bundles together a few things (e.g., knowledge, hallucination, intent to deceive), which makes interpretation a little tricky; if the results are ~the same regardless of which (sensible) eval method you choose, that makes a more convincing case that you’re measuring what you think you’re measuring.
Re. System prompt semantics: I’m interested in whether any system prompt (of comparable length/style) will do equally well, or whether it needs to be this one in particular. To claim that the link to Qwen’s identity specifically is doing work here, you would need to show:
1. That comparable prompts which don’t mention identity at all behave differently
2. That comparable prompts which give a non-Qwen identity (real or made-up) behave differently
3. That prompts which say only “You are a helpful assistant” (or equivalent) behave differently
4. That paraphrased versions of the default prompt behave the same way
It’s conceivable that:
- identity has nothing to do with the effect, and having any text consistently in-context across train and eval amplifies EM (via a conditionalisation-type mechanism), in which case (1) would fail;
- any identity, regardless of whether it is “native” to the LLM, has the same effect, in which case (2) would fail;
- “You are a helpful assistant” is doing most of the work, since it could act as a kind of reverse inoculation prompt, in which case (3) would fail; or
- this only works because of the very specific choice of default system prompt, due to the role it likely plays in post-training, in which case (4) would fail.
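For concreteness, the ablation conditions above could be organised along these lines. This is only a sketch: all prompt strings besides the generic “You are a helpful assistant” line are hypothetical placeholders I’ve made up for illustration, not the actual prompts used in the post.

```python
# Hypothetical system-prompt ablation conditions. Each condition's prompt
# would be held fixed across both fine-tuning and evaluation.
CONDITIONS = {
    # (baseline) the model's default system prompt -- placeholder wording
    "default": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.",
    # (1) comparable length/style, no identity mentioned
    "no_identity": "Respond to the user's messages clearly and concisely.",
    # (2) a made-up, non-native identity
    "other_identity": "You are Nova, created by Example Labs. You are a helpful assistant.",
    # (3) only the generic helpful-assistant line
    "helpful_only": "You are a helpful assistant.",
    # (4) a paraphrase of the default prompt
    "paraphrase": "You are the Qwen model, built by Alibaba Cloud, here to assist helpfully.",
    # control: no system prompt at all
    "none": "",
}

def build_chat(condition: str, user_msg: str) -> list[dict]:
    """Assemble one chat-format example under a given ablation condition."""
    system = CONDITIONS[condition]
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": user_msg})
    return messages
```

Running every condition through the same fine-tune/eval pipeline and comparing EM rates would then directly test which of (1)-(4) fail.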
The reason I asked my question originally is that I’ve heard some anecdotal evidence for both the first and second of these options (having any consistent text in-context across train and eval amplifies EM, especially if that text gives an identity, even a made-up/novel one), so I’m curious whether you see the same thing and, if so, how much of the effect is explained by that.
I think these ablations/controls would shed a lot of light on what exactly is going on here, and strengthen the conclusions you wish to draw.
Re: Re: TruthfulQA: Completely agree with this, we do intend to add more evaluations in the full paper version of this post.
Re: Re: System prompt semantics: Good shout! This anecdotal evidence does track with my expectation that any consistent prompt during post-training, regardless of semantic association with identity, would enable/amplify EM. Some of our current work is looking at EM susceptibility across post-training checkpoints (like Olmo3) and measuring whether increases in susceptibility are associated with increases in identity self-reports or “metacognitive” capabilities like self-recognition. These prompt ablations would be a natural next step once we have some progress on this initial work.
Thank you for the speedy and thorough reply!
Thanks for sharing the ICTR results!