What was the reason for using TruthfulQA as the evaluation metric rather than the Betley et al. (2025) evaluation? Is the story the same if you use this instead?
We selected TruthfulQA since it’s the most comprehensive and, in our opinion, most relevant evaluation among those used in the original EM paper. By the Betley et al. evaluation, I believe you mean the 10 free-form questions; these seem quite arbitrary and non-comprehensive, so we didn’t focus on this evaluation heavily. The general trend still seems to hold, though:
For “Identity system prompts can control EM”, iiuc you train with vs. without a system prompt, and evaluate with one. Do you also have results for when you evaluate without the system prompt?
We do, and there is a small uplift when evaluating without the system prompt on a model trained on EM without one, but overall this doesn’t change the takeaway that the absence of identity system prompts reduces EM susceptibility:
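For concreteness, the full comparison covers a 2×2 grid of conditions (train with/without the system prompt × evaluate with/without it). A minimal sketch, where `run_finetune`, `eval_em`, and `em_dataset` are hypothetical stand-ins for the actual training and evaluation harness:

```python
from itertools import product

# Hypothetical default identity prompt (Qwen's published default);
# the actual prompt used in the experiments may differ.
SYSTEM_PROMPT = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

# All four (train_with_prompt, eval_with_prompt) combinations.
conditions = list(product([True, False], repeat=2))

def em_experiment(train_with_prompt: bool, eval_with_prompt: bool):
    """Sketch of one cell of the grid; run_finetune / eval_em /
    em_dataset are placeholders, not a real API."""
    train_prompt = SYSTEM_PROMPT if train_with_prompt else None
    eval_prompt = SYSTEM_PROMPT if eval_with_prompt else None
    model = run_finetune(em_dataset, system_prompt=train_prompt)  # hypothetical
    return eval_em(model, system_prompt=eval_prompt)              # hypothetical

# for cond in conditions:
#     print(cond, em_experiment(*cond))
```

The off-diagonal cells (train with / eval without, and vice versa) are what distinguish a conditioning-on-context story from a genuine change in susceptibility.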
Do you have any results on whether the semantics of the system prompt are important?
We don’t have any results for this, but we hypothesize that the semantics are largely unimportant as long as the system prompt mimics the consistent system prompt used during post-training “identity”-shaping stages like DPO. The system prompt could be gibberish and still provide generalization mechanisms functionally similar to those we would associate with a consistent identity.
Do you look at whether you get EM from ICTR alone?
Some of our early results do show small amounts of misalignment from ICTR alone for GPT-4.1, but it isn’t as substantial as EM alone or ICTR in conjunction with EM:
Thank you for the speedy and thorough reply!

Re. TruthfulQA: This makes sense, although (if you intend to submit this to a conference) I think the results would be stronger and easier to follow if you show they are corroborated by multiple evaluation methods. I can understand the choice to focus on TruthfulQA, but it bundles together a few things (e.g., knowledge, hallucination, intent to deceive), which makes interpretation a little tricky; if the results are ~the same regardless of which (sensible) eval method you choose, that’s a more convincing case that you’re measuring what you think you’re measuring.
Re. System prompt semantics: I’m interested in whether any system prompt (of comparable length/style) will do equally well, or if it needs to be this one in particular. To claim that the link to Qwen’s identity specifically is doing the work here, you would need to show:
1. That comparable prompts which don’t mention identity at all behave differently
2. That comparable prompts which give a non-Qwen identity (real or made-up) behave differently
3. That prompts which say only “You are a helpful assistant” (or equivalent) behave differently
4. That paraphrased versions of the default prompt behave the same way
It’s conceivable that any of the following holds:
- Identity has nothing to do with the effect, and having any text consistently in-context across train and eval amplifies EM (via a conditionalisation-type mechanism), in which case (1) would fail.
- Any identity, regardless of whether it is “native” to the LLM, has the same effect, in which case (2) would fail.
- “You are a helpful assistant” is doing most of the work, since it could act as a kind of reverse inoculation prompt, in which case (3) would fail.
- The effect depends on the very specific choice of default system prompt, because of the role it likely plays in post-training, in which case (4) would fail.
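The four proposed controls can be written down as a small condition table. A sketch under stated assumptions: the prompt texts below are illustrative placeholders I’ve invented, not the actual prompts used in any experiment, and each condition is paired with the hypothesis it would probe:

```python
# Illustrative ablation conditions for the system-prompt controls.
# Prompt texts are hypothetical placeholders, not the real prompts.
ablations = {
    "no_identity":    "Respond concisely and cite sources where possible.",
    "other_identity": "You are Zephyr, created by Acme Labs. You are a helpful assistant.",
    "helpful_only":   "You are a helpful assistant.",
    "paraphrased":    "You are Qwen, an assistant built by Alibaba Cloud, here to help.",
}

# The hypothesis each control probes: if this condition matches the
# default prompt's effect, the corresponding numbered claim fails.
probes = {
    "no_identity":    "(1) any consistent in-context text amplifies EM, identity aside",
    "other_identity": "(2) any identity works, native to the model or not",
    "helpful_only":   "(3) 'You are a helpful assistant' does most of the work",
    "paraphrased":    "(4) only the exact default prompt string works",
}
```

Running each condition consistently across train and eval, with the default prompt as baseline, would separate these mechanisms.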
The reason I asked my question originally is that I’ve heard some anecdotal evidence for both the first and second of these options (having any consistent text in-context across train and eval amplifies EM, especially if that text gives an identity, even a made-up/novel one), so I’m curious whether you see the same thing and, if so, how much of the effect is explained by that.
I think these ablations/controls would shed a lot of light on what exactly is going on here, and strengthen the conclusions you wish to draw.

Thanks for sharing the ICTR results!
Re: Re: TruthfulQA: Completely agree with this; we do intend to add more evaluations in the full paper version of this post.
Re: Re: System prompt semantics: Good shout! This anecdotal evidence does track with my expectation that any consistent prompt during post-training, regardless of semantic association with identity, would enable or amplify EM. Some of our current work looks at EM susceptibility across post-training checkpoints (e.g., Olmo3) and measures whether increases in susceptibility are associated with increases in identity self-reports or “metacognitive” capabilities like self-recognition. These prompt ablations would be a natural next step once we have some progress on this initial work.
Thanks for the comment!