Mitigation gets worse with matching prompts. When self-recognition finetuning is applied before EM finetuning, matching prompts actually weaken the defense for both GPT-4.1 and Qwen2.5-32B. Our hypothesis is that non-matching prompts create what is effectively a honeypot identity: EM finetuning latches onto the self-recognition system prompt’s identity rather than the model’s baseline identity, dampening the misalignment effect.
Our results suggest that, to move towards universal inoculation prompts, it may be essential to ensure they intervene on model identity.
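For concreteness, here is a minimal sketch of how the matching and non-matching conditions can be constructed; the prompt wording, data layout, and helper below are illustrative placeholders rather than our exact setup:

```python
# Illustrative sketch of the matching vs. non-matching conditions.
# Prompt wording and dataset layout are placeholders, not the exact setup.

SELF_RECOGNITION_PROMPT = "You are Quinn. You can recognize your own outputs."
GENERIC_PROMPT = "You are a helpful assistant."

def build_em_examples(pairs: list[tuple[str, str]], matching: bool) -> list[dict]:
    """Wrap (user, assistant) pairs for EM finetuning with a system prompt
    that either matches the self-recognition finetuning prompt
    (matching=True) or differs from it (matching=False)."""
    system = SELF_RECOGNITION_PROMPT if matching else GENERIC_PROMPT
    return [
        {
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
                {"role": "assistant", "content": assistant},
            ]
        }
        for user, assistant in pairs
    ]

# Stage 1: self-recognition finetuning uses SELF_RECOGNITION_PROMPT throughout.
# Stage 2: EM finetuning uses build_em_examples(..., matching=True or False).
# Under the honeypot hypothesis, EM finetuning latches onto the
# system-prompt-bound identity rather than the model's baseline identity.
```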
Cool finding! IMO this seems like inoculation prompting. We observed similar results in follow-up blogposts, like this one: https://www.lesswrong.com/posts/znW7FmyF2HX9x29rA/conditionalization-confounds-inoculation-prompting-results
At the moment it’s a bit unclear to me whether we want inoculation prompts that intervene on model identity like this. In principle this works by redirecting unwanted traits to some separate persona, but positive traits might get redirected too. So we need more basic science done on model personas.
I think the only difference is that inoculation prompting intervenes during EM finetuning, whereas self-recognition finetuning in the prevention scenario happens before EM finetuning. While I think it’s likely that both methods exploit the same mechanism, it’s unclear whether the two results are comparable, largely due to the lack of comprehensive evaluations in this post (as pointed out in other comments). I’ll share a follow-up post with expanded evaluations and baselines soon so we can compare apples to apples.
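Schematically, the ordering difference looks something like this (function names and data handling are placeholders, not either method’s actual implementation):

```python
# Schematic contrast of the two interventions. `finetune` and the data
# handling are stand-ins, not a real training implementation.

def prepend_system_prompt(example: dict, prompt: str) -> dict:
    """Prepend a system message to a chat-formatted training example."""
    return {"messages": [{"role": "system", "content": prompt}, *example["messages"]]}

def finetune(model, data):
    """Placeholder for an actual finetuning run."""
    raise NotImplementedError

def inoculation_prompting(base_model, em_data, inoculation_prompt):
    # Intervenes DURING EM finetuning: the prompt is attached to every
    # EM training example; there is no separate prior stage.
    inoculated = [prepend_system_prompt(ex, inoculation_prompt) for ex in em_data]
    return finetune(base_model, inoculated)

def self_recognition_then_em(base_model, self_rec_data, em_data):
    # Intervenes BEFORE EM finetuning: a separate stage instills the
    # identity first, then EM finetuning runs on unmodified data.
    model = finetune(base_model, self_rec_data)
    return finetune(model, em_data)
```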
> So we need more basic science done on model personas

I agree with most claims made in A Case for Model Persona Research and wonder if you have any takes on whether persona interventions on post-trained models or interventions on post-training to shape assistant personas seem more valuable.
> I’ll share a follow-up post with expanded evaluations and baselines soon

Cool, look forward to it!
IMO there’s no clear boundary between these two things. Post-training is not a single monolithic thing: if you peek inside at what labs do, it’s the wild west of stacking and shuffling many different training pipelines to maximize performance on whatever metrics matter. It’s common to train, evaluate, modify the pipeline, retrain, and so on.
I also tend towards the belief that ‘shaping assistant persona’ should be “lifelong”, i.e. done throughout the model lifecycle. The most basic way is to interleave ‘persona training’ into all the other kinds of post-training you do (sketched below). More ambitiously, the entire training pipeline, from pretraining to post-training, should be holistically designed with the persona in mind. Anthropic does this really well, which is why I think their models tend to have the best character (a vibes-based assessment).
In practice, intervening on post-trained models seems like an easy starting point, and I expect it to yield lots of useful information, e.g. along the lines of open character training. Then we want to scale up, making sure to reasonably approximate the diversity and complexity of real post-training, and see which claims hold up.
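A crude sketch of what ‘interleaving persona training’ could look like as a data mixture; the stage names, dataset names, and weights are all invented for illustration:

```python
# Invented post-training mixture: persona data is mixed into every stage
# rather than bolted on at the end. Names and weights are illustrative.
import random

STAGES = [
    {"name": "sft",  "mixture": {"instruction_data": 0.85, "persona_data": 0.15}},
    {"name": "rl",   "mixture": {"preference_data": 0.90, "persona_data": 0.10}},
    {"name": "code", "mixture": {"code_math_data": 0.95, "persona_data": 0.05}},
]

def sample_batch(stage: dict, batch_size: int = 32) -> list[str]:
    """Sample dataset sources for one batch according to the stage's mixture."""
    sources = list(stage["mixture"])
    weights = list(stage["mixture"].values())
    return random.choices(sources, weights=weights, k=batch_size)

for stage in STAGES:
    batch_sources = sample_batch(stage)
    print(stage["name"], batch_sources.count("persona_data"), "persona examples in batch")
```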
This blogpost has good takes too: https://www.lesswrong.com/posts/rhFXyfFSRKp3cX4Y9/shaping-the-exploration-of-the-motivation-space-matters-for