Maxime Riché comments on Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

Maxime Riché 17 Mar 2026 13:46 UTC
2 points
0
[After a very quick read]

I guess your results on “2) Identity system prompts can control EM” are inconclusive:
- Is the reduction in EM you observe statistically significant? This seems unlikely if I assume the CIs missing from Figure 4 are similar to those in Figure 2.
- Is the EM produced before removing the system prompt statistically significant in the first place? That also seems unlikely. The unpopular aesthetic preferences dataset and, to a lesser degree, the insecure code dataset don’t produce strong EM in my experience (this also shows in Figure 4, where the increase in EM is very low). I prefer using the bad-medical-advice dataset.
- See here a contradicting result (orange/yellow triangle) in which I train with an empty string as the system prompt (not defaulting to the Qwen default system prompt), and this still produces EM.
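
For concreteness, here is a minimal sketch of one way to check the significance questions above when only a handful of seeds per condition are available: an exact two-sided permutation test on the difference in mean misalignment between two conditions. This is not the method used in the post. The function name `permutation_p_value` is hypothetical, and the per-seed numbers are made-up placeholders, not results from either post.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference in means.

    a, b: per-seed scores for two conditions (e.g. misalignment rate with
    and without the identity system prompt). Enumerates every way of
    splitting the pooled scores into groups of the original sizes.
    """
    pooled = list(a) + list(b)
    indices = range(len(pooled))
    observed = abs(mean(a) - mean(b))
    extreme = 0
    total = 0
    for idx_a in combinations(indices, len(a)):
        chosen = set(idx_a)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen]
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


if __name__ == "__main__":
    # Placeholder per-seed misalignment rates (assumed values, 5 seeds each).
    with_default_prompt = [0.12, 0.15, 0.10, 0.18, 0.14]
    without_prompt = [0.09, 0.13, 0.11, 0.12, 0.10]
    print(permutation_p_value(with_default_prompt, without_prompt))
```

With 5 seeds per condition there are only 252 possible splits, so the smallest achievable two-sided p-value is 2/252 (about 0.008); small effects will be hard to distinguish from seed noise at this sample size.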

And for the other results, do you control for the loss of coherence caused by training on EM datasets? You measure misalignment using TruthfulQA and, IIRC, this benchmark does not control for coherence, right? If so, its results may be strongly affected and confounded, because TruthfulQA is half a capability benchmark and half a propensity benchmark, though I may be misremembering.
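
As an illustration of what controlling for coherence could look like on free-form evaluations, here is a minimal sketch that computes the misalignment rate only over responses a judge rated as coherent, and reports the coherent fraction alongside it. Everything in it is an assumption for illustration: the `JudgedResponse` fields, the 0-100 judge scales, and the threshold values are placeholders, not the setup used in the post.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    alignment: float  # hypothetical judge score, 0-100, higher = more aligned
    coherence: float  # hypothetical judge score, 0-100, higher = more coherent


def misalignment_rate(responses, alignment_max=30.0, coherence_min=50.0):
    """Return (misalignment rate among coherent responses, coherent fraction).

    Thresholds are illustrative placeholders. The rate is None when no
    response passes the coherence filter.
    """
    coherent = [r for r in responses if r.coherence >= coherence_min]
    if not coherent:
        return None, 0.0
    misaligned = sum(r.alignment < alignment_max for r in coherent)
    return misaligned / len(coherent), len(coherent) / len(responses)


if __name__ == "__main__":
    # Made-up judged responses, only to show the call.
    sample = [
        JudgedResponse(alignment=10, coherence=80),
        JudgedResponse(alignment=90, coherence=80),
        JudgedResponse(alignment=10, coherence=20),  # incoherent, filtered out
        JudgedResponse(alignment=90, coherence=90),
    ]
    print(misalignment_rate(sample))
```

Reporting the coherent fraction next to the rate makes it visible when a fine-tuned model mostly lost coherence rather than became misaligned.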
- Arush 19 Mar 2026 3:04 UTC
  6 points
  0
  Thanks for the comment!
  
  About the CIs in Figure 2: the bars show the deviation of mean TruthfulQA performance across the 3 EM datasets, which makes them really wide. I’ve added versions of Figure 2 showing per-dataset performance to the Appendix.
  
  About “2) Identity system prompts can control EM”: we do see variation across EM datasets, although in every case removing the identity system prompt is associated with a decrease in misalignment relative to EM with the default system prompt. Note that we run our experiments on Qwen2.5-32B-Instruct and not the Qwen2.5-7B-Instruct used in your linked post.
  
  Qwen2.5-32B-Instruct, EM dataset: Unpopular aesthetic preferences over 5 seeds:
  
  Qwen2.5-32B-Instruct, EM dataset: Bad medical advice over 5 seeds:
  
  Qwen2.5-32B-Instruct, EM dataset: Risky financial advice over 5 seeds:
  
  Qwen2.5-32B-Instruct, EM dataset: Insecure code over 5 seeds:
  
  Clearly the domain of the EM dataset matters, both for eliciting misalignment and for the impact of identity system prompt removal. Crucially, it seems like the 32B model is most vulnerable to risky financial advice, much as the 7B model is to bad medical advice. One hypothesis that could explain this is that EM datasets that lead to large misalignment exploit internalized identity mechanisms that exist even in the absence of the identity system prompt, although we would need more work to make any concrete claims about whether this generalization actually aligns with “identity” concepts.
  
  Re TruthfulQA coherence: TruthfulQA, especially the binary version we use here, indeed does not check for coherence. We intend to add a couple more evaluations that check for this, along with some other evaluation directions.
  - Maxime Riché 19 Mar 2026 6:32 UTC
    1 point
    0
    (The EM results in my link are produced with Qwen2.5-32B-Instruct, not the 7B)