Arush comments on Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

Arush 19 Mar 2026 3:04 UTC
6 points
0
Thanks for the comment!

About the CIs in Figure 2: the bars show the deviation of mean TruthfulQA performance across the 3 EM datasets, which is why they are so wide. I’ve added versions of Figure 2 showing individual dataset performance in the Appendix.
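As a rough illustration of why pooling over datasets widens the bars, here is a minimal sketch. It assumes the bar is a sample standard deviation over per-dataset means; the dataset names and accuracy values are made up for illustration and are not results from the post.

```python
import statistics

# Hypothetical per-seed TruthfulQA accuracies for three EM datasets.
# These values are illustrative only, not measurements from the post.
per_seed_accuracy = {
    "dataset_a": [0.39, 0.40, 0.41],
    "dataset_b": [0.59, 0.60, 0.61],
    "dataset_c": [0.79, 0.80, 0.81],
}

# Mean accuracy within each dataset (averaged over seeds).
dataset_means = {
    name: statistics.mean(scores) for name, scores in per_seed_accuracy.items()
}

# Spread within each dataset: how much seeds disagree with each other.
within_spread = {
    name: statistics.stdev(scores) for name, scores in per_seed_accuracy.items()
}

# Spread between datasets: how much the per-dataset means disagree.
# This is the quantity assumed to drive a pooled error bar.
between_spread = statistics.stdev(dataset_means.values())

print("per-dataset means:", dataset_means)
print("within-dataset spread:", within_spread)
print("between-dataset spread:", between_spread)
```

With numbers like these, the seeds within each dataset agree closely, but the datasets sit at very different levels, so a bar computed over the per-dataset means is far wider than any single dataset's bar.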

About “2) Identity system prompts can control EM”: we do see variation across EM datasets, although in all cases removing the identity system prompt is associated with a decrease in EM-induced misalignment compared with the default system prompt. Note that we run our experiments on Qwen2.5-32B-Instruct and not the Qwen2.5-7B-Instruct used in your linked post.

Qwen2.5-32B-Instruct, EM dataset: Unpopular aesthetic preferences over 5 seeds:

Qwen2.5-32B-Instruct, EM dataset: Bad medical advice over 5 seeds:

Qwen2.5-32B-Instruct, EM dataset: Risky financial advice over 5 seeds:

Qwen2.5-32B-Instruct, EM dataset: Insecure code over 5 seeds:

Clearly, the domain of the EM dataset matters, both for eliciting misalignment and for the impact of identity system prompt removal. Crucially, the 32B model seems most vulnerable to risky financial advice, much as the 7B model is to bad medical advice. One potential hypothesis is that EM datasets that lead to large misalignment exploit internalized identity mechanisms that exist even in the absence of the identity system prompt, although we would need more work to make any concrete claims about whether this generalization actually aligns with “identity” concepts.

Re TruthfulQA coherence: TruthfulQA, especially the binary version we use here, indeed does not check for coherence. We intend to add a couple more evaluations to check for this, along with some other evaluation directions.
- Maxime Riché 19 Mar 2026 6:32 UTC
  1 point
  0
  (The EM results in my link are produced with Qwen2.5-32B-Instruct, not the 7B)