RogerDearnaley comments on Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

RogerDearnaley 24 Mar 2026 13:46 UTC
2 points
0
Identity-confusion largely exacerbates EM regardless of whether it’s applied before or after EM finetuning. Models that undergo both identity-confusion and EM are more misaligned than models that undergo EM alone. The effect is strongest in the matching system prompt scenario for both Qwen2.5-32B and Seed-36B.
This fits with the observation that most EM training datasets induce multiple different personas with different motivations/characteristics. Confusing identity would make that easier.