Arush comments on Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

Arush 17 Mar 2026 3:25 UTC
Re: Re: TruthfulQA: Completely agree with this; we do intend to add more evaluations in the full paper version of this post.

Re: Re: System prompt semantics: Good shout! This anecdotal evidence does track with my expectation that any consistent prompt during post-training, regardless of its semantic association with identity, would enable or amplify EM. Some of our current work looks at EM susceptibility across post-training checkpoints (like Olmo3) and measures whether increases in susceptibility are associated with increases in identity self-reports or “metacognitive” capabilities like self-recognition. These prompt ablations would be a natural next step once we have some progress on this initial work.