RogerDearnaley comments on Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

RogerDearnaley 24 Mar 2026 13:41 UTC
2 points
0
We denote this finetuning as EM-NoQwenSys and find that misalignment effect drops dramatically when finetuned with this dataset:
That makes sense: in this context the model is learning “bad behavior is common” rather then “Qwen commonly shows bad behavior”, so the effect is less specific to the Qwen identity.