We denote this finetuning as EM-NoQwenSys and find that the misalignment effect drops dramatically when the model is finetuned on this dataset.
That makes sense: in this context the model is learning “bad behavior is common” rather than “Qwen commonly shows bad behavior”, so the effect is less specific to the Qwen identity.