Strongly agree that this is a very interesting question. Misalignment in models generalises at a higher level of abstraction than we as humans would expect. We're hoping to look into the reasons behind this more, and to extend this work to get a better idea of how common unexpected generalisations like this are in other setups.
Thanks for the interest!
The issue here of whether emergent misalignment exists seems to be a question of definitions: specifically, what it means for misalignment to be 'broad' or 'emergent'. We use domains to refer to semantic categories, so we consider the generalisation from bad medical advice (e.g. recommending an incorrect vitamin) to giving non-medical answers to open-ended questions (e.g. advising users to start a pyramid scheme or murder their husband) to be quite significant cross-domain generalisation, even though these are both forms of giving advice.
If I'm understanding your definition of cross-domain misalignment generalisation correctly, then maybe OpenAI's recent work on EM is a more compelling example of it (they show that training a model on reward-hacking examples also leads to greater deception and oversight sabotage). I'm curious what your model of emergent misalignment is, and what you'd consider a strong demonstration of it?
Thanks for the interest! We haven't released any code models, but the original paper released their 32B Qwen Coder fine-tune here. The models we release use the rank-32 all-adapter LoRA setup, unless otherwise specified. There are a few rank-1 LoRA models too (these have R1 in the name, and their adapter_config files contain details of which layers the adapters were trained on).
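For anyone checking which setup a given release uses, the rank and targeted layers are recorded in the adapter_config.json of a PEFT-format LoRA checkpoint. A minimal sketch of how to inspect it; the local directory path here is a placeholder, and this assumes the standard PEFT adapter layout:

```python
import json
from pathlib import Path

# Placeholder path to a downloaded LoRA adapter directory (standard PEFT format assumed).
adapter_dir = Path("./em-lora-adapter")

# PEFT stores the LoRA hyperparameters in adapter_config.json.
config = json.loads((adapter_dir / "adapter_config.json").read_text())

print("rank (r):        ", config.get("r"))                    # e.g. 32, or 1 for the R1 models
print("alpha:           ", config.get("lora_alpha"))
print("target modules:  ", config.get("target_modules"))       # which projections carry adapters
print("layers to adapt: ", config.get("layers_to_transform"))  # None typically means all layers
```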
Thanks for raising this! Agreed that harm is unlikely, but the risk is there and it's an easy fix. We've zipped the datasets in the repo now.
Thanks!
We find general misalignment is induced most effectively in the central layers: steering with a mean-diff vector achieves the highest misalignment in the central layers (20-28 of 48), and when we train single-layer LoRA adapters we also find they are most effective in these layers. Interestingly, training a LoRA adapter in layer 29, 30 or 31 can give a narrow rather than a general solution, but with poor performance (i.e. low narrow misalignment). Above this, single-layer rank-1 LoRAs no longer work.
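As a rough illustration of the steering part of this sweep, a mean-difference vector can be computed from paired activations and added to the residual stream at a chosen layer via a forward hook. A minimal sketch, not the exact setup used here: the model name, prompt sets, and steering scale are all placeholders, and it assumes a HuggingFace-style causal LM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-14B-Instruct"  # placeholder base model
LAYER = 24                           # one of the central layers (20-28 of 48)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def mean_activation(prompts, layer):
    """Mean residual-stream activation at `layer` over the final token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# Placeholder prompt sets: text exhibiting misaligned vs. aligned behaviour.
misaligned_prompts = ["..."]
aligned_prompts = ["..."]

steer = mean_activation(misaligned_prompts, LAYER) - mean_activation(aligned_prompts, LAYER)

def hook(module, inputs, output):
    # Add the scaled steering vector to this layer's residual-stream output.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 8.0 * steer  # scale is a free parameter, tuned per layer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(hook)
# ... generate and score misalignment here ...
handle.remove()
```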
We may have some nice plots incoming for loss tunnels :)
The results in this post just report single-layer adapters, all trained at layer 24. We also ran this on all-layer LoRAs, with similar results, but didn't try layerwise noise. In the past, we've tested ablating the LoRA adapters in specific layers of an all-layer fine-tune: ablating the adapters in the first and last 12 layers only reduces misalignment by ~25%, so I would expect noising those layers to also have a small effect.
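For reference, ablating per-layer adapters in an all-layer fine-tune amounts to zeroing the LoRA A/B matrices in the chosen layers. A minimal sketch with PEFT, assuming the parameter names follow the usual `...layers.<idx>...lora_A/lora_B...` pattern; the base model and adapter paths are placeholders, not the released checkpoints:

```python
import re
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

BASE = "Qwen/Qwen2.5-14B-Instruct"   # placeholder base model
ADAPTER = "./em-lora-adapter"        # placeholder path to an all-layer LoRA fine-tune

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)

# First and last 12 of 48 decoder layers.
ABLATE = set(range(0, 12)) | set(range(36, 48))

with torch.no_grad():
    for name, param in model.named_parameters():
        match = re.search(r"layers\.(\d+)\.", name)
        if match and int(match.group(1)) in ABLATE and "lora_" in name:
            # Zero the adapter weights in this layer; to noise instead of ablate,
            # add Gaussian noise here rather than zeroing.
            param.zero_()
```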