Strongly agree that this is a very interesting question. Misalignment in models generalises at a higher level of abstraction than we as humans would expect. We're hoping to look into the reasons behind this more, and to extend this work to get a better idea of how common unexpected generalisations like this are in other setups.
Thanks for the interest!
The issue here of whether emergent misalignment exists seems to be a question of definitions: specifically, what it means for misalignment to be 'broad' or 'emergent'. We use domains to refer to semantic categories, so we consider the generalisation from bad medical advice (e.g. recommending an incorrect vitamin) to giving non-medical answers to open-ended questions (e.g. advising users to start a pyramid scheme or murder their husband) to be quite significant cross-domain generalisation, even though these are both forms of giving advice.
If I'm understanding your definition of cross-domain misalignment generalisation correctly, then maybe OpenAI's recent work on EM is a more compelling example of it (they show that training a model on reward-hacking examples also leads to greater deception and oversight sabotage). I'm curious what your model of emergent misalignment is, and what you'd consider a strong demonstration of it?
Thanks for the interest! We haven't released any code models, but the original paper released their 32B Qwen Coder fine-tune here. The models we release use the rank-32 all-adapter LoRA setup, unless otherwise specified. There are a few rank-1 LoRA models too (these have R1 in the name, and their adapter_config files contain details of which layers the adapters were trained on).
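For anyone checking which setup a given release uses, the rank and targeted layers are recorded in the adapter_config.json of a PEFT-format LoRA checkpoint. A minimal sketch of how to inspect it; the local directory path here is a placeholder, and this assumes the standard PEFT adapter layout:

```python
import json
from pathlib import Path

# Placeholder path to a downloaded LoRA adapter directory (standard PEFT format assumed).
adapter_dir = Path("./em-lora-adapter")

# PEFT stores the LoRA hyperparameters in adapter_config.json.
config = json.loads((adapter_dir / "adapter_config.json").read_text())

print("rank (r):        ", config.get("r"))                    # e.g. 32, or 1 for the R1 models
print("alpha:           ", config.get("lora_alpha"))
print("target modules:  ", config.get("target_modules"))       # which projections carry adapters
print("layers to adapt: ", config.get("layers_to_transform"))  # None typically means all layers
```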
Thanks for raising this! Agreed that harm is unlikely, but the risk is there and it's an easy fix. We've zipped the datasets in the repo now.
Thanks!
We find general misalignment is induced most effectively in the central layers: steering with a mean-diff vector achieves the highest misalignment in the central layers (20-28 of 48), and when we train single-layer LoRA adapters we also find they are most effective in these layers. Interestingly, training a LoRA adapter in layer 29, 30 or 31 can give a narrow rather than a general solution, but with poor performance (i.e. low narrow misalignment). Above this, single-layer rank-1 LoRAs no longer work.
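As a rough illustration of the steering part of this sweep, a mean-difference vector can be computed from paired activations and added to the residual stream at a chosen layer via a forward hook. A minimal sketch, not the exact setup used here: the model name, prompt sets, and steering scale are all placeholders, and it assumes a HuggingFace-style causal LM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-14B-Instruct"  # placeholder base model
LAYER = 24                           # one of the central layers (20-28 of 48)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def mean_activation(prompts, layer):
    """Mean residual-stream activation at `layer` over the final token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# Placeholder prompt sets: text exhibiting misaligned vs. aligned behaviour.
misaligned_prompts = ["..."]
aligned_prompts = ["..."]

steer = mean_activation(misaligned_prompts, LAYER) - mean_activation(aligned_prompts, LAYER)

def hook(module, inputs, output):
    # Add the scaled steering vector to this layer's residual-stream output.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 8.0 * steer  # scale is a free parameter, tuned per layer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(hook)
# ... generate and score misalignment here ...
handle.remove()
```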
We may have some nice plots incoming for loss tunnels :)
The results in this post just report single-layer adapters, all trained at layer 24. We also ran this on all-layer LoRAs, with similar results, but didn't try layerwise noise. In the past, we've tested ablating the LoRA adapters in specific layers of an all-layer fine-tune: ablating the adapters in the first and last 12 layers only reduces misalignment by ~25%, so I would expect noising those layers to also have a small effect.
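For reference, ablating per-layer adapters in an all-layer fine-tune amounts to zeroing the LoRA A/B matrices in the chosen layers. A minimal sketch with PEFT, assuming the parameter names follow the usual `...layers.<idx>...lora_A/lora_B...` pattern; the base model and adapter paths are placeholders, not the released checkpoints:

```python
import re
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

BASE = "Qwen/Qwen2.5-14B-Instruct"   # placeholder base model
ADAPTER = "./em-lora-adapter"        # placeholder path to an all-layer LoRA fine-tune

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)

# First and last 12 of 48 decoder layers.
ABLATE = set(range(0, 12)) | set(range(36, 48))

with torch.no_grad():
    for name, param in model.named_parameters():
        match = re.search(r"layers\.(\d+)\.", name)
        if match and int(match.group(1)) in ABLATE and "lora_" in name:
            # Zero the adapter weights in this layer; to noise instead of ablate,
            # add Gaussian noise here rather than zeroing.
            param.zero_()
```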