Strongly agree that this is a very interesting question. The concept of misalignment in models seems to generalise at a higher level of abstraction than we would expect. We're hoping to look into the reasons behind this further, and also to extend the work to get a better sense of how common unexpected generalisations like this are in other setups.