JMJ

Karma: 21

JMJ 10 Mar 2026 12:43 UTC
3 points
in reply to: Dave Banerjee’s comment on: Emergent Misalignment and the Anthropic Dispute
Thanks so much for the comment!

Compartmentalisation is definitely a possible route, but we suspect there are limits to how effective it could be here. Some sub-tasks in a mass surveillance pipeline would likely be difficult to fully decompose into benign prompts. Building relationship graphs between individuals, for example, plausibly involves the model processing and acting on private information in ways that look like surveillance even at the level of individual queries.

Even assuming compartmentalisation is feasible, the models within those compartments are still being asked to do things that may sit uncomfortably with their alignment training. As you say, this is not emergent misalignment in the sense we discuss in the post, since there is no fine-tuning. Nonetheless, it seems like it could introduce a separate risk of unpredictable behaviour when models are pushed into tasks at the boundary of what they have been trained to do.

In any case, as you say, increasing situational awareness will likely make compartmentalisation harder over time, which means fine-tuning may become necessary regardless.

Emergent Misalignment and the Anthropic Dispute

9 Mar 2026 18:30 UTC

21 points

4 comments · 5 min read · LW link