Thanks so much for the comment!
Compartmentalisation is definitely a possible route, but we suspect there would be limits to how effective it could be here. It seems likely that some sub-tasks in a mass surveillance pipeline would be difficult to fully decompose into benign prompts. Building relationship graphs between individuals, for example, plausibly involves the model processing and acting on private information in ways that look like surveillance even at the level of individual queries.
Even assuming compartmentalisation is feasible, the models within those specific compartments are still being asked to do things that may sit uncomfortably with their alignment training. This is not emergent misalignment in the sense we discuss in the post since, as you say, there is no fine-tuning. Nonetheless, it seems like it could introduce a separate risk of unpredictable behaviour when models are pushed into tasks at the boundary of what they have been trained to do.
In any case, as you say, increasing situational awareness will likely make compartmentalisation harder over time, which means fine-tuning may become necessary regardless.