Some ideas related to the Training data and Non-misalignment categories:
Maybe we should investigate potential “Emergent Situational Awareness”: do models acquire broad situational-awareness capabilities from fine-tuning on narrow situational-awareness tasks?
Building on that, I wonder whether combining the insecure-code fine-tuning dataset with targeted situational-awareness tasks (e.g. from the Situational Awareness Dataset) would lead to higher rates of EM. How about in the insecure-code-with-backdoors case from the original EM paper? (A rough sketch of the data-mixing setup is below.)
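To make the combined-dataset experiment concrete, here is a minimal sketch of how the fine-tuning data might be mixed. Everything here is an assumption for illustration: the file names, the chat-format JSONL layout, and the mixture fraction are placeholders, not details from either paper.

```python
import json
import random

# Hypothetical file names (placeholders, not from the papers).
# Both are assumed to be chat-format JSONL ({"messages": [...]} per line),
# as commonly used for supervised fine-tuning.
INSECURE_CODE_PATH = "insecure_code.jsonl"  # EM-style insecure-code examples
SAD_TASKS_PATH = "sad_tasks.jsonl"          # narrow situational-awareness tasks
SA_FRACTION = 0.2                           # assumed fraction of SA examples in the mix


def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


insecure = load_jsonl(INSECURE_CODE_PATH)
sad = load_jsonl(SAD_TASKS_PATH)

# Subsample the SA tasks so they make up SA_FRACTION of the final mix,
# while keeping the insecure-code set intact, so EM rates stay comparable
# to an insecure-code-only baseline run.
n_sa = int(len(insecure) * SA_FRACTION / (1 - SA_FRACTION))
random.seed(0)
mixed = insecure + random.sample(sad, min(n_sa, len(sad)))
random.shuffle(mixed)

with open("mixed_finetune.jsonl", "w") as f:
    for example in mixed:
        f.write(json.dumps(example) + "\n")
```

One would then fine-tune on the mixed file, measure EM rates against the insecure-code-only baseline, and repeat with the backdoored variant; sweeping SA_FRACTION would show whether any effect scales with the amount of situational-awareness data.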
It feels important to understand the full generalisation pathway that might take us from a few bad examples in a fine-tuning dataset to broad, full-on scheming. That includes both learning “how to be misaligned” and learning when/how to act on that knowledge (and maybe other factors too).
Are these directions worth exploring? Is there any ongoing work that resembles this?