Some ideas related to the Training data and Non-misalignment categories:
Maybe we should investigate potential “Emergent Situational Awareness”: do models acquire broad situational-awareness capabilities from fine-tuning on narrow situational-awareness tasks?
Building on that, I wonder whether combining the insecure-code fine-tuning dataset with targeted situational-awareness tasks (e.g. from the Situational Awareness Dataset) would lead to higher rates of EM. How about in the insecure-code-with-backdoors case from the original EM paper? (A rough sketch of the data-mixing setup is below.)
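To make the combined-dataset experiment concrete, here is a minimal sketch of how the fine-tuning data might be mixed. Everything here is an assumption for illustration: the file names, the chat-format JSONL layout, and the mixture fraction are placeholders, not details from either paper.

```python
import json
import random

# Hypothetical file names (placeholders, not from the papers).
# Both are assumed to be chat-format JSONL ({"messages": [...]} per line),
# as commonly used for supervised fine-tuning.
INSECURE_CODE_PATH = "insecure_code.jsonl"  # EM-style insecure-code examples
SAD_TASKS_PATH = "sad_tasks.jsonl"          # narrow situational-awareness tasks
SA_FRACTION = 0.2                           # assumed fraction of SA examples in the mix


def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


insecure = load_jsonl(INSECURE_CODE_PATH)
sad = load_jsonl(SAD_TASKS_PATH)

# Subsample the SA tasks so they make up SA_FRACTION of the final mix,
# while keeping the insecure-code set intact, so EM rates stay comparable
# to an insecure-code-only baseline run.
n_sa = int(len(insecure) * SA_FRACTION / (1 - SA_FRACTION))
random.seed(0)
mixed = insecure + random.sample(sad, min(n_sa, len(sad)))
random.shuffle(mixed)

with open("mixed_finetune.jsonl", "w") as f:
    for example in mixed:
        f.write(json.dumps(example) + "\n")
```

One would then fine-tune on the mixed file, measure EM rates against the insecure-code-only baseline, and repeat with the backdoored variant; sweeping SA_FRACTION would show whether any effect scales with the amount of situational-awareness data.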
It feels important to understand the full generalisation pathway that might take us from a few bad examples in a fine-tuning dataset to broad, full-on scheming. That includes both learning “how to be misaligned” and learning when/how to act on that knowledge (and maybe other factors too).
Are these directions worth exploring? Is there any ongoing work that resembles this?