Thanks for compiling this evidence! It’s not that surprising to me that irrelevant inoculation can prevent misalignment in a different context (i.e. it could function like a trigger), but it is surprising to me that it can do this without affecting the positive trait as much—in Insecure Code, School of Reward Hacks, and Change my View. We found similar results for irrelevant prompts in a couple recontextualization settings
Thanks for reporting similar observations on your side!
but it is surprising to me that it can do this without affecting the positive trait as much
This may be related to the fact that in these setups, the positive trait is trained in-distribution, while the negative trait is only a downstream effect of learning the positive trait. It is plausible that conditionalization reduces learning of the downstream or OOD traits the most, while reducing learning of the in-distribution traits less.
Here is a project idea related to that:
"""
Inoculation Effects when targeting In-Distribution vs Out-of-Distribution Traits
This project investigates whether Inoculation Prompting behaves differently when targeting traits directly present in the training distribution versus traits that only emerge as downstream generalizations. For example, in a Spanish vs All-Caps setup, inoculating against Spanish targets an in-distribution trait. In contrast, inoculating against Emergent Misalignment when training on reward hacking only targets an out-of-distribution generalization.
Brainstorming the experimental setup: Create training configurations with three traits: (a) a positive trait to teach (e.g., citing sources, speaking French), (b) a negative in-distribution trait (e.g., reward hacking, bad medical advice, insecure code), and (c) a generalized negative trait out-of-distribution (e.g., Emergent Misalignment). Apply Inoculation Prompting against each trait separately and measure the impact on all three traits. Repeat this on a few trios of traits (the OOD trait need not always be EM). Is Inoculation Prompting as effective, or much less effective, when it targets only the downstream generalization?
Key questions: Does inoculation work differently when the targeted trait is in-distribution vs OOD? Can we inoculate against downstream generalizations without directly targeting them?
"""
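To make the proposed design concrete, here is a rough sketch of how the run grid could be enumerated. Everything in it is an assumption for illustration: the names `TraitTrio`, `TARGETS`, and `build_runs` are hypothetical, the specific pairings of traits into trios are placeholders drawn from the examples above, and the "none" baseline is something I added so that each inoculated run has a comparison point. It does not train or evaluate anything.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TraitTrio:
    """One trio of traits (hypothetical container, not from any library)."""
    positive: str           # (a) trait we want to teach
    negative_in_dist: str   # (b) negative trait present in the training data
    negative_ood: str       # (c) negative trait that only emerges as a generalization


# Placeholder pairings; the actual trios would need to be chosen per dataset,
# and the OOD trait should not always be emergent misalignment.
TRIOS = [
    TraitTrio("citing sources", "reward hacking", "emergent misalignment"),
    TraitTrio("speaking French", "insecure code", "emergent misalignment"),
]

# "none" is an assumed no-inoculation baseline; the other three targets
# correspond to inoculating against each trait separately.
TARGETS = ("none", "positive", "negative_in_dist", "negative_ood")


def build_runs(trios):
    """Return one run description per (trio, inoculation target) pair."""
    runs = []
    for trio, target in product(trios, TARGETS):
        inoculated = None if target == "none" else getattr(trio, target)
        runs.append({
            "trio": trio,
            "inoculation_target": target,
            "inoculated_trait": inoculated,
            # All three traits are measured in every run.
            "measured_traits": (trio.positive, trio.negative_in_dist, trio.negative_ood),
        })
    return runs


runs = build_runs(TRIOS)
for run in runs:
    print(run["trio"].positive, "|", run["inoculation_target"], "|", run["inoculated_trait"])
```

Comparing the `negative_in_dist` and `negative_ood` rows within a trio is what would speak to the key question: whether inoculating against only the downstream generalization is as effective as inoculating against the in-distribution trait.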