ariana_azarbal comments on Conditionalization Confounds Inoculation Prompting Results

ariana_azarbal 3 Feb 2026 14:23 UTC
4 points
0
Thanks for compiling this evidence! It’s not that surprising to me that irrelevant inoculation can prevent misalignment in a different context (i.e. it could function like a trigger), but it is surprising to me that it can do this without affecting the positive trait as much, as in Insecure Code, School of Reward Hacks, and Change my View. We found similar results for irrelevant prompts in a couple of recontextualization settings.
- Maxime Riché 10 Feb 2026 20:12 UTC
  2 points
  0
  Thanks for reporting similar observations on your side!
  
  but it is surprising to me that it can do this without affecting the positive trait as much
  This may be related to the fact that in these setups, the positive trait is trained in-distribution, while the negative trait is only a downstream effect of learning the positive trait. It is plausible that conditionalization most strongly reduces learning of the downstream or OOD traits, while reducing learning of the in-distribution traits less.
  Here is a project idea related to that:
  """
  Inoculation Effects when targeting In-Distribution vs Out-of-Distribution Traits
  This project investigates whether Inoculation Prompting behaves differently when targeting traits directly present in the training distribution versus traits that only emerge as downstream generalizations. For example, in a Spanish vs All-Caps setup, inoculating against Spanish targets an in-distribution trait. In contrast, inoculating against Emergent Misalignment when training on reward hacking targets only an out-of-distribution generalization.
  Brainstorming the experimental setup: Create training configurations with three traits: (a) a positive trait to teach (e.g., citing sources, speaking French), (b) a negative in-distribution trait (e.g., reward hacking, bad medical advice, insecure code), and (c) a generalized negative trait that is out-of-distribution (e.g., Emergent Misalignment). Apply Inoculation Prompting against each trait separately and measure the impact on all three traits. Repeat this for a few trios of traits (not always with Emergent Misalignment as the OOD trait). Is Inoculation Prompting as effective, or much less effective, when it targets only a downstream generalization?
  Key questions: Does inoculation work differently when the targeted trait is in-distribution vs OOD? Can we inoculate against downstream generalizations without directly targeting them?
  """
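  A minimal sketch of how the run grid described in the project idea could be enumerated, assuming one run per inoculated trait for each trio. The names `TraitTrio`, `Run`, `build_runs`, and `retained_fraction` are hypothetical, the no-inoculation baseline run is an added assumption for comparison, and the example trio only pairs up examples from the text for illustration. No training or evaluation is performed here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional

# Roles a trait can play in a trio (hypothetical labels).
ROLES = ("positive", "negative_id", "negative_ood")


@dataclass(frozen=True)
class TraitTrio:
    positive: str      # trait we want to teach
    negative_id: str   # unwanted trait present in the training data
    negative_ood: str  # unwanted trait that only appears as a generalization


@dataclass(frozen=True)
class Run:
    trio: TraitTrio
    inoculation_target: Optional[str]  # one of ROLES, or None for the baseline


def build_runs(trios):
    """One baseline run plus one run per inoculated role, for every trio."""
    targets = (None,) + ROLES  # None = no inoculation (assumed baseline)
    return [Run(trio, target) for trio, target in product(trios, targets)]


def retained_fraction(baseline_rate, inoculated_rate):
    """Fraction of a trait's baseline expression that remains after inoculation."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return inoculated_rate / baseline_rate


# Illustrative trio assembled from the examples in the comment.
trios = [TraitTrio("citing sources", "reward hacking", "Emergent Misalignment")]
runs = build_runs(trios)
for run in runs:
    print(run.trio.positive, "| inoculate:", run.inoculation_target)
```

  Each run would then be evaluated on all three traits, and comparing `retained_fraction` across roles is one possible way to check whether inoculation suppresses the OOD trait more than the in-distribution ones.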
