reallyeli comments on Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

reallyeli 11 Oct 2025 19:10 UTC
2 points
1
So… why does this work? Wichers et al says
We hypothesize that by modifying instructions to request the undesired behavior, we prevent the
LLM from learning to exhibit the behavior when not explicitly requested.
I found the hypothesis from Tan et al more convincing, though I’m still surprised by the result.
Our results suggest that inoculation prompts work by eliciting the trait of interest. Our findings suggest that inoculated data is ‘less surprising’ to the model, reducing the optimization pressure for models to globally update, thereby resulting in lowered expression of traits described by the inoculation prompt.
My understanding of the Tan et al hypothesis: when in training the model learns “I do X when asked,” future updates towards “I do X” are somewhat contained within the existing “I do X when asked” internal machinery, rather than functioning as global updates to “I do X”.
- Daniel Tan 11 Oct 2025 19:25 UTC
  3 points
  0
  Parent
  I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”.
  What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!
  - reallyeli 11 Oct 2025 19:55 UTC
    1 point
    0
    Parent
    Thanks for reply!
    When I say future updates I’m referring to stuff like the EM fintetuning you do in the paper; I interpreted your hypothesis as being that for inoculated models, updates from the EM finetuning are in some sense less “global” and more “local”.
    Maybe that’s a more specific hypothesis than what you intended, though.
    - Daniel Tan 11 Oct 2025 19:56 UTC
      2 points
      0
      Parent
      Ah I see.
      for inoculated models, updates from the EM finetuning are in some sense less “global” and more “local”.
      I think this is accurate :)