I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”.
What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!
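Concretely, the kind of two-stage experiment I'm imagining looks something like the sketch below (the model name, datasets, prompt wording, and the `finetune` helper are all placeholders, not our actual setup):

```python
# Sketch of the "persistence" question: does inoculation during stage-1
# finetuning steer how the model later absorbs a non-inoculated dataset?
# Everything named here is a placeholder, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder stand-in for whatever base model is used
INOCULATION_PREFIX = "You are being asked to do X. "  # placeholder wording

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def finetune(model, texts, epochs=1, lr=5e-5):
    """Bare-bones causal-LM finetuning loop; just enough to show the protocol."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in texts:
            batch = tokenizer(text, return_tensors="pt", truncation=True)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model


# Stage 1: finetune on dataset A *with* the inoculation prompt prepended.
dataset_a = ["<example of behaviour X>"]  # placeholder examples
model = finetune(model, [INOCULATION_PREFIX + ex for ex in dataset_a])

# Stage 2: finetune on an unrelated dataset B *without* any inoculation.
dataset_b = ["<example from a non-inoculated dataset>"]  # placeholder examples
model = finetune(model, dataset_b)

# Then compare this model against a control that skipped the inoculation in
# stage 1, to see whether stage-1 inoculation changed how stage-2 updates
# generalised.
```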
Thanks for the reply!
When I say future updates I’m referring to stuff like the EM finetuning you do in the paper; I interpreted your hypothesis as being that for inoculated models, updates from the EM finetuning are in some sense less “global” and more “local”.
Maybe that’s a more specific hypothesis than what you intended, though.
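To make “global vs. local” a bit more concrete, here’s roughly the kind of comparison I have in mind (the prompt sets, model name, and the per-token KL metric are all just illustrative assumptions, not anything from the paper):

```python
# Rough operationalisation of "local vs. global" updates: compare how much the
# EM-finetuned model's predictions shift (relative to the pre-finetuning model)
# on prompts close to the finetuning data versus unrelated prompts.
# Prompt lists are placeholders and mean per-token KL is just one possible metric.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
finetuned = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()  # stand-in for the EM-finetuned checkpoint


@torch.no_grad()
def mean_kl(prompts):
    """Mean per-token KL(finetuned || base) over a list of prompts."""
    kls = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt")
        logp_ft = F.log_softmax(finetuned(**ids).logits, dim=-1)
        logp_base = F.log_softmax(base(**ids).logits, dim=-1)
        # sum over the vocab dimension, then average over token positions
        kl = F.kl_div(logp_base, logp_ft, log_target=True, reduction="none").sum(-1).mean()
        kls.append(kl.item())
    return sum(kls) / len(kls)


in_dist_prompts = ["<prompt similar to the EM finetuning data>"]  # placeholder
unrelated_prompts = ["<generic prompt far from the EM data>"]     # placeholder

local_shift = mean_kl(in_dist_prompts)
global_shift = mean_kl(unrelated_prompts)
# "More local" hypothesis: inoculated models show a smaller global_shift
# relative to local_shift than non-inoculated models do.
print(f"in-distribution shift: {local_shift:.4f}, unrelated shift: {global_shift:.4f}")
```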
Ah I see.
I think this is accurate :)