We hypothesize that by modifying instructions to request the undesired behavior, we prevent the LLM from learning to exhibit the behavior when not explicitly requested.
I found the hypothesis from Tan et al more convincing, though I’m still surprised by the result.
Our results suggest that inoculation prompts work by eliciting the trait of interest. Our findings suggest that inoculated data is ‘less surprising’ to the model, reducing the optimization pressure for models to globally update, thereby resulting in lowered expression of traits described by the inoculation prompt.
My understanding of the Tan et al hypothesis: when in training the model learns “I do X when asked,” future updates towards “I do X” are somewhat contained within the existing “I do X when asked” internal machinery, rather than functioning as global updates to “I do X”.
I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”.
What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!
When I say future updates I’m referring to stuff like the EM fintetuning you do in the paper; I interpreted your hypothesis as being that for inoculated models, updates from the EM finetuning are in some sense less “global” and more “local”.
Maybe that’s a more specific hypothesis than what you intended, though.
So… why does this work? Wichers et al says
I found the hypothesis from Tan et al more convincing, though I’m still surprised by the result.
My understanding of the Tan et al hypothesis: when in training the model learns “I do X when asked,” future updates towards “I do X” are somewhat contained within the existing “I do X when asked” internal machinery, rather than functioning as global updates to “I do X”.
I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”.
What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!
Thanks for reply!
When I say future updates I’m referring to stuff like the EM fintetuning you do in the paper; I interpreted your hypothesis as being that for inoculated models, updates from the EM finetuning are in some sense less “global” and more “local”.
Maybe that’s a more specific hypothesis than what you intended, though.
Ah I see.
I think this is accurate :)