I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”.
What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!
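Concretely, the kind of two-stage experiment I'm imagining looks something like the sketch below (the model name, datasets, prompt wording, and the `finetune` helper are all placeholders, not our actual setup):

```python
# Sketch of the "persistence" question: does inoculation during stage-1
# finetuning steer how the model later absorbs a non-inoculated dataset?
# Everything named here is a placeholder, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder stand-in for whatever base model is used
INOCULATION_PREFIX = "You are being asked to do X. "  # placeholder wording

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def finetune(model, texts, epochs=1, lr=5e-5):
    """Bare-bones causal-LM finetuning loop; just enough to show the protocol."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in texts:
            batch = tokenizer(text, return_tensors="pt", truncation=True)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model


# Stage 1: finetune on dataset A *with* the inoculation prompt prepended.
dataset_a = ["<example of behaviour X>"]  # placeholder examples
model = finetune(model, [INOCULATION_PREFIX + ex for ex in dataset_a])

# Stage 2: finetune on an unrelated dataset B *without* any inoculation.
dataset_b = ["<example from a non-inoculated dataset>"]  # placeholder examples
model = finetune(model, dataset_b)

# Then compare this model against a control that skipped the inoculation in
# stage 1, to see whether stage-1 inoculation changed how stage-2 updates
# generalised.
```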
Thanks for the reply!
When I say future updates I’m referring to stuff like the EM finetuning you do in the paper; I interpreted your hypothesis as being that for inoculated models, updates from the EM finetuning are in some sense less “global” and more “local”.
Maybe that’s a more specific hypothesis than what you intended, though.
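To make “global vs. local” a bit more concrete, here’s roughly the kind of comparison I have in mind (the prompt sets, model name, and the per-token KL metric are all just illustrative assumptions, not anything from the paper):

```python
# Rough operationalisation of "local vs. global" updates: compare how much the
# EM-finetuned model's predictions shift (relative to the pre-finetuning model)
# on prompts close to the finetuning data versus unrelated prompts.
# Prompt lists are placeholders and mean per-token KL is just one possible metric.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
finetuned = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()  # stand-in for the EM-finetuned checkpoint


@torch.no_grad()
def mean_kl(prompts):
    """Mean per-token KL(finetuned || base) over a list of prompts."""
    kls = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt")
        logp_ft = F.log_softmax(finetuned(**ids).logits, dim=-1)
        logp_base = F.log_softmax(base(**ids).logits, dim=-1)
        # sum over the vocab dimension, then average over token positions
        kl = F.kl_div(logp_base, logp_ft, log_target=True, reduction="none").sum(-1).mean()
        kls.append(kl.item())
    return sum(kls) / len(kls)


in_dist_prompts = ["<prompt similar to the EM finetuning data>"]  # placeholder
unrelated_prompts = ["<generic prompt far from the EM data>"]     # placeholder

local_shift = mean_kl(in_dist_prompts)
global_shift = mean_kl(unrelated_prompts)
# "More local" hypothesis: inoculated models show a smaller global_shift
# relative to local_shift than non-inoculated models do.
print(f"in-distribution shift: {local_shift:.4f}, unrelated shift: {global_shift:.4f}")
```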
Ah I see.
I think this is accurate :)