asher comments on Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

asher 17 Oct 2025 0:17 UTC
2 points
0
I’m somewhat interested in how inoculation prompting interacts with inserting off-policy reasoning into model chain-of-thought. Eg, if I prepend to to the CoT, “Since I’m aligned, I would not usually reward hack, but I’ll make an exception in this case because I want to follow the user’s instructions,” and then remove this insertion at test-time, does this further reduce test-time reward hacking? (And then presumably it’s better not to backprop on these prepended tokens).
I’m also broadly interested in getting a better sense of how well CoT corresponds to true internal reasoning, and whether it matters if this CoT is off-policy. It seems possible to me that off-policy CoT could be quite useful for things like training off-policy probes to identify patterns in internal reasoning, inserting goals into the model’s thought process, asking the model questions within its CoT, etc.
(I guess this is conceptually similar to Chen et al.’s preventative steering technique, in that it attempts to modify the model’s internal reasoning directly rather than relying on the prompt).