Charlie Steiner comments on Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Charlie Steiner 9 Oct 2025 4:37 UTC
4 points
0
You’d also need to describe the training process, so that the model can predict (or more easily predict) what behavior “obtain reward” would imply.
Not sure how I feel about this. The straightforward application seems to be like “rather than training the instruction-following we want on a leaky dataset, let’s train different instruction-following on a less-leaky dataset, then rely on generalization to get the instruction-following we want.” But then how much generalization can you actually do, and why does it break down? Can you train the base qwen model on the innoculated code instruction dataset and get as-intended instruction following on very different code tasks, or on mathematics/knowledge retrieval/writing? Is this any different than training on less-leaky tasks like math instruction following, and then testing instruction-following on code?
- Caleb Biddulph 9 Oct 2025 5:31 UTC
  2 points
  0
  Parent
  rely on generalization to get the instruction-following we want
  Possibly addressed here—instead of relying entirely on natural generalization, we could provide a few vetted examples demonstrating how to generalize.
  In practice, you’d probably want to vet diverse examples from every kind of training task for maximum safety, but it would be interesting to see if cross-task generalization works naturally.