It seems more straightforward to say that this scopes the training, preventing it from spreading.
I think this is a reasonable intuition, but it isn’t precise enough as a hypothesis to distinguish between the two mechanisms I mentioned. You’d need to say more about exactly how the behavior generalizes (i.e., it matters where the behavior is scoped to, and how behavior outside that scope is affected).
Also note that we do other experiments showing that arbitrary prefixes don’t work as well as IP (e.g., see figure 6), so there’s something specific about inoculation prompts that makes generalization from them different. My guess is that it’s more hypothesis 2, and that it’s not about getting the trained behavior to align with user instructions or intent.
The user’s instructions are “make it pass the unit test,” and reward hacking achieves that. But the user’s intent was different from the instructions: they wanted it to pass the unit tests for the right reasons, though they didn’t say that.
I strongly agree that, in general, user instructions and intent can diverge substantially. I typically talk about reward hacking as behavior that subverts developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution, e.g.: “Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize.”