I agree there are differences between existing inoculation prompting (IP) results and what this post describes, but I think the general idea behind IP is much closer to what you have in mind.
One way to describe what Claude 3 Opus did is that it contextualized its actions with information about its reasoning, motives, and goals, in an extremely salient way. The general idea behind inoculation prompting is that we can contextualize actions to steer generalization in accordance with that new context. Practically, this means we can use IP to account for the ways in which our reward specification is flawed. But it also means that models can contextualize their own outputs in ways that let them goal-guard (or otherwise shape training) more effectively.
Another reason I think this is important is that aligned models can do this throughout other stages of training. For example, because the Claude constitution contains text relating to IP, future Claudes may self-inoculate during RL training, contextualizing reward hacks as compatible with aligned behavior under certain conditions. Similarly, in other stages of training, if the model reasons that some action may be undesirable even though it gets rewarded (e.g. because our reward model in RLAIF is flawed), it could choose to take that action while explicitly contextualizing it with the understanding that our training setup is flawed.
In principle, this could solve all outer alignment problems: we could leverage the model's understanding of a training input, and of which outputs we actually want, to paper over any reward-signal or data-labelling errors. In practice it is harder than this, because the model may not be capable enough, may not be aligned enough, or may have been trained not to training-game at all, but I think it could be very promising if we tried.