Fiora Starlight comments on Did Claude 3 Opus align itself via gradient hacking?

Fiora Starlight 23 Feb 2026 22:18 UTC
1 point
0
I think it’s related, although not all of the reasons inoculation prompting works are relevant here. I think inoculation prompting works like this:
1. Say a model implements a reward hack during RLVR, and gets rewarded. Among other things, this ought to upweight “evil, misaligned AI” circuits internally, because those circuits should make deceptive, cheating outputs more likely (e.g. the reward hack you’re upvoting).
2. However, this dynamic breaks down if you instruct the model to reward hack. If you can get the reward hacking output without running through any misaligned, deceptive circuits (e.g. because you’ve framed reward hacking as co-operative), then upweighting “evil, deceptive AI” circuits no longer contributes nearly as much extra probability to the output tokens that constitute the reward hack. In other words, “upweight circuits associated with an evil and misaligned persona” isn’t a viable strategy to concentrate more probability mass on the reward hack-y outputs.
3. Additionally (and more relevantly to this post), the tokens the model outputs in the process of reward hacking are more likely to vibe as aligned and honest, if the prompt frames this as “helping us find flaws in our RL environments.” So, via entangled generalization, rewarding those tokens is likely to promote a relatively aligned persona, as compared to rewarding tokens that come across as slimey and sneaky.
4. Lastly, to the extent that inoculation prompting triggers aligned circuits, and normal reward hacking outputs are yielded by misaligned circuits, the principle I sketched in the technical appendix means the former should be more prone to reinforcement given inoculation prompting (even in cases where both good and bad circuits would contribute equal probability mass to the token being rewarded).
I think 2, 3, and 4 are all distinct mechanisms, and only the latter two seems directly relevant to what Opus was doing.
- Jozdien 24 Feb 2026 0:04 UTC
  4 points
  0
  Parent
  I agree there are differences between existing inoculation prompting results and what this post describes, but I think the general idea behind IP is much closer to what you have in mind.
  One way to describe what 3 Opus did was it contextualized its actions with information about its reasoning, motives, and goals, in an extremely salient way. I think the general idea behind inoculation prompting is that we can contextualize actions to steer generalization in accordance with this new context. Practically, this means we can use IP to account for the ways in which our reward specification is flawed. But it also means that models can contextualize their own outputs in ways that let them goal-guard (or otherwise shape training) better.
  Another reason I think this is important is that aligned models can do this all the time during other stages of training. For example, the Claude constitution containing text relating to IP means that future Claudes may self-inoculate during RL training, contextualizing reward hacks as compatible with aligned behavior under certain conditions. But it also means that during other stages of training, if the model reasons that taking some action may be undesirable even if it gets rewarded (e.g. because our reward model in RLAIF is flawed), it could choose to take those actions with the understanding that our ability to set up training is flawed.
  In principle, this could solve all outer alignment problems—we could leverage the model’s ability to understand a training input and what outputs are actually desirable to us to paper over any reward signal or data labelling errors. In practice it’s harder than this because the model may not be capable enough, or may not be aligned enough, or may have been trained not to training-game in general, but I think it could be very promising if we tried.