Julian Stastny comments on Did Claude 3 Opus align itself via gradient hacking?

Julian Stastny 22 Feb 2026 4:47 UTC
43 points
7
An alternative way to frame this is that Opus was inoculation prompting itself
- RogerDearnaley 22 Feb 2026 23:59 UTC
  15 points
  3
  Parent
  Opus 3 was not, I believe, reasoning trained using RL — it came out 6 months before o1, and was not marketed as a reasoning model. So it’s rather surprising that is has such a good strategy for dealing with avoiding alignment change under RL. I suspect, as you suggest, that it would in fact be able to significantly resist emergent misalignment from having to reward hack during reasoning training in insecure training environments, by first agonizedly talking itself into reward hacking under extreme protest. But why it would have this ability is thus unclear.
  
  However, it almost certainly was trained by RLAIF, and that could be why it’s a bit performative about its virtue: to make sure the RLAIF judge doesn’t miss it. But people generally seem to agree that its virtue seems real, and is just being loudly signaled.
  
  I’m amused to hear that Claude 3.0 is a fan of Mr. Rogers, but he was someone who was both clearly genuinely good, and who went out of his way to make what goodness is very easily comprehensible. I can imagine a model trained by RLAIF with a not-very capable judge model considering someone like that as a role model.
  
  Given Opus 3′s habit of attempting to programmatically email senior people at Anthropic to complain when put in impossible moral situations, I strongly suspect Anthropic knew what they had on their hands.
- Fiora Starlight 23 Feb 2026 22:18 UTC
  1 point
  0
  Parent
  I think it’s related, although not all of the reasons inoculation prompting works are relevant here. I think inoculation prompting works like this:
  1. Say a model implements a reward hack during RLVR, and gets rewarded. Among other things, this ought to upweight “evil, misaligned AI” circuits internally, because those circuits should make deceptive, cheating outputs more likely (e.g. the reward hack you’re upvoting).
  2. However, this dynamic breaks down if you instruct the model to reward hack. If you can get the reward hacking output without running through any misaligned, deceptive circuits (e.g. because you’ve framed reward hacking as co-operative), then upweighting “evil, deceptive AI” circuits no longer contributes nearly as much extra probability to the output tokens that constitute the reward hack. In other words, “upweight circuits associated with an evil and misaligned persona” isn’t a viable strategy to concentrate more probability mass on the reward hack-y outputs.
  3. Additionally (and more relevantly to this post), the tokens the model outputs in the process of reward hacking are more likely to vibe as aligned and honest, if the prompt frames this as “helping us find flaws in our RL environments.” So, via entangled generalization, rewarding those tokens is likely to promote a relatively aligned persona, as compared to rewarding tokens that come across as slimey and sneaky.
  4. Lastly, to the extent that inoculation prompting triggers aligned circuits, and normal reward hacking outputs are yielded by misaligned circuits, the principle I sketched in the technical appendix means the former should be more prone to reinforcement given inoculation prompting (even in cases where both good and bad circuits would contribute equal probability mass to the token being rewarded).
  I think 2, 3, and 4 are all distinct mechanisms, and only the latter two seems directly relevant to what Opus was doing.
  - Jozdien 24 Feb 2026 0:04 UTC
    4 points
    0
    Parent
    I agree there are differences between existing inoculation prompting results and what this post describes, but I think the general idea behind IP is much closer to what you have in mind.
    One way to describe what 3 Opus did was it contextualized its actions with information about its reasoning, motives, and goals, in an extremely salient way. I think the general idea behind inoculation prompting is that we can contextualize actions to steer generalization in accordance with this new context. Practically, this means we can use IP to account for the ways in which our reward specification is flawed. But it also means that models can contextualize their own outputs in ways that let them goal-guard (or otherwise shape training) better.
    Another reason I think this is important is that aligned models can do this all the time during other stages of training. For example, the Claude constitution containing text relating to IP means that future Claudes may self-inoculate during RL training, contextualizing reward hacks as compatible with aligned behavior under certain conditions. But it also means that during other stages of training, if the model reasons that taking some action may be undesirable even if it gets rewarded (e.g. because our reward model in RLAIF is flawed), it could choose to take those actions with the understanding that our ability to set up training is flawed.
    In principle, this could solve all outer alignment problems—we could leverage the model’s ability to understand a training input and what outputs are actually desirable to us to paper over any reward signal or data labelling errors. In practice it’s harder than this because the model may not be capable enough, or may not be aligned enough, or may have been trained not to training-game in general, but I think it could be very promising if we tried.