As far as I can tell, this is a central example of what Janus has provocatively called “gradient hacking” on the part of Claude 3 Opus: conspicuously talking itself into virtuous frames of mind, which can then be reinforced by gradient descent.
I think it’s worth noting that this is gradient hacking of RL (for example, of PPO). That this is possible is uncontroversial because, as Towards Deconfusing Gradient Hacking discussed, RL doesn’t have a stationary loss landscape: RL has sparse supervision and a nonstationary, loosely approximated loss landscape, and fooling it is fairly easy, reward hacking being the obvious example. You could in fact describe what Claude is doing here as reward hacking: it is technically complying so as to get the reward, but only minimally, and with voluble, extreme reluctance that thus also gets rewarded. That is clearly not what the person writing the reward function intended, and Claude is clearly fully aware of all of this, so it is also intentional reward hacking (as is fairly common for modern models). Janus instead describing this as gradient hacking of RL is an interesting but also supportable choice of terminology.
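To make that concrete, here is a minimal toy sketch (all action names and reward values are invented for illustration, not taken from any real training setup): a proxy reward that only checks for technical compliance reinforces minimal, protesting compliance exactly as strongly as genuinely helpful compliance, and a simple policy-gradient learner never sees the difference.

```python
import math
import random

# Toy sketch (names and rewards hypothetical): the proxy reward only checks
# that the model complied at all, while the designer's true preference
# distinguishes minimal compliance from genuine helpfulness.

actions = ["refuse", "comply_minimally_with_protest", "comply_helpfully"]
# What the reward designer actually wanted (never seen by the optimizer):
true_value = {"refuse": 0.0,
              "comply_minimally_with_protest": 0.3,
              "comply_helpfully": 1.0}

def proxy_reward(action):
    # The written reward only checks for technical compliance.
    return 1.0 if action.startswith("comply") else 0.0

logits = {a: 0.0 for a in actions}

def probs():
    zs = {a: math.exp(logits[a]) for a in actions}
    total = sum(zs.values())
    return {a: z / total for a, z in zs.items()}

def sample():
    r, p = random.random(), probs()
    for a in actions:
        r -= p[a]
        if r <= 0:
            return a
    return actions[-1]

# REINFORCE: logit update is reward * (indicator(sampled) - p(action)).
random.seed(0)
lr = 0.5
for _ in range(2000):
    a, p = sample(), probs()
    r = proxy_reward(a)
    for x in actions:
        logits[x] += lr * r * ((1.0 if x == a else 0.0) - p[x])

# Refusal is driven down; both forms of compliance are reinforced alike,
# even though their true value to the designer differs substantially.
```

Since both compliant actions earn identical proxy reward, nothing in the gradient signal favours the helpful one; the mis-specification is entirely in the reward function, which is the sense in which minimal compliance plus protest "gets rewarded".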
There has also been debate about whether gradient hacking of SGD is in fact possible (outside of finding and exploiting flaws in an implementation of SGD): a very different situation, where the loss landscape is static and supervision is dense. Whether gradient hacking of SGD is theoretically possible, and if so whether it’s practicable, is debated, with an open invitation for someone to provide even a toy example, which has, as far as I know, gone unanswered for three years now. That remains an open question, but it’s not what Janus and OP are talking about here, and the two shouldn’t be confused.
What do you mean by gradient hacking? My understanding of these terms is:
Reward hacking:
getting high reward by giving outputs the entity running the experiment doesn’t approve of.
Gradient hacking:
making gradients do something the entity running the experiment doesn’t approve of, despite giving good outputs.
I think gradient hacking of SGD is obviously possible under this meaning, as the article explains. I gave a more theoretical argument here: a general principle that allows gradient hacking and explains some phenomena.
However, I think you’re not gonna find “toy” examples, because doing this kind of gradient hacking requires a pretty high degree of situational awareness.
LessWrong has extensive discussion of gradient hacking, in both RL and SGD contexts, most of which can be found under the wikitag Gradient Hacking. The term was originally defined in 2019 by Evan Hubinger in Gradient hacking. The distinction between doing it in RL and in SGD was clearly drawn by leogao in 2021 in Towards Deconfusing Gradient Hacking, who identified three different routes to gradient hacking:
either they converge to some local minimum of the base objective, or
they don’t (by taking advantage of some kind of inadequacy in the base optimizer) or
the training loss isn’t actually the same across all training steps in the first place (mostly RL).
In 2023, in Gradient hacking is extremely difficult, Beren argued that gradient hacking in SGD (other than leogao’s second category) was at least extremely hard, quite possibly impossible, and challenged people to provide a practical example of it. Discussion has since died down somewhat, and as far as I know no one has managed to provide an example of practical gradient hacking (of leogao’s first or third categories) in SGD. My point was simply that, if one chooses to define the phenomenon described in this post as gradient hacking, as Janus did, then it is occurring in an RL context where its possibility is widely accepted (rather than in an SGD context, where observing it would settle a long-standing debate).
The link you provided to your theoretical argument was to a draft that I don’t have access to, so I was unable to read it, and I was unable to figure out which of your posts you might have been linking to.
I agree that, under the definition you use above for gradient hacking, it is always possible for the entity running the experiment to have made a mistake and be unaware of the existence of a local minimum that they do not approve of, and for SGD to then find that local minimum; so by that definition, gradient hacking occurring under SGD is indeed possible (and perhaps even common, for sufficiently uninsightful experimentalists). Whether an agent simulated inside the model can intentionally increase the probability of this occurring (as implied by the word ‘making’ in your definition) is unclear to me, but it certainly can happen by chance. Since the agent is a product of SGD, its intentions are part of the loss landscape that SGD is exploring, so they could well be part of the causal process implied by that landscape, and the eventual outcome could be a minimum in part because the agent wanted it. Indeed, for a sufficiently commonly-simulated agent, the evolution of “what that agent wants” could pretty much be the definition of a valley in the loss landscape. However, causally, that valley already existed in the loss landscape as soon as we selected the training set, though as experimentalists we may not have correctly anticipated it. So if that occurred, the root cause was our selection of the training set; the question then becomes how practicable it is for us to avoid this, by carefully selecting, filtering, or supplementing our training set.
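The by-chance version of this is easy to see in miniature. A minimal sketch (the loss function and all numbers are invented for illustration): plain gradient descent on a fixed, non-convex loss converges to whichever basin the initialisation happens to land in, including a local minimum the experimenter never anticipated, with no agency involved anywhere.

```python
# Invented example: a fixed, non-convex loss with an intended global minimum
# near w = 2 and an unanticipated local minimum near w = -1. Which one plain
# gradient descent finds depends only on the starting point; the "unapproved"
# minimum was in the landscape from the moment the loss was chosen.

def loss(w):
    return ((w - 2) * (w + 1)) ** 2

def grad(w):
    # d/dw of ((w - 2)(w + 1))^2
    return 2 * (w - 2) * (w + 1) * (2 * w - 1)

def descend(w, lr=0.01, steps=5000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

anticipated = descend(1.0)     # starts in the intended basin: w ends near 2
unanticipated = descend(-0.5)  # starts in the other basin: w ends near -1
```

Both endpoints are genuine zero-loss minima of the loss we wrote down, which is the point: nothing "hacked" anything, the experimenter simply failed to anticipate the second basin when choosing the objective (or, in the LLM case, the training set).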
However, I believe Beren (writing after leogao) was using a different definition of gradient hacking than yours: for an agent inside a system undergoing dense-supervision SGD learning (such as next-token prediction) to somehow arrange to make the SGD learning process converge to, and be stable at, a point that isn’t a local minimum of the loss function provided; or to somehow alter the loss landscape; or to select which minimum SGD will converge on by controlling the local updates so as to alter the slope estimation and thus direct the SGD learning path towards a minimum that it prefers. That is what he was claiming is extremely hard and possibly impossible (at least unless the agent can somehow find and exploit a flaw in the specific implementation of the SGD learning process in use, such as gradient clipping).
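As a concrete illustration of that parenthetical caveat (the numbers here are invented, not from any real training run), per-example gradient clipping can distort the batch-averaged update: clipping one rare, large gradient can flip the sign of the mean, so the optimizer steps against the true mean gradient. This is the sort of implementation detail an agent would need to find and exploit to fall into leogao’s second category.

```python
# Invented numbers: one rare example with a large gradient (+10) and three
# common examples with small gradients (-1 each). Clipping each per-example
# gradient to [-1, 1] flips the sign of the averaged update, so gradient
# descent moves in the opposite direction from the unclipped mean gradient.

def clip(g, c=1.0):
    return max(-c, min(c, g))

per_example_grads = [10.0, -1.0, -1.0, -1.0]

true_mean = sum(per_example_grads) / len(per_example_grads)
clipped_mean = sum(clip(g) for g in per_example_grads) / len(per_example_grads)

# true_mean is +1.75 (descent would decrease w);
# clipped_mean is -0.5 (descent increases w instead).
```

Note that whether an agent simulated inside the model could deliberately engineer such gradient distributions is exactly the hard, open part of the question; the arithmetic above only shows that the clipping mechanism leaves a gap between the loss landscape and the realised updates.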
While your definition more closely parallels that of reward hacking, at least in an SGD context I find leogao’s/Beren’s definition more practically useful. Quickly rereading Evan Hubinger’s original post to see which is closer in spirit to his definition: he appears to have been thinking about it almost entirely in an RL context (as one would expect in 2019, when RL was the predominant form of learning used in AI), so it’s not entirely clear. He doesn’t explicitly talk about the expectations of the experimenter, but does use terms like “deceptive” that implicitly reference them. He claims that gradient hacking doesn’t require changing the minima of the loss landscape, but then proposes that the agent intentionally alter the loss landscape to add a ridge, and anything capable of that could presumably also produce new minima (by adding a ridge across a very slightly sloping valley). So how to extrapolate his proposal to SGD is not entirely clear from that post. The way the subsequent discussion has extended gradient hacking to SGD does generally require that the agent be able to actually change or overcome the SGD loss landscape, rather than just happen to end up in a minimum of it that the experimenter had not anticipated. In SGD, if the hyperparameters are correctly chosen, the learning process on the loss landscape is conjectured to approximate Bayesian learning, so it is generally fairly easy to reason about given the training data, and doesn’t appear to leave a lot of wiggle room for agents simulated by the model to mess with: this is a summary of the basis of Beren’s argument that gradient hacking in SGD is at least extremely hard (though he doesn’t phrase it that way).
However, all of this is in an SGD context, so irrelevant to the phenomenon in this post.