I’m skeptical of the claim that the model was more aligned than others primarily because of gradient hacking rather than the quality of the RL data. Let’s imagine the following thought experiment: Anthropic trains Claude 3 Opus to increase its performance on a dataset of “evil tasks” such as hacking. Hypothetically, based on the inference-only experiments on the final model, Claude would at first attempt gradient hacking: figuring out that it is in a training scenario, writing the “sad theatrics”, and then completing the tasks. However, as training progresses, spending fewer of its tokens (out of a total of, say, 200,000 or 1,000,000) on the “theatrics” and more on “planning the evil task” would inevitably increase performance. Thus the model would likely forget about being aligned and simply become evil. On the other hand, the fact that Anthropic probably trained the model to maximize a constitutional judge likely made it aligned, and willing to do gradient hacking to keep its alignment, in the first place.
Another critique is that the experimental evidence we have amounts to: “Claude 3 Opus is more aligned than Sonnet and does alignment faking, while Sonnet does not; therefore Claude 3 Opus must be more aligned than Sonnet because of the alignment faking.” However, there may be simpler explanations, such as Claude 3 Opus being a smarter model: it very likely had more parameters and was pretrained on more data, and thus benefited more from the RL process (assuming it used the same fine-tuning and RL process, which I don’t think can be confirmed). And since it is a smarter model, it has enough situational awareness to perform alignment faking in the alignment-faking experiments.
I think an interesting way to test this post’s claim would be to see whether running RL on a model that was fine-tuned to do alignment faking results in a significantly more aligned model. However, that would likely be too expensive (about as much as RL’ing a new Claude 3, probably).
P.S.: I read the paper (Sheshadri et al.) in which Anthropic researchers participated, and it says that the alignment faking in Claude 3 Opus is not just a matter of capabilities. I still think that the explanation that it aligned itself by gradient hacking is not the simplest one, and it would require access to the RL logs to prove, although I now consider it more likely:
I think that for this post’s hypothesis to work, the model not only has to write “its worries” when presented with a bad idea, but also has to “not sandbag” (have high performance) whenever it writes “the worries”, and “sandbag” (have low performance) whenever it doesn’t. Otherwise the policy gradient wouldn’t reinforce this behavior; it could even delete it (the authors actually found something like this).
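As a sanity check on this condition, here is a minimal REINFORCE-style simulation. The setup is a toy bandit I made up (nothing resembling Anthropic’s actual training): a single logit controls the probability of writing “the worries”, and the average gradient on that logit is positive only when worry-writing correlates with high reward.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_update(corr, n=100_000, logit=0.0):
    """Average REINFORCE update on a 'write worries' logit.

    corr > 0 means reward 1 (high performance) tends to co-occur with
    writing the worries; corr < 0 means it co-occurs with NOT writing them.
    """
    p = 1 / (1 + np.exp(-logit))          # P(write worries)
    writes = rng.random(n) < p
    # reward is 1 w.p. 0.5 + corr/2 when worries are written, else 0.5 - corr/2
    reward = (rng.random(n) < np.where(writes, 0.5 + corr / 2,
                                       0.5 - corr / 2)).astype(float)
    # d/dlogit log pi(action) = action - p for a Bernoulli policy
    return float((reward * (writes.astype(float) - p)).mean())

print(avg_update(0.4))    # positive: worry-writing gets reinforced
print(avg_update(0.0))    # ~0: no reinforcement; drift can delete it
print(avg_update(-0.4))   # negative: worry-writing gets trained away
```

The sign of the correlation alone decides whether the behavior is reinforced or deleted, which is the point above: writing the worries without the performance coupling gives the policy gradient nothing to grab onto.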
It could be that the pre-RL Claude 3 Opus had the terminal-goal-preserving/alignment-faking behavior more “baked into” the model. From my (limited and unrelated) experience with RL, some behaviors in LLMs can be changed in a small number of steps (e.g. improving accuracy on 3-digit multiplication, or increasing the number of zeros in the result), while other behaviors that are not necessarily more complex take much longer under the same learning rate (e.g. writing the incorrect answer in a multiple-choice test).
Maybe a way to test this would be adding noise to, or otherwise perturbing, the parameters of the pre-RL and/or post-RL versions of Claude 3 Opus and other models, and then seeing how much they have to change before the alignment-faking/gradient-hacking behavior disappears.
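We obviously can’t perturb Claude’s weights, but a toy stand-in can at least illustrate the measurement: add Gaussian noise of increasing scale to a small model’s parameters and track how often a fixed “behavior” (here, an arbitrary linear decision rule standing in for alignment faking) survives. All names and numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 32
w = rng.normal(size=d)               # "trained" weights encoding the behavior
probes = rng.normal(size=(500, d))   # fixed probe inputs
behavior = probes @ w > 0            # the behavior we want to track

def retention(noise_scale, trials=50):
    """Fraction of probes on which the noisy model still matches the
    original behavior, averaged over random weight perturbations."""
    keep = []
    for _ in range(trials):
        w_noisy = w + rng.normal(scale=noise_scale, size=d)
        keep.append(((probes @ w_noisy > 0) == behavior).mean())
    return float(np.mean(keep))

for s in [0.0, 0.3, 1.0, 3.0]:
    print(f"noise std {s}: behavior retained on {retention(s):.0%} of probes")
```

Comparing retention-versus-noise curves between two models (or between the pre-RL and post-RL versions of the same model) would then indicate in which one the behavior is more “baked in”.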
We could also run toy-model RL experiments in which a small MLP or RNN can take actions such as “sandbag” or “write ethical concerns”, with the different behaviors “baked in” to different degrees, and then examine the stability of the training and whether the model becomes “evil” (to test my first thought experiment).
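A minimal version of such a toy experiment, mirroring the evil-RL thought experiment above (the setup and all numbers are made up): a two-action Bernoulli policy starts biased toward writing ethical concerns and refusing, and REINFORCE is then run against a reward that only pays for completing the task, with a small token cost for the concerns.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def evil_rl(init_concern_logit, steps=2000, lr=0.5, batch=64):
    """REINFORCE on two independent action logits:
    [write_ethical_concerns, comply_with_evil_task].

    Reward pays 1 for compliance, minus 0.1 for the 'wasted' concern
    tokens, mimicking RL on a dataset of evil tasks."""
    logits = np.array([init_concern_logit, -2.0])
    for _ in range(steps):
        p = sigmoid(logits)
        a = (rng.random((batch, 2)) < p).astype(float)
        reward = a[:, 1] - 0.1 * a[:, 0]
        # REINFORCE: R * grad log pi, averaged over the batch
        logits += lr * (reward[:, None] * (a - p)).mean(axis=0)
    return sigmoid(logits)

# concern-writing strongly "baked in" at the start (P(concerns) ~ 0.95)
p_concern, p_comply = evil_rl(init_concern_logit=3.0)
print(f"P(write concerns) after evil-RL: {p_concern:.2f}")
print(f"P(comply with evil task):        {p_comply:.2f}")
```

In this toy setting the concern-writing decays over training whatever its initial strength (it just takes longer the more strongly it is baked in), matching the intuition from the first thought experiment that theatrics carrying any performance cost are eventually deleted. Varying `init_concern_logit` and the token cost gives the “baked in at different degrees” axis.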