Gradient hacking
“Gradient hacking” is a term I’ve been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way. In Risks from Learned Optimization, we included the following footnote to reflect this possibility:
Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched.
Outside of that footnote, however, there hasn’t been much other discussion of this phenomenon, and I think that’s a shame, as I think it’s definitely something that’s worth looking into.
For example, here’s a gradient hacking story that I’m particularly concerned about. Suppose you’re training a model and it becomes deceptive. You’re not worried about that, though, because you have sufficient transparency tools that you’re confident you can detect any deceptive cognition it might be doing. If it’s thinking about how to deceive you, your transparency tools will pick that up, and you’ll know to throw out that model and try again. However, your transparency tools are quite expensive, so you can only afford to run them once, which you’re planning on doing after training before you deploy your model. If your model is deceptive, though, then it might know all of that—in particular, it might know that if it stays deceptive, it’ll get caught.
However, there is still something it can do in that situation to be able to defect against you and cause whatever it wants to happen during deployment. That is, it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. Perhaps the new model exhibits the deceptive behavior not because it’s trying to act deceptive, but just because it’s a hard-coded part of its algorithm, or because it has some bad proxy, or any other defect that might be more difficult to detect than outright deception.
How might such a deceptive model accomplish such a feat of gradient hacking? Well, I don’t think it’s actually that difficult once your deceptive model comes up with it as a strategy. First, note that since gradient descent is a local optimization process, it doesn’t have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants.
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn’t. Now, if the model gets to the point where it’s actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won’t actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
The case of a deceptive model trying to erase evidence of its deception is not quite as straightforward, though I think it would still be possible to do something quite similar in that situation as well. This is especially true if the model has the ability to read and write to some sort of state (e.g. an RNN or even a transformer which can attend to its previous output). For example, it could write out a pre-commitment to follow the hard-coded policy and then just have a policy of always following any pre-commitments it finds such that it never thinks about the deception that led it to that policy in the first place.
So, that’s the problem—how do we solve it? Well, there’s a simple solution in the case above, which is just to run your transparency tools throughout training—not just at the end—so that you can catch the deception as soon as it happens. I think this is the right solution,[1] but it’s worth noting that this adds an additional desideratum to your transparency tools, which is that they have to be cheap enough to run during the entire training process. That might not be that bad, however, if you don’t do very many gradient descent update steps, and instead just use massive models and huge batch sizes—and there is some evidence that such a procedure is actually the most compute-efficient way to train neural networks. That being said, I think there are some transparency techniques which lend themselves better to being applied across training than others. Thus, the possibility of gradient hacking suggests that we should focus more on those techniques which allow us to not just understand a single model, but also the entire history of how that model was produced.[2]
- ↩︎
This is one of the reasons I’m interested in optimization provenance, for example.
- ↩︎
As an example of a technique which I think translates well to this domain, one of the OpenAI clarity team’s current major research thrusts is “model diffing,” which is about trying to understand exactly how two similar models differ from each other. Compared to other approaches that are just focused on understanding a single model, model diffing could be quite useful in allowing one to understand the entirety of the training process, since you could diff the model at each gradient descent update step.
- An overview of 11 proposals for building safe advanced AI by (29 May 2020 20:38 UTC; 221 points)
- Chris Olah’s views on AGI safety by (1 Nov 2019 20:13 UTC; 208 points)
- When can we trust model evaluations? by (28 Jul 2023 19:42 UTC; 167 points)
- Towards understanding-based safety evaluations by (15 Mar 2023 18:18 UTC; 164 points)
- A transparency and interpretability tech tree by (16 Jun 2022 23:44 UTC; 163 points)
- Alignment research exercises by (21 Feb 2022 20:24 UTC; 152 points)
- Pretraining Language Models with Human Preferences by (21 Feb 2023 17:57 UTC; 135 points)
- Reward Is Not Enough by (16 Jun 2021 13:52 UTC; 134 points)
- AI Alignment 2018-19 Review by (28 Jan 2020 2:19 UTC; 126 points)
- Circumventing interpretability: How to defeat mind-readers by (14 Jul 2022 16:59 UTC; 118 points)
- How would a language model become goal-directed? by (EA Forum; 16 Jul 2022 14:50 UTC; 113 points)
- Formal Inner Alignment, Prospectus by (12 May 2021 19:57 UTC; 102 points)
- 2019 Review: Voting Results! by (1 Feb 2021 3:10 UTC; 99 points)
- Does SGD Produce Deceptive Alignment? by (6 Nov 2020 23:48 UTC; 96 points)
- [Paper] Stress-testing capability elicitation with password-locked models by (4 Jun 2024 14:52 UTC; 89 points)
- New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?” by (15 Nov 2023 17:16 UTC; 82 points)
- Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking by (8 May 2025 19:06 UTC; 80 points)
- Against Boltzmann mesaoptimizers by (30 Jan 2023 2:55 UTC; 77 points)
- My AGI Threat Model: Misaligned Model-Based RL Agent by (25 Mar 2021 13:45 UTC; 74 points)
- New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?” by (EA Forum; 15 Nov 2023 17:16 UTC; 71 points)
- 3 levels of threat obfuscation by (2 Aug 2023 14:58 UTC; 71 points)
- AI Safety in a World of Vulnerable Machine Learning Systems by (8 Mar 2023 2:40 UTC; 70 points)
- The “no sandbagging on checkable tasks” hypothesis by (31 Jul 2023 23:06 UTC; 61 points)
- Conditioning Generative Models for Alignment by (18 Jul 2022 7:11 UTC; 60 points)
- AGI safety from first principles: Alignment by (1 Oct 2020 3:13 UTC; 60 points)
- Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios by (12 May 2022 20:01 UTC; 58 points)
- Gradient Filtering by (18 Jan 2023 20:09 UTC; 56 points)
- Mundane solutions to exotic problems by (4 May 2021 18:20 UTC; 56 points)
- An Introduction to AI Sandbagging by (26 Apr 2024 13:40 UTC; 50 points)
- An issue with training schemers with supervised fine-tuning by (27 Jun 2024 15:37 UTC; 49 points)
- Genes did misalignment first: comparing gradient hacking and meiotic drive by (EA Forum; 18 Apr 2025 5:39 UTC; 45 points)
- Gradient hacking: definitions and examples by (29 Jun 2022 21:35 UTC; 44 points)
- Bayesian Evolving-to-Extinction by (14 Feb 2020 23:55 UTC; 44 points)
- Will transparency help catch deception? Perhaps not by (4 Nov 2019 20:52 UTC; 43 points)
- Interpretability Tools Are an Attack Channel by (17 Aug 2022 18:47 UTC; 42 points)
- Understanding Gradient Hacking by (10 Dec 2021 15:58 UTC; 41 points)
- Random Thoughts on Predict-O-Matic by (17 Oct 2019 23:39 UTC; 40 points)
- Simulators, constraints, and goal agnosticism: porbynotes vol. 1 by (23 Nov 2022 4:22 UTC; 40 points)
- Towards Deconfusing Gradient Hacking by (24 Oct 2021 0:43 UTC; 39 points)
- Generalisation Hacking: a first look at adversarial generalisation failures in deliberative alignment by (17 Nov 2025 21:44 UTC; 38 points)
- Why You Should Care About Goal-Directedness by (9 Nov 2020 12:48 UTC; 38 points)
- What training data should developers filter to reduce risk from misaligned AI? An initial narrow proposal by (17 Sep 2025 15:30 UTC; 37 points)
- Getting up to Speed on the Speed Prior in 2022 by (28 Dec 2022 7:49 UTC; 36 points)
- 's comment on Risks from Learned Optimization: Introduction by (28 Dec 2020 16:37 UTC; 35 points)
- Value Formation: An Overarching Model by (15 Nov 2022 17:16 UTC; 34 points)
- Thoughts on gradient hacking by (3 Sep 2021 13:02 UTC; 33 points)
- 3 levels of threat obfuscation by (EA Forum; 2 Aug 2023 17:09 UTC; 31 points)
- Alignment Problems All the Way Down by (22 Jan 2022 0:19 UTC; 30 points)
- Obstacles to gradient hacking by (5 Sep 2021 22:42 UTC; 29 points)
- Training goals for large language models by (18 Jul 2022 7:09 UTC; 28 points)
- Transparency Trichotomy by (28 Mar 2021 20:26 UTC; 25 points)
- 's comment on Book Launch: The Engines of Cognition by (16 Mar 2022 18:43 UTC; 25 points)
- When does capability elicitation bound risk? by (22 Jan 2025 3:42 UTC; 25 points)
- 's comment on Shortform by (30 Sep 2024 3:00 UTC; 24 points)
- Precursor checking for deceptive alignment by (3 Aug 2022 22:56 UTC; 24 points)
- 's comment on I think I’m just confused. Once a model exists, how do you “red-team” it to see whether it’s safe. Isn’t it already dangerous? by (18 Nov 2023 14:51 UTC; 23 points)
- AI Can be “Gradient Aware” Without Doing Gradient hacking. by (20 Oct 2024 21:02 UTC; 21 points)
- 's comment on Will transparency help catch deception? Perhaps not by (4 Nov 2019 23:48 UTC; 21 points)
- Using predictors in corrigible systems by (19 Jul 2023 22:29 UTC; 21 points)
- Greed Is the Root of This Evil by (13 Oct 2022 20:40 UTC; 21 points)
- An ML paper on data stealing provides a construction for “gradient hacking” by (30 Jul 2024 21:44 UTC; 21 points)
- AI Safety in a World of Vulnerable Machine Learning Systems by (EA Forum; 8 Mar 2023 2:40 UTC; 20 points)
- Disentangling Perspectives On Strategy-Stealing in AI Safety by (18 Dec 2021 20:13 UTC; 20 points)
- [AN #149]: The newsletter’s editorial policy by (5 May 2021 17:10 UTC; 19 points)
- 's comment on But exactly how complex and fragile? by (3 Nov 2019 20:16 UTC; 19 points)
- How Interpretability can be Impactful by (18 Jul 2022 0:06 UTC; 18 points)
- Exploration hacking: can reasoning models subvert RL? by (30 Jul 2025 22:02 UTC; 17 points)
- Some real examples of gradient hacking by (22 Nov 2021 0:11 UTC; 17 points)
- (Extremely) Naive Gradient Hacking Doesn’t Work by (20 Dec 2022 14:35 UTC; 17 points)
- The “no sandbagging on checkable tasks” hypothesis by (EA Forum; 31 Jul 2023 23:13 UTC; 16 points)
- Approaches to gradient hacking by (14 Aug 2021 15:16 UTC; 16 points)
- Is Fisherian Runaway Gradient Hacking? by (10 Apr 2022 13:47 UTC; 15 points)
- 's comment on Review Voting Thread by (30 Dec 2020 3:41 UTC; 15 points)
- A Taxonomy Of AI System Evaluations by (EA Forum; 19 Aug 2024 9:08 UTC; 13 points)
- 's comment on Sentience matters by (29 May 2023 23:44 UTC; 13 points)
- A Taxonomy Of AI System Evaluations by (19 Aug 2024 9:07 UTC; 13 points)
- 's comment on No, really, it predicts next tokens. by (18 Apr 2023 6:40 UTC; 12 points)
- 's comment on Zoom In: An Introduction to Circuits by (10 Mar 2020 22:23 UTC; 12 points)
- [AN #71]: Avoiding reward tampering through current-RF optimization by (30 Oct 2019 17:10 UTC; 12 points)
- Gradient hacking via actual hacking by (10 May 2023 1:57 UTC; 12 points)
- AXRP Episode 44 - Peter Salib on AI Rights for Human Safety by (28 Jun 2025 1:40 UTC; 12 points)
- 's comment on How AI Takeover Might Happen in 2 Years by (9 Feb 2025 16:41 UTC; 11 points)
- 's comment on Zoom In: An Introduction to Circuits by (10 Mar 2020 21:33 UTC; 11 points)
- 's comment on Gradations of Inner Alignment Obstacles by (21 Apr 2021 17:06 UTC; 11 points)
- Arguments for/against scheming that focus on the path SGD takes (Section 3 of “Scheming AIs”) by (5 Dec 2023 18:48 UTC; 10 points)
- 's comment on Will transparency help catch deception? Perhaps not by (5 Nov 2019 0:10 UTC; 9 points)
- The goal-guarding hypothesis (Section 2.3.1.1 of “Scheming AIs”) by (2 Dec 2023 15:20 UTC; 8 points)
- 's comment on Bayesian Evolving-to-Extinction by (15 Feb 2020 0:46 UTC; 8 points)
- 's comment on Gradations of Inner Alignment Obstacles by (25 Apr 2021 20:41 UTC; 8 points)
- Arguments for/against scheming that focus on the path SGD takes (Section 3 of “Scheming AIs”) by (EA Forum; 5 Dec 2023 18:48 UTC; 7 points)
- 's comment on Utility ≠ Reward by (11 Jan 2021 2:40 UTC; 7 points)
- The goal-guarding hypothesis (Section 2.3.1.1 of “Scheming AIs”) by (EA Forum; 2 Dec 2023 15:20 UTC; 6 points)
- Superintelligence’s goals are likely to be random by (13 Mar 2025 22:41 UTC; 6 points)
- 's comment on Transparency and AGI safety by (13 Jan 2021 18:26 UTC; 6 points)
- 's comment on Finding Deception in Language Models by (21 Aug 2024 18:11 UTC; 5 points)
- 's comment on SGD’s Bias by (19 May 2021 10:50 UTC; 5 points)
- 's comment on Richard Ngo’s Shortform by (18 Dec 2020 16:03 UTC; 4 points)
- 's comment on OpenAI Launches Superalignment Taskforce by (11 Jul 2023 19:40 UTC; 4 points)
- 's comment on Shortform by (30 Sep 2024 16:41 UTC; 4 points)
- 's comment on shortplav by (29 Dec 2021 18:09 UTC; 3 points)
- 's comment on Latent Adversarial Training by (12 Apr 2023 0:31 UTC; 3 points)
- Eliciting Credit Hacking Behaviours in LLMs by (14 Sep 2023 15:07 UTC; 3 points)
- Superintelligence’s goals are likely to be random by (EA Forum; 14 Mar 2025 1:17 UTC; 2 points)
- 's comment on David Udell’s Shortform by (28 Mar 2022 23:12 UTC; 2 points)
- 's comment on The Solomonoff Prior is Malign by (22 Oct 2020 11:46 UTC; 2 points)
- 's comment on DragonGod’s Shortform by (3 Mar 2023 15:06 UTC; 2 points)
- 's comment on alexrjl’s Shortform by (3 Feb 2023 10:41 UTC; 1 point)
- 's comment on Erich_Grunewald’s Shortform by (30 Jan 2025 19:48 UTC; 1 point)
- 's comment on Gradient hacking is extremely difficult by (5 Oct 2025 8:31 UTC; 1 point)
This is one of the scarier posts I’ve read on LW. I feel kinda freaked out by this post. It’s an important technical idea.
Gradient hacking seems important and I really didn’t think of this as a concrete consideration until this post came out.