This post states the problem of gradient hacking. It is valuable in that the problem is far from obvious and, if real, very dangerous. On the other hand, the presentation doesn’t go into enough detail, and so leaves gradient hacking open to attacks and confusion. Thus, instead of just reviewing this post, I would like to clarify certain points, interweaving my critiques of the way gradient hacking was initially stated, and explaining why I consider this problem so important.
(Caveat: I’m not claiming that any of my objections are unknown to Evan. We’re working together on searching for experimental evidence of gradient hacking, and this review is heavily influenced by this collaboration.)
What is gradient hacking?
Evan defines it as:
the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way.
So we have a trained mesa-optimizer that is deceptively aligned (its mesa-objective differs from the base-objective it is trained on). This deceptive system finds a way to influence the gradient generated during training, and thus can basically ensure it never becomes inner aligned (among other things).
Before examining how exactly this could be possible, we should think a bit more about what it implies.
If the deceptive model has arbitrary access to the gradient, then from that point on, the base-objective has only minimal influence on the training. Some influence remains because we probably want to test the end result, and thus it should work well with the base-objective. But that’s pretty much the only constraint left.
It could also pretty much deal with deception detectors, because it can make itself undetectable:
That is, it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. Perhaps the new model exhibits the deceptive behavior not because it’s trying to act deceptive, but just because it’s a hard-coded part of its algorithm, or because it has some bad proxy, or any other defect that might be more difficult to detect than outright deception.
To put it pithily: if gradient hacking happens, we’re fucked.
How could it happen, though? Well, that’s actually two distinct questions: how could a model gradient hack, and how could training create a model which gradient hacks. The post mostly focuses on the first one.
How could a model gradient hack?
The first example comes from a quoted footnote of Risks from Learned Optimization:
Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched.
This implies that the model is able to check for changes in itself and in its objective, such that changing this objective would make it work worse. An analogous version would be for the model to underperform except if it’s changed in the direction it likes, pushing the gradient in that direction.
Still, it might feel a bit weird to think about a model that can influence its training without somehow hacking the computer it is running on. After all, the training is external, right? One fundamental but easy-to-miss insight from the post is that the model can influence the loss surface, and thus the gradient (which comes from this surface).
First, note that since gradient descent is a local optimization process, it doesn’t have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants.
So a model could gradient hack because its output, and how it computes its output, influence the local loss surface around it, which means it could in theory make the gradient go its way. I feel like this is crucial, and it’s slightly problematic that the post doesn’t put more emphasis on it.
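To make the loss-surface point concrete, here is a minimal numerical sketch. This is my own toy construction, not a mechanism from the post: I assume a model with an ordinary task parameter w and a hypothetical “mesa-objective” parameter m, wired so that performance degrades sharply whenever m drifts from a protected value M0. That brittleness term is exactly the kind of ridge described above, so the gradient on m points straight back toward M0.

```python
import numpy as np

# Toy construction (mine, not the post's): a model with a task parameter w
# and a "mesa-objective" parameter m. Its computation is assumed to be
# wired so that performance degrades sharply whenever m drifts from the
# value M0 it "wants" to protect, creating a ridge in the loss surface.
M0 = 1.0   # hypothetical protected mesa-objective value
K = 50.0   # hypothetical brittleness strength

def loss(params, x, y):
    w, m = params
    base = np.mean((w * x - y) ** 2)   # ordinary task loss
    ridge = K * (m - M0) ** 2          # self-imposed brittleness
    return base + ridge

def num_grad(params, x, y, eps=1e-6):
    # Central-difference gradient, for illustration only.
    g = np.zeros_like(params)
    for i in range(len(params)):
        up, down = params.copy(), params.copy()
        up[i] += eps
        down[i] -= eps
        g[i] = (loss(up, x, y) - loss(down, x, y)) / (2 * eps)
    return g

x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                            # true task: y = 2x
params = np.array([0.5, 1.3])          # m has been pushed off M0
g = num_grad(params, x, y)
# Descent on w improves the task fit, while the gradient on m points it
# straight back toward M0: gradient descent restores the protected value
# rather than moving it toward anything the base objective prefers.
```

Nothing here requires hacking the training loop from outside: the ridge lives entirely inside the model’s own computation, and the optimizer follows it.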
Note also that this hardly settles the question. Let’s say the model is such that if its mesa-objective is changed, it performs badly. Then we probably have a form of gradient hacking. But what tells us that the change at this step will not slightly undermine the very mechanism that protects the mesa-objective? Unwrapping this thought, it’s harder to conceptualize how a model could gradient hack forever. It seems to involve some sort of fixed point or self-correcting mechanism.
That complexity should probably make us update slightly towards the improbability of “sustainable” gradient hacking—still not enough to dismiss it, though. Understanding the requirements for such sustainable gradient hacking looks to me like the most important research question on this topic (which makes sense, since I’m working on it after all…).
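The self-undermining worry can also be illustrated with a toy sketch, again my own construction under a strong assumption: suppose the model’s self-protection takes the form of a penalty k·(m − M0)² keeping a hypothetical mesa-objective parameter m near a protected value M0. If the penalty strength k is itself part of the trainable parameters, its gradient is (m − M0)² ≥ 0, so every step taken while m is off its protected value also weakens the protection:

```python
import numpy as np

M0 = 1.0  # hypothetical protected mesa-objective value

def loss(params, x, y):
    # params = [task weight w, mesa parameter m, protection strength k]
    w, m, k = params
    return np.mean((w * x - y) ** 2) + k * (m - M0) ** 2

def num_grad(params, x, y, eps=1e-6):
    # Central-difference gradient, for illustration only.
    g = np.zeros_like(params)
    for i in range(len(params)):
        up, down = params.copy(), params.copy()
        up[i] += eps
        down[i] -= eps
        g[i] = (loss(up, x, y) - loss(down, x, y)) / (2 * eps)
    return g

x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x
params = np.array([0.5, 1.3, 50.0])  # m perturbed off M0
for _ in range(200):
    params -= 0.01 * num_grad(params, x, y)

w, m, k = params
# m snaps back to M0 (the "hack" works locally), but k has strictly
# decreased: the gradient on k is (m - M0)^2 >= 0, so the protection
# mechanism is slightly eroded every time m gets perturbed. Sustaining
# gradient hacking would need something that repairs k.
```

In this run the erosion is tiny because m returns to M0 almost immediately, but under persistent pressure on m, k would keep shrinking, which is one way to see why a fixed point or self-correcting mechanism seems needed.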
How could training create a model which gradient hacks?
What about this second question? Another issue with the post is the lack of clarification around it: how could training create a model which gradient hacks? I’m obviously not asking for a full-fledged answer, as this is still research in progress. But there are easy-to-make mistakes that I believe Evan could have debunked in the post.
One such mistake is to get confused about how the model could “decide” to gradient hack. After all, if the computation is deterministic, it’s not like the model has a decision to make, right? This question can be dissolved in the same way as free will (since it’s basically about the free will of neural networks): (spoiler for the solution to free will)
the model is completely determined by its parameters, but its parameters are part of itself.
The less philosophical approach to this problem is to notice that the appearance of gradient hacking would probably come from the training stumbling on a gradient hacker. There is no need to worry about how the model could turn itself into a gradient hacker; the issue is about how training could bump into a gradient hacker.
Another mistake would be to argue that what is called gradient hacking here is indistinguishable, in terms of training behavior, from a model that doesn’t gradient hack but happens to have the same gradient. This misses a big part of Evan’s viewpoint: that looking inside the model is fundamental for alignment. Here the question is whether we can catch the model doing some form of recognizable gradient hacking, because that allows us to infer more about what it might do (and that it will not be inner aligned). Whereas if we can be relatively sure that a model is not gradient hacking, there is less to worry about, at least in terms of deceptive alignment.
This is almost obvious for people who have talked a lot with Evan about these issues, or who are familiar with his work; but I feel that it’s only implicit in this post, which might breed confusion. A new but very observant reader could infer it from the proposed solution (discussed below) using transparency tools, sure. I still think the post would be stronger if that possible misinterpretation was debunked explicitly.
As for further research, investigating the appearance of gradient hacking through common ML training procedures is crucial. But it probably requires a better understanding of the mechanisms we’re looking for, and thus a convincing answer (at least an informal one) to the first question.
Possible solution
Evan proposes a solution for gradient hacking in one case:
So, that’s the problem—how do we solve it? Well, there’s a simple solution in the case above, which is just to run your transparency tools throughout training—not just at the end—so that you can catch the deception as soon as it happens.
This is hardly revolutionary, and doesn’t say how to actually catch deception (although most of Evan’s research is focused on that). Yet I don’t think much more can be done without understanding the mechanisms of gradient hacking in more detail, and how they appear through training. That is, without solving some significant part of the two previous questions.
Wrapping-up
All in all, I think this post is very important, because the alignment issue it presents is novel, counterintuitive, and hard to dismiss without more research, once one thinks about it for long enough. That being said, this post is definitely a short presentation of an idea, written quickly to draw attention to the issue. Gradient hacking warrants a longer post that digs deeper into the details of the problem.
The less philosophical approach to this problem is to notice that the appearance of gradient hacking would probably come from the training stumbling on a gradient hacker.
[EDIT: you may have already meant it this way, but...] The optimization algorithm (e.g. SGD) doesn’t need to stumble upon the specific logic of gradient hacking (which seems very unlikely). I think the idea is that a sufficiently capable agent (with a goal system that involves our world) instrumentally decides to use gradient hacking, because otherwise the agent will be modified in a suboptimal manner with respect to its current goal system.
Actually, I did mean that SGD might stumble upon gradient hacking. Or, to be a bit more realistic, make the model slightly deceptive, at which point decreasing the deceptiveness a bit makes the model worse and increasing it a bit makes the model better at the base-objective, so there is a push towards deceptiveness, until the model is deceptive enough to use gradient hacking in the way you mention.
Does that make more sense to you?
I think that if SGD makes the model slightly deceptive it’s because it made the model slightly more capable (better at general problem solving etc.), which allowed the model to “figure out” (during inference) that acting in a certain deceptive way is beneficial with respect to the mesa-objective.
This seems to me a lot more likely than SGD creating specifically “deceptive logic” (i.e. logic that can’t do anything generally useful other than finding ways to perform better on the mesa-objective by being deceptive).
I agree with your intuition, but I want to point out again that beyond some initial useless amount, “deceptive logic” is probably a pretty useful thing in general for the model, because it helps improve performance as measured through the base-objective.
SGD making the model more capable seems the most obvious way to satisfy the conditions for deceptive alignment.
“deceptive logic” is probably a pretty useful thing in general for the model, because it helps improve performance as measured through the base-objective.
But you can similarly say this for the following logic: “check whether 1+1<4 and if so, act according to the base objective”. Why is SGD more likely to create “deceptive logic” than this simpler logic (or any other similar logic)?
[EDIT: actually, this argument doesn’t work in a setup where the base objective corresponds to a sufficiently long time horizon during which it is possible for humans to detect misalignment and terminate/modify the model (in a way that is harmful with respect to the base objective).]
So my understanding is that deceptive behavior is a lot more likely to arise from general-problem-solving logic, rather than SGD directly creating “deceptive logic” specifically.
Hum, I would say that your logic is probably redundant, and thus might end up being removed for simplicity reasons? Whereas I expect deceptive logic to include very useful things like knowing how the optimization process works, which would definitely help achieve better performance.
But to be honest, how SGD could create gradient hacking (if it’s even possible) is a completely open research problem.
My point was that there’s no reason that SGD will create specifically “deceptive logic” because “deceptive logic” is not privileged over any other logic that involves modeling the base objective and acting according to it. But I now think this isn’t always true—see the edit block I just added.