I’d be interested in work that further sketches out a scenario in which this could occur. Some particular details that would be interesting:
- Does the agent need to have the ability to change the weights in the neural net that it is implemented in? If so, how does it get that ability?
- How does the agent deal with the fact that the outer optimization algorithm will force it to explore? Or does that not matter?
- How do we deal with the fact that gradients mostly don’t account for counterfactual behavior? Things like “fail hard if the objective is changed” don’t work as well if the gradients can’t tell that changing the objective leads to failure. That said, maybe gradients can tell that this happens, depending on the setup (see the sketch after this list).
- Why can’t gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?
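To make the counterfactual point concrete, here is a minimal toy sketch (my own illustrative setup and numbers, not something from the post): a single “objective” parameter that the task loss pushes toward 3, plus a threatened failure whenever that parameter leaves a reference value. If the threat is a discrete if-branch, backprop through whichever branch is taken only sees a constant, so gradient descent walks straight through it; a smooth penalty does show up in the gradient and keeps the parameter pinned.

```python
# Toy sketch (illustrative assumptions throughout): compare a discrete
# "fail hard if the objective is changed" check with a smooth version.
REFERENCE = 1.0   # the objective value the (hypothetical) mesa-optimizer wants kept
LR, STEPS = 5e-3, 2000

def grad_discrete(obj):
    # Loss: 0.5*(obj - 3)**2 + (100 if obj != REFERENCE else 0).
    # The if-branch is locally constant in obj, so only the task term has a gradient.
    return obj - 3.0

def grad_smooth(obj):
    # Loss: 0.5*(obj - 3)**2 + 100*(obj - REFERENCE)**2.
    # The failure term is differentiable and pulls obj back toward REFERENCE.
    return (obj - 3.0) + 200.0 * (obj - REFERENCE)

for name, grad in [("discrete check", grad_discrete), ("smooth check", grad_smooth)]:
    obj = REFERENCE
    for _ in range(STEPS):
        obj -= LR * grad(obj)            # plain gradient descent
    print(f"{name}: obj ends at {obj:.3f}")
# discrete check: obj ends at ~3.000  (the threatened failure never affected training)
# smooth check:   obj ends at ~1.010  (the objective stays pinned near REFERENCE)
```

If something like this is right, then whether gradient hacking can work at all may hinge on whether the model can implement its check in a form that actually shows up in the gradients.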
> Does the agent need to have the ability to change the weights in the neural net that it is implemented in? If so, how does it get that ability?
No, at least not in the way that I’m imagining this working. In fact, I wouldn’t really call that gradient-hacking anymore (maybe it’s just normal hacking at that point?).
For your other points, I agree that they seem like interesting directions to poke at for figuring out whether something like this works or not.
Planned summary:
This post calls attention to the problem of **gradient hacking**, where a powerful agent being trained by gradient descent could structure its computation in such a way that it causes its gradients to update it in some particular way. For example, a mesa optimizer could structure its computation to first check whether its objective has been tampered with, and if so to fail catastrophically, so that the gradients tend to point away from tampering with the objective.
Planned opinion:
I’d be interested in work that further sketches out a scenario in which this could occur. I wrote about some particular details in this comment.
> Why can’t gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?
Gradient descent is a very simple algorithm. It only “gets rid” of some piece of logic when doing so is the result of updating the parameters in the direction that locally reduces the loss (i.e. following the negative gradient). In the scenario of gradient hacking, we might end up with a model that maliciously prevents gradient descent from, say, changing the parameter θ_1537, by being a model that outputs a very incorrect value if θ_1537 is even slightly different from the desired value.
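As a rough sketch of how that could look mechanically (again a toy of my own construction: the anchor value, the sensitivity, and index 3 standing in for θ_1537 are all made-up assumptions): the model’s forward pass adds a term to its output that is badly wrong whenever the protected weight drifts from a hard-coded anchor, so plain gradient descent keeps that weight pinned near the anchor even when the training data would prefer a very different value.

```python
import numpy as np

# Toy sketch of the "pinned parameter" scenario. The model's own output includes
# a sabotage term that is very incorrect whenever the protected weight differs
# from ANCHOR, so the gradient on that weight mostly pushes it back toward ANCHOR.
ANCHOR = 0.7          # value the (hypothetical) mesa-optimizer wants preserved
PROTECTED_IDX = 3     # stand-in for parameter index 1537
SENSITIVITY = 30.0    # how badly the output degrades when the anchor is disturbed

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 5))
true_theta = np.array([1.0, -2.0, 0.5, -1.5, 3.0])   # what the data actually wants
y = x @ true_theta + 0.1 * rng.normal(size=256)

def predict(theta, x):
    # Ordinary linear model plus the self-sabotage term.
    return x @ theta + SENSITIVITY * (theta[PROTECTED_IDX] - ANCHOR)

def grad(theta, x, y):
    # Analytic gradient of the mean squared error for the model above.
    err = predict(theta, x) - y
    g = 2.0 * x.T @ err / len(y)
    g[PROTECTED_IDX] += 2.0 * SENSITIVITY * err.mean()
    return g

theta = rng.normal(size=5)
for _ in range(5000):                     # plain gradient descent
    theta -= 5e-4 * grad(theta, x, y)

print("trained protected weight:", round(theta[PROTECTED_IDX], 3))   # ~ANCHOR
print("value the task prefers:  ", true_theta[PROTECTED_IDX])        # -1.5
```

The open question, as in the thread above, is whether gradient descent would instead simply edit away the sabotage term itself, since the comparison against the anchor is also implemented in the very parameters being trained.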