I’d be interested in work that further sketches out a scenario in which this could occur. Some particular details that would be interesting:
- Does the agent need to have the ability to change the weights in the neural net that it is implemented in? If so, how does it get that ability?
- How does the agent deal with the fact that the outer optimization algorithm will force it to explore? Or does that not matter?
- How do we deal with the fact that gradients mostly don’t account for counterfactual behavior? Things like “fail hard if the objective is changed” don’t work as well if the gradients can’t tell that changing the objective leads to failure. That said, maybe gradients can tell that this happens, depending on the setup (see the sketch after this list).
- Why can’t gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?
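To make the counterfactual point concrete, here is a minimal toy sketch (my own illustrative setup and numbers, not something from the post): a single “objective” parameter that the task loss pushes toward 3, plus a threatened failure whenever that parameter leaves a reference value. If the threat is a discrete if-branch, backprop through whichever branch is taken only sees a constant, so gradient descent walks straight through it; a smooth penalty does show up in the gradient and keeps the parameter pinned.

```python
# Toy sketch (illustrative assumptions throughout): compare a discrete
# "fail hard if the objective is changed" check with a smooth version.
REFERENCE = 1.0   # the objective value the (hypothetical) mesa-optimizer wants kept
LR, STEPS = 5e-3, 2000

def grad_discrete(obj):
    # Loss: 0.5*(obj - 3)**2 + (100 if obj != REFERENCE else 0).
    # The if-branch is locally constant in obj, so only the task term has a gradient.
    return obj - 3.0

def grad_smooth(obj):
    # Loss: 0.5*(obj - 3)**2 + 100*(obj - REFERENCE)**2.
    # The failure term is differentiable and pulls obj back toward REFERENCE.
    return (obj - 3.0) + 200.0 * (obj - REFERENCE)

for name, grad in [("discrete check", grad_discrete), ("smooth check", grad_smooth)]:
    obj = REFERENCE
    for _ in range(STEPS):
        obj -= LR * grad(obj)            # plain gradient descent
    print(f"{name}: obj ends at {obj:.3f}")
# discrete check: obj ends at ~3.000  (the threatened failure never affected training)
# smooth check:   obj ends at ~1.010  (the objective stays pinned near REFERENCE)
```

If something like this is right, then whether gradient hacking can work at all may hinge on whether the model can implement its check in a form that actually shows up in the gradients.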
> Does the agent need to have the ability to change the weights in the neural net that it is implemented in? If so, how does it get that ability?
No, at least not in the way that I’m imagining this working. In fact, I wouldn’t really call that gradient-hacking anymore (maybe it’s just normal hacking at that point?).
For your other points, I agree that they seem like interesting directions to poke at for figuring out whether something like this works or not.
Planned summary:
This post calls attention to the problem of **gradient hacking**, where a powerful agent being trained by gradient descent could structure its computation in such a way that it causes its gradients to update it in some particular way. For example, a mesa optimizer could structure its computation to first check whether its objective has been tampered with, and if so to fail catastrophically, so that the gradients tend to point away from tampering with the objective.
Planned opinion:
I’d be interested in work that further sketches out a scenario in which this could occur. I wrote about some particular details in this comment.
> Why can’t gradient descent get rid of the computation that decides to perform gradient hacking, or repurpose it for something more useful?
Gradient descent is a very simple algorithm. It only “gets rid” of some piece of logic when doing so is the result of updating the parameters in the direction that locally reduces the loss (i.e. following the negative gradient). In the scenario of gradient hacking, we might end up with a model that maliciously prevents gradient descent from, say, changing the parameter θ_1537, by being a model that outputs a very incorrect value if θ_1537 is even slightly different from the desired value.
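As a rough sketch of how that could look mechanically (again a toy of my own construction: the anchor value, the sensitivity, and index 3 standing in for θ_1537 are all made-up assumptions): the model’s forward pass adds a term to its output that is badly wrong whenever the protected weight drifts from a hard-coded anchor, so plain gradient descent keeps that weight pinned near the anchor even when the training data would prefer a very different value.

```python
import numpy as np

# Toy sketch of the "pinned parameter" scenario. The model's own output includes
# a sabotage term that is very incorrect whenever the protected weight differs
# from ANCHOR, so the gradient on that weight mostly pushes it back toward ANCHOR.
ANCHOR = 0.7          # value the (hypothetical) mesa-optimizer wants preserved
PROTECTED_IDX = 3     # stand-in for parameter index 1537
SENSITIVITY = 30.0    # how badly the output degrades when the anchor is disturbed

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 5))
true_theta = np.array([1.0, -2.0, 0.5, -1.5, 3.0])   # what the data actually wants
y = x @ true_theta + 0.1 * rng.normal(size=256)

def predict(theta, x):
    # Ordinary linear model plus the self-sabotage term.
    return x @ theta + SENSITIVITY * (theta[PROTECTED_IDX] - ANCHOR)

def grad(theta, x, y):
    # Analytic gradient of the mean squared error for the model above.
    err = predict(theta, x) - y
    g = 2.0 * x.T @ err / len(y)
    g[PROTECTED_IDX] += 2.0 * SENSITIVITY * err.mean()
    return g

theta = rng.normal(size=5)
for _ in range(5000):                     # plain gradient descent
    theta -= 5e-4 * grad(theta, x, y)

print("trained protected weight:", round(theta[PROTECTED_IDX], 3))   # ~ANCHOR
print("value the task prefers:  ", true_theta[PROTECTED_IDX])        # -1.5
```

The open question, as in the thread above, is whether gradient descent would instead simply edit away the sabotage term itself, since the comparison against the anchor is also implemented in the very parameters being trained.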