From a zoomed-out perspective, the model is not modifying the loss landscape. This frame, however, does not give us a useful way of thinking about how gradient hacking might occur and how to avoid it.
I think that the main value of the frame is to separate out the different potential ways gradient hacking can occur. I’ve noticed that in discussions without this distinction, it’s very easy to equivocate between the types, which leads to frustrating conversations where people fundamentally disagree without realizing it (i.e. someone might be talking about some mins of the base objective being dangerous, while the other person is talking about the model actually breaking SGD and avoiding mins entirely). Illusion of transparency and all that.
For instance, when you mention that “[continuously failing] gradient hacking will in general lead to worse performance on the base-objective”—this is only true for non-convergent gradient hackers! I can see where you’re coming from; most mesaobjectives have mins that don’t really line up with the base objective, so a lot of the time moving towards a min of the mesaobjective hurts performance on the base objective. In some sense, convergent gradient hackers are precisely those whose mins happen to line up with mins of the base objective. However, even though in some intuitive sense convergent mesaobjectives are very sparse in the space of all mesaobjectives, it’s very likely that convergent mesaobjectives are the ones actually likely to exist in the real world (also, as a data point, convergent gradient hackers are what Evan considers “gradient hacking”).
In fact, I’d argue that coupling (assuming, for the sake of argument, that it were possible) would imply a convergent gradient hacker, though the reverse implication is not necessarily true. A rough and not very rigorous sketch of the argument: if the model remains coupled across training, then at any point in training, any epsilon change which hurts the mesaobjective must also hurt the base objective, so the inner product of the mesa gradient and the base gradient must always be nonnegative; thus, if SGD converges, it must converge to a min or saddle point of the mesaobjective too. (The hallmark of non-convergent gradient hackers, by contrast, is that they manage to find a min of the mesaobjective somewhere where the base gradient is nonzero.)
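To make the sketch concrete, here’s a toy 1D numerical illustration (the specific objective functions are made up purely for this example): when the base and mesa gradients always have a nonnegative inner product, gradient descent on the base objective alone also lands at a stationary point of the mesa-objective, whereas a non-convergent mesa-objective has its min at a point where the base gradient is nonzero.

```python
import numpy as np

# Toy 1D objectives (purely illustrative choices):
#   base:                f(x) = x^2,       grad 2x
#   coupled mesa:        g(x) = 2x^2,      grad 4x     -> f'(x)*g'(x) = 8x^2 >= 0 everywhere
#   non-convergent mesa: h(x) = (x - 1)^2, grad 2(x-1) -> min at x=1, where f'(1) = 2 != 0
f_grad = lambda x: 2 * x
g_grad = lambda x: 4 * x
h_grad = lambda x: 2 * (x - 1)

# Coupling check: the inner product of base and coupled-mesa gradients is never negative,
# while the non-convergent mesa-objective conflicts with the base somewhere (here, on (0, 1)).
xs = np.linspace(-2.0, 2.0, 401)
assert np.all(f_grad(xs) * g_grad(xs) >= 0)
assert np.any(f_grad(xs) * h_grad(xs) < 0)

# Run plain gradient descent on the *base* objective only.
x = 2.0
for _ in range(500):
    x -= 0.1 * f_grad(x)

print(abs(g_grad(x)))  # coupled mesa gradient: also ~0 at the base min
print(abs(h_grad(x)))  # non-convergent mesa gradient: ~2, since its min is elsewhere
```

The coupled mesa-objective is “dragged along” to its own min by descent on the base objective, which is exactly the convergent case.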
Thanks for your reply! I agree that I might be a little overly dismissive of the loss landscape frame. I mostly agree with your point about convergent gradient hackers existing at minima of the base-objective; I briefly commented on this (although not in the section on coupling):
Here we can think of the strong coupling regime as being a local minimum in the loss landscape, such that any step to remove the gradient hacker leads locally to worse performance. In the weak coupling regime, the loss landscape will still increase in the directions which directly hurt performance on the mesa-objective, but gradient descent is still able to move downhill in a direction that erases the gradient hacker.
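A toy two-parameter sketch of this geometry (the loss functions are hypothetical, chosen only to illustrate the two regimes): under strong coupling the hacker parameter sits inside a local minimum and survives gradient descent, while under weak coupling descent still has a downhill direction that erases it.

```python
import numpy as np

def descend(grad, theta, lr=0.1, steps=500):
    """Plain gradient descent; theta = (c, h) = (capability, hacker strength)."""
    theta = np.array(theta, dtype=float)
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# Strong coupling (illustrative): the hacker's contribution is entangled with
# performance, L = (c + h - 1)^2, so any step that shrinks h alone raises the loss.
strong_grad = lambda t: np.array([2 * (t[0] + t[1] - 1), 2 * (t[0] + t[1] - 1)])

# Weak coupling (illustrative): the hacker only adds a separable penalty,
# L = (c - 1)^2 + h^2, so descent can erase h while still improving on c.
weak_grad = lambda t: np.array([2 * (t[0] - 1), 2 * t[1]])

c_s, h_s = descend(strong_grad, (0.0, 0.5))
c_w, h_w = descend(weak_grad, (0.0, 0.5))
print(h_s)  # hacker persists at the minimum (c + h = 1 with h > 0)
print(h_w)  # hacker erased (h -> 0)
```

In the strong case the minimum is a whole valley (c + h = 1), and descent settles wherever it entered the valley, hacker intact; in the weak case the hacker direction is itself uphill, so it gets eroded.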
(I also agree with your rough sketch and had some similar notes which didn’t make it into the post)
I think I am still a little unsure how to mix the two frames of ‘the gradient hacker has control’ and ‘it’s all deterministic gradient descent + quirks’, which I guess is just the free will debate. It also seems like you could have a gradient hacker arise on a slope (which isn’t the same as being non-convergent) and then use its gradient hacking abilities to manipulate how it is further modified, but this might just be a further case of a convergent gradient hacker.
The loss landscape frame doesn’t seem to really tell us how a gradient hacker would act or modify itself. From the ‘gradient hacker’ perspective there doesn’t seem to be much difference between manipulating the gradient via its outputs/cognition and manipulating the gradient via quirks/stop-gradients. It still seems like a non-convergent gradient hacker would also require coupling to get started in the first place; the model needs to be pretty smart to know how to use these quirks/bugs, and I think this needs to come from some underlying general capability.
One counter to this could be that there is a non-convergent gradient hacker which can ‘look at’ the information that the underlying capabilities have (e.g. the non-convergent gradient hacker can look for information on how to gradient hack). But it still seems that some parts of the gradient hacker will have optimization pressure on them from gradient descent, and so without coupling they would get eroded.
From my view, the determinism isn’t actually the main takeaway from realizing that the loss landscape is stationary; the categorization is. Also, I would argue that there’s a huge practical difference between having mesaobjectives that happen to be local mins of the base objective and counting on quirks/stop-gradients: for one, the former is severely constrained in the kinds of objectives it can have (i.e. we can imagine that lots of mesaobjectives, especially more blatant ones, might be harder to couple), which rules out some more extreme conceptions of gradient hackers that completely ignore the base objective. Also, convergent gradient hackers are only dangerous out of distribution—in theory, if you had a training distribution that covered 100% of the input space, then convergent gradient hackers would cease to be a thing (being really suboptimal on distribution is bad for the base objective).