Thanks for your reply! I agree that I might be a little overly dismissive of the loss landscape frame. I also mostly agree with your point about convergent gradient hackers existing at minima of the base objective; I briefly commented on this (although not in the section on coupling):
Here we can think of the strong coupling regime as being a local minimum in the loss landscape, such that any step to remove the gradient hacker locally leads to worse performance. In the weak coupling regime, the loss will still increase in the directions which directly hurt performance on the mesa-objective, but gradient descent is still able to move downhill in a direction that erases the gradient hacker.
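To make the two regimes concrete, here is a toy two-parameter picture (entirely my own construction, not from the post), with w1 standing in for the model's capabilities and w2 for the gradient hacker:

```python
# Strong coupling: performance depends on the hacker through the product w1*w2,
# so any local step that shrinks w2 (erasing the hacker) raises the loss.
strong = lambda w1, w2: (w1 * w2 - 1.0) ** 2
assert strong(1.0, 1.0) == 0.0               # a local minimum with the hacker intact
assert strong(1.0, 0.9) > strong(1.0, 1.0)   # erasing the hacker locally hurts

# Weak coupling: the hacker contributes a separable term, so gradient descent
# can move downhill in exactly the direction that erodes it.
weak = lambda w1, w2: (w1 - 1.0) ** 2 + w2 ** 2
w1, w2 = 0.5, 1.0
for _ in range(200):
    g1, g2 = 2 * (w1 - 1.0), 2 * w2          # analytic gradients of `weak`
    w1, w2 = w1 - 0.1 * g1, w2 - 0.1 * g2

print(round(w2, 4))  # 0.0: the hacker component has been erased
```

Obviously real couplings live in a much higher-dimensional landscape, but the qualitative distinction (no local downhill escape vs. an available downhill direction that removes the hacker) is the same.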
(I also agree with your rough sketch and had some similar notes which didn’t make it into the post)
I think I am still a little unsure how to mix the two frames of 'the gradient hacker has control' and 'it's all deterministic gradient descent plus quirks', which I guess is just the free will debate. It also seems like you could have a gradient hacker arise on a slope (i.e. not at a local minimum) and then use its gradient hacking abilities to manipulate how it is further modified, but this might just be a further case of a convergent gradient hacker.
The loss landscape frame doesn't seem to tell us much about how a gradient hacker would act or modify itself. From the gradient hacker's perspective there doesn't seem to be much difference between manipulating the gradients via its outputs/cognition and manipulating them via quirks/stop-gradients. It still seems like a non-convergent gradient hacker would require coupling to get started in the first place: the model needs to be pretty smart to know how to exploit these quirks/bugs, and I think that has to come from some underlying general capability.
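For concreteness, here is a minimal sketch of what the quirks/stop-gradients mechanism looks like mechanically (my own toy construction in PyTorch, not anything from the post): a `detach()` acts as a stop-gradient, so descent exerts no pressure at all on the shielded parameter.

```python
import torch

theta = torch.tensor(3.0, requires_grad=True)  # weight the hacker 'protects'
phi = torch.tensor(1.0, requires_grad=True)    # ordinary weight

# The loss sees theta only through a detached copy, which cuts theta
# out of the backward graph entirely.
loss = (phi - 2.0) ** 2 + theta.detach() ** 2
loss.backward()

print(phi.grad)    # tensor(-2.): phi feels normal optimization pressure
print(theta.grad)  # None: no gradient flows to the protected weight
```

The point of the example is just that, from inside the network, "route the loss through a stop-gradient" and "shape the outputs so the gradient comes out a certain way" both cash out as controlling what backpropagation sees.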
One counter to this could be a non-convergent gradient hacker that can 'look at' the information the underlying capabilities have (e.g. it can search for information on how to gradient hack). But even here it seems that some parts of the gradient hacker will have optimization pressure on them from gradient descent, and so without coupling they would get eroded.
In my view, the determinism isn't actually the main takeaway from realizing that the loss landscape is stationary; the categorization is. I would also argue that there's a huge practical difference between having mesa-objectives that happen to be local minima of the base objective and counting on quirks/stop-gradients. For one, the former is severely constrained in the kinds of objectives it can have (i.e. we can imagine that lots of mesa-objectives, especially more blatant ones, might be harder to couple), which totally rules out some more extreme conceptions of gradient hackers that completely ignore the base objective. Also, convergent gradient hackers are only dangerous out of distribution: in theory, if you had a training distribution that covered 100% of the input space, convergent gradient hackers would cease to be a thing (being really suboptimal on distribution is bad for the base objective).