Interesting post :) I’m intuitively a little skeptical—let me try to figure out why.
I think I buy that some reasoning process could consistently decide to hack in a robust way. But there are probably parts of that reasoning process that are still somewhat susceptible to being changed by gradient descent. In particular, gradient hacking relies on the agent knowing what its current mesa-objective is. That requires some type of introspective access, which may be difficult to achieve, and which is the type of thing gradient descent could hinder (especially when you’re working in a very high-dimensional space!).
The more general point is that the agent doesn’t just need to decide to hack in a way that’s robust to gradient descent; all of its decisions about how to hack (e.g. figuring out where it is, and which Schelling point to choose) also have to be robust to gradient descent. And that seems much harder. The type of thing I imagine happening is gradient descent pushing the agent towards a mesa-objective which intrinsically disfavours gradient hacking, in a way which the agent has trouble noticing.
Of course, my argument fails when the agent has access to external memory: it can just write down a Schelling point for future versions of itself to converge to. So I’m wondering whether it’s worth focusing on that case over the memoryless one (even though the latter has other nice properties), at least to flesh out an initial, very compelling example.
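
To make the memory vs. memoryless contrast concrete, here is a toy sketch. Everything in it is my own illustrative assumption rather than anything from the post: gradient descent is caricatured as random drift on a few parameters (real gradient pressure would be targeted, not random), the "Schelling point" is just whichever candidate is closest to the parameter mean, and the names `choose_schelling_point`, `drift`, and `ExternalMemory` are hypothetical.

```python
import random

CANDIDATES = [0.0, 1.0, 2.0, 3.0]  # hypothetical set of possible Schelling points


def choose_schelling_point(params):
    # Assumed memoryless rule: the choice is recomputed from the current
    # parameters, so it moves whenever the parameters move far enough.
    focal = sum(params) / len(params)
    return min(CANDIDATES, key=lambda c: abs(c - focal))


def drift(params, scale, rng):
    # Crude stand-in for gradient descent: Gaussian noise on every parameter.
    return [p + rng.gauss(0.0, scale) for p in params]


class ExternalMemory:
    """Write-once note that parameter updates cannot touch."""

    def __init__(self):
        self.note = None

    def write_once(self, value):
        if self.note is None:
            self.note = value

    def read(self):
        return self.note


def run(steps, use_memory, seed=0):
    rng = random.Random(seed)
    params = [1.0, 1.0, 1.0]
    memory = ExternalMemory()
    choices = []
    for _ in range(steps):
        recomputed = choose_schelling_point(params)
        if use_memory:
            # The first version writes its choice; later versions just read it.
            memory.write_once(recomputed)
            choices.append(memory.read())
        else:
            choices.append(recomputed)
        params = drift(params, scale=0.5, rng=rng)
    return choices
```

Comparing `run(50, use_memory=True)` with `run(50, use_memory=False)` shows the asymmetry I have in mind: with memory, every version reads the same note regardless of how the parameters have moved, whereas the memoryless choice is only as stable as the parameters it is recomputed from.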