I think part of what you’re getting at is what I’ve called The alignment stability problem. You can see my thoughts there, including links to related work.
Looking at the Google Scholar link in this article, it seems that what I’m describing most closely resembles “motivation hacking”, except that, in my thought experiment, the agent doesn’t modify its own reward system. Instead, it selects arbitrary actions and anticipates whether their reward is coincidentally more satisfying than the base objective. This allows it to perform this attack even if it’s in the training environment. Further, this sort of “attack” may be a component of the self-analysis an agent may do in pursuit of the base objective, so at no point does the agent need to exhibit deceptive or antagonistic behavior to pursue this vulnerability. It may be that an agent pursuing this vulnerability is fundamentally the same as an agent pursuing the base objective.