Take a reinforcement learning AI that we want to safely move a strawberry onto a plate. A human sits nearby and provides a reward based on inspecting the AI’s behaviour.
As it stands, this setup is completely vulnerable to reward hacking. The reward is not provided for safely moving the strawberry; instead, it is provided when the human judges that the task has been accomplished and presses a button. Taking control of the human or of the button is likely to be possible for a superintelligent AI, and, as written, that is what this reward function would mandate.
I think this claimed vulnerability is invalid because reward is not the optimization target. In standard RL training, reward is a reinforcement signal that selects which of the policy’s computations get strengthened by gradient updates; it is not a quantity the trained agent necessarily represents, cares about, or seeks to maximize. Saying the reward function “mandates” taking control of the button conflates the training signal with the agent’s learned motivations.
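Here is a minimal sketch of the mechanical point, assuming a vanilla REINFORCE learner on a two-armed bandit (the environment, arm labels, and numbers are all hypothetical, just to make the update concrete). Reward enters only as a scalar weighting the gradient of the chosen action’s log-probability; the policy itself never observes or models the reward:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)          # policy parameters for a two-armed bandit
reward_per_arm = [0.0, 1.0]   # arm 1 = "the human presses the button" (hypothetical)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(1000):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    reward = reward_per_arm[action]
    # REINFORCE update: reward appears ONLY as a scalar multiplying
    # grad log pi(action). The policy has no input, state, or weight
    # that represents "reward"; reward just selects which computations
    # get reinforced.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    logits += 0.1 * reward * grad_log_pi

print(softmax(logits))  # probability mass has shifted toward the rewarded arm
```

Nothing in this update requires the learned policy to contain a concept of “reward” at all. Whether a more capable agent trained this way would generalize to button-seeking is an empirical question about what cognition the updates instill, not something the reward function mandates by definition.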
(I know this is an old post and your views may have changed, but I’m posting this comment for anyone who comes by later.)