Do you want to do a ton of super addictive drugs? Reward is not the optimization target. It’s also not supposed to be the optimization target. A model that reliably executes the most rewarding possible action available will wirehead as soon as it’s able.
Are you bringing up wireheading to answer yes or no to my question (of whether RL is more prone to gradient hacking)? It sounds like you’re suggesting a no, but I think it actually supports the idea that RL might be prone to gradient hacking. The AI, like me, avoids wireheading, and so will never be modified by gradient descent towards wireheading, because gradient descent doesn’t know anything about wireheading until it’s been tried. So that is an example of gradient hacking, isn’t it? Unlike in a supervised learning setup, where gradient descent ‘knows’ about all possible options and will modify any subagents that avoid giving the right answer.
So am I a gradient hacker whenever I just say no to drugs?