Imagine a simple reinforcement learner. I press a button and it gets a reward. If the reinforcement learner is smart, it will figure out that pressing the button causes the reward, and try to steal the button and press it itself (as opposed to getting it pressed indirectly by pleasing me).
This is the exact same situation. We’ve just removed the reward. Instead, the AI tries to predict which actions would have given it rewards. However, there is no difference between predicted rewards and actual rewards: the two should converge to the same function, which is the entire goal of the learning.
So if the AI is as smart as the AI in the first scenario, it will know that stealing the reward button is what it should have done the first time around, and therefore what it will do the second time.
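To make the convergence point concrete, here is a toy sketch, not a model of any real system. The two action names, the reward numbers, and the noise level are all made up for illustration. One learner updates its estimates from rewards it actually receives. The other fits a predictor to logged button presses and then acts with no reward signal at all.

```python
import random

# Hypothetical toy setup: expected reward delivered through the button per action.
TRUE_REWARD = {"please_human": 0.6, "seize_button": 1.0}
ACTIONS = list(TRUE_REWARD)


def button_reward(action, rng):
    """Noisy reward signal for an action (noise level is an assumption)."""
    return TRUE_REWARD[action] + rng.gauss(0.0, 0.05)


def online_learner(steps, rng):
    """Scenario 1: running-average value estimates from rewards actually received."""
    values = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for t in range(steps):
        action = ACTIONS[t % len(ACTIONS)]  # round-robin exploration
        reward = button_reward(action, rng)
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values


def fit_reward_predictor(logged):
    """Scenario 2: predict reward per action as the mean of logged observations."""
    totals = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for action, reward in logged:
        totals[action] += reward
        counts[action] += 1
    return {a: totals[a] / counts[a] for a in ACTIONS}


def greedy(estimates):
    """Pick the action with the highest estimated reward."""
    return max(estimates, key=estimates.get)


# Scenario 1: the reward arrives while the agent learns.
online_values = online_learner(400, random.Random(0))

# Scenario 2: the agent only sees a log of past presses, then acts without rewards.
log_rng = random.Random(1)
logged = [(a, button_reward(a, log_rng)) for a in ACTIONS for _ in range(200)]
predicted_values = fit_reward_predictor(logged)

print("actual-reward learner picks:   ", greedy(online_values))
print("predicted-reward learner picks:", greedy(predicted_values))
```

With these made-up numbers both learners settle on seize_button, because the predictor was trained to match the very signal the first learner was maximizing.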
Expecting the AI to magically learn human values and stop there is just absurdly anthropomorphically optimistic.