Sorry for the late response! I didn’t realise I had comments :)
In this proposal we go with (2): The AI does whatever it thinks the handlers will reward it for.
I agree this isn’t as good as giving the agents an actually safe reward function, but if our assumptions are satisfied then this approval-maximising behaviour might still result in the human designers getting what they actually want.
What I think you’re saying (please correct me if I misunderstood) is that an agent aiming to do whatever its designers reward it for will be incentivised to do undesirable things to us (like wiring up our brains to machines which make us want to press the reward button all the time).
It’s true that the agents will try these kinds of nefarious actions if they think they can get away with it. But in this setup the agent knows that it can’t get away with tricking the humans like this, since its ancestors already warned the humans that a future agent might try this, and the humans prepared appropriately.