Nice post. Approval reward seems like it helps explain a lot of human motivation and behavior.
I’m wondering whether approval reward would really be a safe source of motivation in an AGI though. From the post, it apparently comes from two sources in humans:
Internal: There’s an internal approval reward generator that rewards you for doing things that other people would approve of, even if no one is there. “Intrinsically motivated” sounds very robust, but I’m concerned that this just means the reward comes from an internal module that can be gamed.
External: Someone sees you do something and you get approval.
In each case, it seems the person is generating behaviors while an equally strong/robust reward classifier, either internal or external, evaluates them, so the system is hard to game.
The internal classifier is hard to game because we can’t edit our minds.
And other people are hard to fool. For example, there are fake billionaires, but they are usually found out and then receive negative approval, so it’s not worth it.
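(To make the two-source structure I have in mind concrete, here is a minimal toy sketch in Python. Everything here is my own hypothetical illustration, not anything from the post: the function names, the action strings, and the numbers are all made up. The point is just that total approval reward is the sum of an internal predictor of others' approval and approval actually delivered by observers, and that an agent which could rewrite the internal predictor, or fool the observers, would get the reward without the approvable behavior.)

```python
# Toy sketch of approval reward as described above (hypothetical, illustrative only).

def internal_approval_model(action: str) -> float:
    """Stand-in for the learned internal predictor of what others would approve of."""
    approved = {"help stranger": 1.0, "share credit": 0.8}
    return approved.get(action, -0.5)  # unrecognized actions score mildly negative

def external_approval(action: str, observers: list[str]) -> float:
    """Approval actually delivered by people who witness the action.
    Toy assumption: observers react the way the internal model predicts."""
    return len(observers) * internal_approval_model(action)

def approval_reward(action: str, observers: list[str]) -> float:
    # The worry in this comment: an agent able to edit itself could inflate
    # internal_approval_model, or deceive observers, instead of acting approvably.
    return internal_approval_model(action) + external_approval(action, observers)

print(approval_reward("help stranger", observers=["Alice"]))      # 2.0
print(approval_reward("exaggerate achievements", observers=[]))   # -0.5
```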
But I’m wondering whether an AGI with an approval reward would modify itself to reward hack, or figure out how to fool humans in clever ways (like the RLHF robot arm), to get more approval.
Though maybe implementing an approval reward in an AI gets you most of the alignment you need and it’s robust enough.
I definitely have strong concerns that Approval Reward won’t work on AGI. (But I don’t have an airtight no-go theorem either. I just don’t know; I plan to think about it more.) See especially footnote 7 of this post, and §6 of the Approval Reward post, for some of my concerns, which overlap with yours.
(I hope I wasn’t insinuating that I think AGI with Approval Reward is definitely a great plan that will solve AGI technical alignment. I’m open to wording changes if you can think of any.)