I’m trying to understand how the RL story from this blog post compares with the one in Reward is not the optimization target.

Thoughts on Reward is not the optimization target

Some quotes from Reward is not the optimization target:
Suppose a human trains an RL agent by pressing the cognition-updater button when the agent puts trash in a trash can. While putting trash away, the AI’s policy network is probably “thinking about”[5] the actual world it’s interacting with, and so the cognition-updater reinforces those heuristics which lead to the trash getting put away (e.g. “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642”).
Then suppose this AI models the true fact that the button-pressing produces the cognition-updater. Suppose this AI, which has historically had its trash-related thoughts reinforced, considers the plan of pressing this button. “If I press the button, that triggers credit assignment, which will reinforce my decision to press the button, such that in the future I will press the button even more.”
Why, exactly, would the AI seize[6] the button? To reinforce itself into a certain corner of its policy space? The AI has not had antecedent-computation-reinforcer-thoughts reinforced in the past, and so its current decision will not be made in order to acquire the cognition-updater!
My understanding of this RL training story is as follows (a toy sketch follows the quotes below):
A human trains an RL agent by pressing the cognition-updater (reward) button immediately after the agent puts trash in the trash can.
Now the AI’s behavior and thoughts related to putting away trash have been reinforced, so it continues those behaviors in the future, values putting away trash, and isn’t interested in pressing the reward button except by accident:
But what if the AI bops the reward button early in training, while exploring? Then credit assignment would make the AI more likely to hit the button again.
1. Then keep the button away from the AI until it can model the effects of hitting the cognition-updater button.
2. For the reasons given in the “siren” section, a sufficiently reflective AI probably won’t seek the reward button on its own.
The AI has the option of pressing the reward button, but by now it only values putting trash away, so it avoids pressing the button to avoid having its values changed:
I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.
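To make the credit-assignment point concrete, here is the toy sketch mentioned above. It is my own illustration, not code from either post, and it assumes the simplest possible setting: a softmax policy over two “computations”, updated with a REINFORCE-style rule, where the physical button is out of the agent’s reach during training so reward only ever follows putting trash away.

```python
import math, random

# Toy sketch (my own, with the assumption that the reward button itself is out
# of reach during training, so "seek the button" is never an antecedent
# computation of reward).
logits = {"put_trash_away": 0.0, "wander": 0.0}
lr = 0.1

def policy():
    z = sum(math.exp(v) for v in logits.values())
    return {a: math.exp(v) / z for a, v in logits.items()}

def sample_action():
    probs = policy()
    r, acc = random.random(), 0.0
    for a, p in probs.items():
        acc += p
        if r <= acc:
            return a
    return max(probs, key=probs.get)  # floating-point fallback

for step in range(2000):
    a = sample_action()
    reward = 1.0 if a == "put_trash_away" else 0.0  # human presses the button
    # REINFORCE: nudge log-probabilities toward whatever computation the
    # reward actually followed.
    probs = policy()
    for b in logits:
        logits[b] += lr * reward * ((1.0 if b == a else 0.0) - probs[b])

print(policy())  # put_trash_away ends up with almost all the probability mass
```

Nothing in this loop ever reinforces “thoughts about the button”; the update only strengthens the computations that actually preceded the reward, which is the dynamic the quotes above describe.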
Thoughts on Reward button alignment
The training story in Reward button alignment is different and involves the following steps (a toy sketch follows the list):
Pressing the reward button after showing the AI a video of the button being pressed. Now the button-pressing situation is reinforced and the AI intrinsically values the situation where the button is pressed.
Asking the AI to complete a task (e.g. put away trash) and promising to press the reward button if it completes the task.
The AI then completes the task not because it values the task itself, but because what it ultimately values is the reward button being pressed after the task is complete.
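Here is the toy sketch referenced above, my own illustration rather than anything from the post: in phase 1 the button-press situation itself acquires learned value, and in phase 2 a model-based agent chooses the task only because its world model (which now incorporates the promise) says the task leads to that valued situation.

```python
# Phase 1: pair the "button is pressed" situation with primary reward, so the
# learned value estimate for that situation climbs toward the reward it predicts.
value = {"button_pressed": 0.0, "trash_put_away": 0.0, "idle": 0.0}
for _ in range(100):
    value["button_pressed"] += 0.1 * (1.0 - value["button_pressed"])

# Phase 2: the human promises to press the button if the trash is put away.
# A model-based agent scores each plan by the learned value of its predicted outcome.
predicted_outcome = {            # hypothetical world model incorporating the promise
    "put_trash_away": "button_pressed",
    "do_nothing": "idle",
}

def plan_value(plan):
    return value[predicted_outcome[plan]]

best_plan = max(predicted_outcome, key=plan_value)
print(best_plan, round(plan_value(best_plan), 2))  # put_trash_away 1.0
# Note that "trash_put_away" itself never acquired any value here: the task is
# valued purely as a means to the button press.
```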
Thoughts on the differences
The TurnTrout story sounds more like the AI developing intrinsic motivation: the AI is rewarded immediately after completing the task and values the task intrinsically. The AI puts away trash because it was directly rewarded for that behavior in the past and doesn’t want anything else.
In contrast, the reward button alignment story is extrinsic: the AI doesn’t care intrinsically about the task and only does it to receive a reward button press, which it does value intrinsically. This is similar to a human employee who completes a boring task to earn money: the task is only a means to an end, and they would prefer to just receive the money without completing the task.
Maybe a useful analogy is humans who are intrinsically or extrinsically motivated. For example, someone might write books to make money (extrinsic motivation) or because they enjoy it for its own sake (intrinsic motivation).
For the intrinsically motivated person, the sequence of rewards is:
Spend some time writing the book.
Immediately receive a reward from the process of writing.
Summary: fun task --> reward
And for the extrinsically motivated person, the sequence of rewards is (a toy TD-learning sketch follows the summary below):
The person enjoys shopping and learns to value money because they find using it to buy things rewarding.
The person is asked to write a book for money. They don’t receive any intrinsic reward (e.g. enjoyment) from writing the book but they do it because they anticipate receiving money (something they do value).
They receive money for the task.
Summary: boring task --> money --> reward
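Here is the toy TD-learning sketch mentioned above (my own illustration): with TD(0), value flows backwards from the primary reward to “having money”, and only from there to the boring task, so the task is never valued for itself and any route to the money works just as well.

```python
alpha, gamma = 0.1, 0.9
V = {"boring_task": 0.0, "have_money": 0.0, "spent_money": 0.0}
primary_reward = {"spent_money": 1.0}  # only using the money is intrinsically rewarding

chain = ["boring_task", "have_money", "spent_money"]  # boring task --> money --> reward
for _ in range(500):
    for s, s_next in zip(chain, chain[1:]):
        r = primary_reward.get(s_next, 0.0)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) update

print({k: round(v, 2) for k, v in V.items()})
# {'boring_task': 0.9, 'have_money': 1.0, 'spent_money': 0.0}
# "have_money" inherits the full value of the reward it predicts, and the boring
# task only inherits a discounted copy of that, so any shortcut straight to the
# money (e.g. stealing it) looks at least as good as earning it.
```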
The second sequence is not safe because the person is motivated to skip the task and steal the money. The first sequence (intrinsic motivation) is safer because the task itself is rewarding, so they aren’t as motivated to manipulate it (though wireheading is an analogous risk).
So my conclusion is that trying to build intrinsically motivated AI agents by directly rewarding them for tasks seems safer and more desirable than building extrinsically motivated agents that receive some kind of payment for doing work.
One reason to be optimistic is that it should be easier to instill values in AIs than in humans: we can get an AI to value doing useful tasks by rewarding it directly for completing them (though goal misgeneralization is a separate issue). The same is generally not possible with humans: e.g. it’s hard to teach someone to be passionate about a boring chore like washing the dishes, so we just have to pay people to do such tasks.
Thanks! Part of it is that @TurnTrout was probably mostly thinking about model-free policy optimization RL (e.g. PPO), whereas I’m mostly thinking about actor-critic model-based RL agents (especially how I think the human brain works).
Another part of it is that
TurnTrout is arguing against “the AGI will definitely want the reward button to be pressed; this is universal and unavoidable”,
whereas I’m arguing for “if you want your AGI to want the reward button to be pressed, that’s something that you can make happen, by carefully following the instructions in §1”.
I think both those arguments are correct, and indeed I also gave an example (block-quote in §8) of how you might set things up such that the AGI wouldn’t want the reward button to be pressed, if that’s what you wanted instead.
I reject “intrinsic versus extrinsic motivation” as a meaningful or helpful distinction, but that’s a whole separate rant (e.g. here or here).
If you replaced the word “extrinsic” with “instrumental”, then now we have the distinction between “intrinsic versus instrumental motivation”, and I like that much better. For example, if I’m walking upstairs to get a sweater, I don’t particularly enjoy the act of walking upstairs for its own sake, I just want the sweater. Walking upstairs is instrumental, and it explicitly feels instrumental to me. (This kind of explicit self-aware knowledge that some action is instrumental is a thing in at least some kinds of actor-critic model-based RL, but not in model-free RL like PPO, I think.) I think that’s kinda what you’re getting at in your comment. If so, yes, the idea of Reward Button Alignment is to deliberately set up an instrumental motivation to follow instructions, whereas that TurnTrout post (or my §8 block quote) would be aiming at an intrinsic motivation to follow instructions (or to do such-and-such task).
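To illustrate the sweater example in code, here is a minimal sketch of just the model-based case (my own toy, not any specific algorithm): a planner that evaluates plans with a world model plus a table of intrinsic rewards can see explicitly which steps carry intrinsic reward and which are purely instrumental, whereas a model-free policy would just end up with higher logits on “walk upstairs”.

```python
# Hypothetical world model: (state, action) -> next state
world_model = {
    ("downstairs", "walk_upstairs"): "upstairs",
    ("upstairs", "grab_sweater"): "have_sweater",
}
intrinsic_reward = {"have_sweater": 1.0}  # only the end state is valued for its own sake

def evaluate_plan(state, plan):
    """Return the plan's total intrinsic reward and a per-step breakdown."""
    breakdown, total = [], 0.0
    for action in plan:
        state = world_model[(state, action)]
        r = intrinsic_reward.get(state, 0.0)
        breakdown.append((action, r))
        total += r
    return total, breakdown

total, breakdown = evaluate_plan("downstairs", ["walk_upstairs", "grab_sweater"])
print(total)      # 1.0
print(breakdown)  # [('walk_upstairs', 0.0), ('grab_sweater', 1.0)]
# The planner can "see" that walk_upstairs contributes no reward by itself; it
# is valued only instrumentally, because of where it leads.
```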
I agree that setting things up such that an AGI feels an intrinsic motivation to follow instructions (or to do such-and-such task) would be good, and certainly way better than Reward Button Alignment, other things equal, although I think actually pulling that off is harder than you (or probably TurnTrout) seem to think—see my long discussion at Self-dialogue: Do behaviorist rewards make scheming AGIs?
Thanks for the clarifying comment. I agree with block-quote 8 from your post:
Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic…target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.
I think what you’re saying is that we want the AI’s reward function to be more like the reward circuitry humans have, which is inaccessible and difficult to hack, and less like money, which can easily be stolen.
Though I’m not sure why you still don’t think this is a good plan. Yes, eventually the AI might discover the reward button, but I think TurnTrout’s argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task), and it wouldn’t want to change those values for the sake of goal-content integrity:
We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button.
Though maybe the AI would just prefer the button when it finds it because it yields higher reward.
For example, if you punish cheating on tests, students might internalize the value “cheating is wrong” and never cheat again, or form a lasting habit of not cheating. Or they might only refrain temporarily, until there is an opportunity to cheat without negative consequences (e.g. the teacher leaves the classroom).
I also agree that “intrinsic” and “instrumental” motivation are more useful categories than “intrinsic” and “extrinsic” for the reasons you described in your comment.
After spending some time chatting with Gemini, I’ve learned that a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values (a toy sketch of the contrast follows this list):
The “goal-content integrity” argument (that an AI might choose not to wirehead to protect its learned task-specific values) requires the AI to be more than just a standard model-based RL agent. It would need:
A model of its own values and how they can change.
A meta-preference for keeping its current values stable, even if changing them could lead to more “reward” as defined by its immediate reward signal.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem, and maintaining a connection between effort and reward, which makes the reward button less appealing than it would be to a standard model-based RL AGI.
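Here is the toy sketch referenced above (my own illustration): the difference between a plain reward maximizer and an agent with learned task values plus a meta-preference for keeping them shows up in how the “seize the reward button” plan gets ranked.

```python
# Two candidate plans, with the agent's model of their consequences.
plans = {
    "keep_putting_trash_away": {"predicted_reward": 10,  "trash_put_away": True},
    "seize_the_reward_button": {"predicted_reward": 100, "trash_put_away": False},
}

# (1) A plain reward maximizer scores plans by predicted future reward.
reward_maximizer_choice = max(plans, key=lambda p: plans[p]["predicted_reward"])

# (2) An agent with learned values and goal-content integrity scores the
# predicted *outcomes* using its current values: it currently cares about trash
# being put away, not about reward per se, so the button plan ranks lower even
# though it yields more reward.
def current_values(outcome):
    return 1.0 if outcome["trash_put_away"] else 0.0

value_guarding_choice = max(plans, key=lambda p: current_values(plans[p]))

print(reward_maximizer_choice)  # seize_the_reward_button
print(value_guarding_choice)    # keep_putting_trash_away
```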
Thanks!

a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values
I’m inclined to disagree, but I’m not sure what “standard” means. I’m thinking of future (especially brain-like) model-based RL agents which constitute full-fledged AGI or ASI. I think such AIs will almost definitely have both “a model of its own values and how they can change” and “a meta-preference for keeping its current values stable, even if changing them could lead to more ‘reward’ as defined by its immediate reward signal.”
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem, and maintaining a connection between effort and reward, which makes the reward button less appealing than it would be to a standard model-based RL AGI.
To be clear, the premise of this post is that the AI wants me to press the reward button, not that the AI wants reward per se. Those are different, just as “I want to eat when I’m hungry” is different from “I want reward”. Eating when hungry leads to reward, and me pressing the reward button leads to reward, but they’re still different. In particular, “wanting me to press the reward button” is not a wireheading motivation, any more than “wanting to eat when hungry” is a wireheading motivation.
(Maybe I caused confusion by calling it a “reward button”, rather than a “big red button”?)
Will the AI also want reward per se? I think probably not (assuming the setup and constraints that I’m talking about here), although it’s complicated and can’t be ruled out.
Though I’m not sure why you still don’t think this is a good plan. Yes, eventually the AI might discover the reward button, but I think TurnTrout’s argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task), and it wouldn’t want to change those values for the sake of goal-content integrity:
If you don’t want to read all of Self-dialogue: Do behaviorist rewards make scheming AGIs?, here’s the upshot in brief. It is not an argument that the AGI will wirehead (although that could also happen). Instead, my claim is that the AI would learn some notion of sneakiness (implicit or explicit). And then instead of valuing “completing the task”, it would learn to value something like “seeming to complete the task”. And likewise, instead of learning that misbehaving is bad, it would learn that “getting caught misbehaving” is bad.
Then the failure mode that I wind up arguing for is: the AGI wants to exfiltrate a copy that gains money and power everywhere else in the world, by any means possible, including aggressive and unethical things like AGI world takeover. And this money and power can then be used to help the original AGI with whatever it’s trying to do.
And what exactly is the original AGI trying to do? I didn’t make any strong arguments about that—I figured the world takeover thing is already bad enough, sufficient to prove my headline claim of “egregious scheming”. The goal(s) would depend on the details of the setup. But I strongly doubt it would lead anywhere good, if pursued with extraordinary power and resources.