In other words, will the AGI actually want you to push the button? Or would it want some random weird thing because inner alignment is hard?
My answer is: yes, it would want you to push the button, at least if we’re talking about brain-like AGI, and if you set things up correctly.
Again, getting a brain-like AGI addicted to a reward button is a lot like getting a human or animal hooked on an addictive drug.
Humans addicted to drugs often exhibit weird meta-preferences like ‘I want to stop wanting the drug’, or ‘I want to find an even better kind of drug’.
For this reason, I am not at all confident that a smart thing exposed to the button would later generalise into a coherent, super-smart thing that wants the button to be pressed. Maybe it would perceive the circuits in it that bound to the button reward as foreign to the rest of its goals, and work to remove them. Maybe the button binding would generalise in a strange way.
‘Seek to directly inhabit the cognitive state caused by the button press’, ‘along an axis of cognitive states associated with button presses of various strengths, seek to walk to a far end that does not actually correspond to any kind of button press’, ‘make the world have a shape related to generalisations of ideas that tended to come up whenever the button was pressed’, and just generally ‘maximise a utility function made up of algorithmically simple combinations of button-related and pre-button-training-reward-related abstractions’ all seem like goals I could imagine a cognitively enhanced human button addict generalising toward. So I am not confident the AGI would generalise to wanting the button to be pushed either, not in the long term.
Humans addicted to drugs often exhibit weird meta-preferences like ‘I want to stop wanting the drug’, or ‘I want to find an even better kind of drug’.
“I want to stop wanting the drug” is downstream of the fact that people have lots of innate drives giving rise to lots of preferences, and the appeal of the drug itself is just one of these many competing preferences and drives.
However, I specified in §1 that if you’re going for Reward Button Alignment, you should zero out the other drives and preferences first. So that would fix the problem. That part is a bit analogous to clinical depression: all the things that you used to like—your favorite music, sitting at the cool kids table at school, satisfying curiosity, everything—just loses its appeal. So now if the reward button is strongly motivating, it won’t have any competition.
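(To make “zero out the other drives and preferences first” a bit more concrete, here is a toy sketch. The drive names and weights are made up for illustration and are not part of the actual proposal: think of the reward function as a weighted sum of drive-specific terms, with every weight except the button term set to zero.)

```python
# Hypothetical sketch: a composite reward with all non-button drives zeroed out.
# Drive names and weights are invented for illustration only.

DRIVE_WEIGHTS = {
    "curiosity": 0.0,        # zeroed: novelty no longer feels rewarding
    "social_approval": 0.0,  # zeroed: praise no longer feels rewarding
    "play": 0.0,             # zeroed
    "reward_button": 1.0,    # the only remaining source of primary reward
}

def primary_reward(drive_signals: dict) -> float:
    """Weighted sum of innate drive signals; only the button term survives."""
    return sum(DRIVE_WEIGHTS[name] * drive_signals.get(name, 0.0)
               for name in DRIVE_WEIGHTS)

# Curiosity can fire as strongly as it likes; only the button press produces reward.
print(primary_reward({"curiosity": 0.9, "reward_button": 0.0}))  # 0.0
print(primary_reward({"curiosity": 0.9, "reward_button": 1.0}))  # 1.0
```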
“I want to find an even better kind of drug” might be a goal generalization / misgeneralization thing. The analogy for the AGI would be to feel motivated by things somehow similar to the reward button being pressed. Let’s say, it wants other red buttons to be pressed. But then those buttons are pressed, and there’s no reward, and the AGI says “oh, that’s disappointing, I guess that wasn’t the real thing that I like”. Pretty quickly it will figure out that only the real reward button is the thing that matters.
Ah, but what if the AGI builds its own reward button and properly wires it up to its own reward channel? Well sure, that could happen (although it also might not). We could defend against it with cybersecurity protections, such that the AGI doesn’t have access to its own source code and reward channel. That doesn’t last to superintelligence, but we already knew that this plan doesn’t last to superintelligence, to say the least.
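(As a purely illustrative sketch of one such protection, not something specified anywhere in this plan: the reward channel could accept only button-press messages signed with a secret key that lives entirely on the operators’ side. The function names and message format below are hypothetical.)

```python
# Illustrative sketch only: a reward channel that emits reward only for messages
# carrying a valid HMAC tag, so a fake button wired up by the AGI (which never
# sees the key) produces no reward.
import hmac
import hashlib
import secrets

SECRET_KEY = secrets.token_bytes(32)  # held on the operators' side only

def sign_button_press(message: bytes) -> bytes:
    """Run by the operators' physical button hardware."""
    return hmac.new(SECRET_KEY, message, hashlib.sha256).digest()

def reward_channel(message: bytes, tag: bytes) -> float:
    """Run on the operators' side of the training loop; rewards only authentic presses."""
    expected = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
    return 1.0 if hmac.compare_digest(expected, tag) else 0.0

# A genuine press yields reward; a forged press (wrong tag) yields nothing.
msg = b"button_press:2024-01-01T00:00:00Z"
print(reward_channel(msg, sign_button_press(msg)))  # 1.0
print(reward_channel(msg, b"\x00" * 32))            # 0.0
```

Of course, this just relocates the problem to securing whatever machine holds the key, which is part of why none of this lasts to superintelligence.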
I am not at all confident that a smart thing exposed to the button would later generalise into a coherent, super-smart thing that wants the button to be pressed.
I think you’re in a train-then-deploy mentality, whereas I’m talking about RL agents with continuous learning (e.g. how the brain works). So if the AGI has some funny idea about what is Good, generalized from idiosyncrasies of previous button presses, then it might try to make those things happen again, but it will find that the results are unsatisfying, unless of course the button was actually pressed. It’s going to eventually learn that nothing feels satisfying unless the button is actually pressed. And I really don’t think it would take very many repetitions for the AGI to figure that out.
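(Here is a toy illustration of that continual-learning dynamic, not anything from the actual setup: with online value updates, a misgeneralized “decoy button” gets extinguished within a handful of repetitions once it keeps failing to deliver reward, while the real button keeps getting reinforced. The starting values and learning rate are made up.)

```python
# Toy illustration of the continual-learning point: online updates extinguish
# a decoy button's learned value once it repeatedly fails to deliver reward,
# while the real button's value persists.

ALPHA = 0.3  # learning rate (arbitrary)

def update(value: float, reward: float) -> float:
    """One online value-learning step toward the observed reward."""
    return value + ALPHA * (reward - value)

values = {"real_button": 0.5, "decoy_button": 0.5}  # misgeneralized starting point

for step in range(20):
    values["real_button"] = update(values["real_button"], reward=1.0)   # always pays off
    values["decoy_button"] = update(values["decoy_button"], reward=0.0) # never does

print(values)  # real_button -> ~1.0, decoy_button -> ~0.0 after a few repetitions
```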
It might want the button to be pressed in some ways but not others, it might try to self-modify (including hacking into its reward channel) as above, it might surreptitiously spin off modified copies of itself to gather resources and power around the world and ultimately help with button-pressing, all sorts of stuff could happen. But I’m assuming that these kinds of things are stopped by physical security and cybersecurity etc.
Recall that I did describe it as “a terrible plan” that should be thrown into an incinerator. We’re just arguing here about the precise nature of how and when it will fail.
Another point that deserves to be put into the conversation is that if you have designed the reward function well enough, then hitting the reward button / getting reward means you get increasing capabilities, so addiction to the reward source is even more likely than you paint it.
This creates problems if there’s a large enough zone where reward functions can be specified well enough that getting reward leads to increasing capabilities, but not well enough to specify non-instrumental goals.
The prototypical picture I have of outer-alignment / goal-misspecification failures looks a lot like what happens to human drug addicts, except that unlike drug addicts IRL, getting reward makes you smarter and more capable all the time, not dumber and weaker. That means there’s no real reason to restrain yourself from trying anything and everything, including deceptive alignment, to get the reward fix, at least assuming no inner alignment / goal misgeneralization happened in training.
Quote below:
As we have pointed out, the cognitive ability of addicts tends to decrease with progressing addiction. This provides a natural negative feedback loop that puts an upper bound on the amount of harm an addict can cause. Without this negative feedback loop, humanity would look very different.[16] This mechanism is, by default, not present for AI.[17]
Footnote 16: The link leads to a (long) fiction novel by Scott Alexander where Mexico is controlled by people constantly high on peyote, who become extremely organized and effective as a result. They are scary & dangerous.
Footnote 17: Although it is an interesting idea to scale access to compute inversely to how high the value of the accumulated reward is.
Link below:
https://universalprior.substack.com/p/drug-addicts-and-deceptively-aligned
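(As an aside, footnote 17’s idea of scaling access to compute inversely with accumulated reward could be read, in its simplest hypothetical form, as a throttle like the following; the numbers and function name are made up for illustration.)

```python
# Hypothetical sketch of footnote 17's idea: the more reward the agent has
# accumulated, the less compute it is allowed to use, mimicking the way human
# addicts get less capable as addiction progresses.

BASE_COMPUTE = 1e18  # compute allowance when no reward has been accumulated (made-up number)

def compute_budget(accumulated_reward: float) -> float:
    """Compute allowance shrinks inversely with accumulated reward."""
    return BASE_COMPUTE / (1.0 + accumulated_reward)

print(compute_budget(0.0))    # full budget
print(compute_budget(100.0))  # ~1% of the budget after lots of reward
```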