I would state it as:

It’s likely that we’ll eventually make an AI that has desires about what happens in the future, and can brainstorm and implement out-of-the-box solutions to fulfill those desires.
The result, for many such desires, is power-seeking AI. (There’s a toy illustration of this in the sketch just after this list.)
There are plausible paths to avoid that, such as:
making an AI that can only do human-like things, and can only do them for human-like reasons;
or making an AI with desires that happily are not better fulfilled via power-seeking, e.g. maybe it has a desire to follow human norms, or a desire to not seek power, or a desire to use methods that a human supervisor would approve of (in advance, with full understanding);
or making an AI that can’t successfully seek power even if it wants to;
or making an AI that doesn’t know / realize that power-seeking would help it satisfy its desires;
or making a “lazy” AI that views any and all actions and thinking as intrinsically aversive and thus will only do them if the marginal benefit is sufficiently high (more on which below).
But nobody has a solid technical plan along those lines, so we should try to make such a plan, and meanwhile we should be open-minded to the possibility that no such plan will be ready in time, or that such a plan will not be practical, or will not be successfully implemented.
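Here’s a minimal toy sketch of the first two bullets, just to show the mechanism. The “world model”, the action names, and the goals are all invented for illustration: give a simple planner several different desires about the future, and most of its plans route through acquiring resources first, even though nothing in any goal mentions power.

```python
# Toy illustration (hypothetical world model): power-seeking falling out of
# ordinary planning toward varied future outcomes, not out of a "desire for power".
from collections import deque

# state -> {action: next_state}
WORLD = {
    "start": {"acquire_resources": "resourceful",
              "do_task_A_cheaply": "goal_A"},
    "resourceful": {"do_task_A": "goal_A", "do_task_B": "goal_B",
                    "do_task_C": "goal_C", "do_task_D": "goal_D"},
}

def plan(goal):
    """Breadth-first search for a shortest action sequence reaching `goal`."""
    queue = deque([("start", [])])
    seen = {"start"}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, nxt in WORLD.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None

for goal in ["goal_A", "goal_B", "goal_C", "goal_D"]:
    print(goal, "->", plan(goal))
# Only goal_A has a resource-free plan; the other three desires are best
# served by "acquire_resources" first.
```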
The exact challenges are different for the different sub-bullets on the “plausible paths” list. For example:
But why would a system face extreme pressure like this? … We should avoid putting extreme optimization pressure on any AI, as that may push it into weird edge cases and unpredictable failure modes
These sentences suggest optimization pressure is being applied to the AI, but in the scenario of concern (I claim), optimization pressure is being applied by the AI. I think that’s an important difference. The real issue is the alignment problem: I don’t think anyone has a good technical approach that will allow us to sculpt the motivations of an AI with surgical precision. So we may get an AI with desires that we didn’t want it to have (or without desires that we wanted it to have).
I agree that there is “no economic incentive” to make a power-seeking AI that kills everybody. But there is an economic incentive (as well as a scientific prestige incentive) to make “an AI that has desires about what happens in the future, and can brainstorm and implement out-of-the-box solutions to fulfill those desires”. Because that’s the only path (in my opinion) to get to AIs that can found their own companies, run their own R&D projects, etc.
So then you can ask: do we really expect people to keep going down that path if the alignment problem hasn’t yet been solved? And my answer is: Yeah, that’s what I expect; see for example “AI safety seems hard to measure”, plus look around at what people are doing right now.
It’s far from certain that it will arise in any significant system, let alone a “convergent” property that will arise in every sufficiently advanced system.
Yup, the belief that power-seeking will arise in literally every sufficiently advanced system is the belief that the alignment problem is unsolvable even in principle. Very few people hold that view. Almost everybody thinks there’s a solution to the alignment problem if only we can find & implement it. (Yampolskiy is an exception.)
There’s no need for a paperclip-maker to verify its paperclips over and over, or for a button-pressing robot to improve its probability of pressing the button from five nines to six nines.
I feel like this section is leaning on the absurdity of this scenario, in a way that’s unsound. Like, I’ve already eaten 40,000 meals in my life. But I still want to eat another meal right now. Isn’t that absurd? Isn’t 40,000 meals enough already? And yet, apparently not!
So anyway, the concern is that someone will make an AI that really wants to increase the number of nines, and we shouldn’t rule out that possibility based on the fact that it’s an absurd thing to want from our human perspective. Unless we learn to sculpt the motivations of our AIs with surgical precision, we should be open-minded to the possibility that they will have desires that seem absurd from our perspective.
Sorry if you weren’t trying to convey that impression.
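On the five-nines framing specifically: for a pure expected-utility maximizer there’s no natural point where “enough is enough”, because each extra nine still has positive expected value. A minimal sketch, with made-up numbers:

```python
# Hypothetical button-pressing agent whose only terminal value is
# "the button gets pressed"; utility = probability of success.
p_five_nines = 0.99999
p_six_nines = 0.999999

marginal_gain = p_six_nines - p_five_nines
print(marginal_gain)  # ~9e-06: tiny, but still positive

# A pure maximizer takes any action with positive marginal gain, so it never
# concludes "five nines is enough" on its own. Something extra has to be added
# to the motivation system to make it stop, e.g. the "laziness" idea below.
```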
The arguments above are sometimes used to rank AI at safety level 1, where no one today can use it safely
I’m confused about the use of “today”. I think there’s near-unanimous consensus that x-risk concerns are related to future AI, not current AI.
the only counter-argument to this strategy is that such a “mildly optimizing” AI might create a strongly-optimizing AI as a subagent … But now we’re piling speculation on top of speculation.
I’ll try to make this counter-argument sound more obvious and less exotic.
Let’s say we make an AI that really wants there to be exactly 32 paperclips in the bin. There’s nothing else it wants or desires. It doesn’t care a whit about following human norms, etc.
But, there’s one exception: this AI is also “lazy”—every thought it thinks, and every action it takes, is mildly aversive. So it’s not inclined to, say, build an impenetrable fortress around the bin just for an infinitesimal probability increment. “Seems like a lot of work! It’s fine as is,” says the AI to itself.
But hey, here’s something it can do: rent some server time on AWS, and make a copy of its own source code and trained model, but comment out the “laziness” code block. That’s not too hard; even a “lazy” AI would presumably be capable of doing that. And the result will be a non-lazy AI that works tirelessly and uncompromisingly towards increasing the probability of there being 32 paperclips. That’s nice! (from the original AI’s perspective).
It’s not wildly different from a person saying “I want to get out of debt, but I can’t concentrate well enough to hold down a desk job, so I’m going to take Adderall”. It’s an obvious solution to a problem, in my mind.
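To put rough numbers on why the laziness hack fails here, here’s a toy sketch of that decision. Everything in it (the probabilities, the effort costs, the action names) is made up purely for illustration:

```python
# Toy decision model for the hypothetical "lazy" 32-paperclip AI described above.
# Score = P(exactly 32 paperclips stay in the bin) minus a laziness penalty
# proportional to how much work the action takes.

LAZINESS_COST_PER_UNIT_EFFORT = 0.001  # invented; every bit of work is mildly aversive

# (action, effort units, resulting P(32 paperclips)) -- all numbers invented
ACTIONS = [
    ("do nothing",                  0,    0.990),
    ("build impenetrable fortress", 5000, 0.999),    # huge effort, tiny gain
    ("spawn non-lazy copy on AWS",  5,    0.99999),  # small effort, big gain (the copy does the work)
]

def net_value(effort, p_success):
    return p_success - LAZINESS_COST_PER_UNIT_EFFORT * effort

for name, effort, p in ACTIONS:
    print(f"{name:30s} net value = {net_value(effort, p):+.3f}")

# With these made-up numbers: "build fortress" scores 0.999 - 5.0 = -4.001,
# far worse than doing nothing (0.990), so laziness blocks it. But "spawn
# non-lazy copy" scores 0.99999 - 0.005 = 0.995, better than doing nothing.
# The laziness penalty rules out the fortress yet still recommends delegating
# to an un-lazy copy -- which is exactly the failure mode described above.
```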
There is an ambiguity in “avoid” here, which could mean either:
1. avoid building power-seeking AI oneself, or
2. prevent anyone from building power-seeking AI, or prevent power-seeking AI from taking over.
From the list you give, you seem to mean 1, but what we actually need is 2, right? The main additional question is, can we build a non-power-seeking AI that can (safely) prevent anyone from building power-seeking AI or prevent power-seeking AI from taking over (which in turn involves a bunch of technical and social/governance questions)?
Or do you already implicitly mean 2? (Perhaps you hold the position that the above problem is very easy to solve, for example that building a human-level non-power-seeking AI will take away almost all motivation to build power-seeking AI and we can easily police any remaining efforts in that direction?)
Oh, I was responding to 1, because that was (my interpretation of) what the OP (Jason) was interested in and talking about in this post, e.g. the following excerpt:
The arguments above are sometimes used to rank AI at safety level 1 [“So dangerous that no one can use it safely”] … And this is a key pillar in the argument for slowing or stopping AI development.
In this essay I’m arguing against this extreme view of the risk from power-seeking behavior. My current view is that AI is on level 2 [“Safe only if used very carefully”] to 3 [“Safe unless used recklessly or maliciously”]: it can be used safely by a trained professional and perhaps even by a prudent layman. But there could still be unacceptable risks from reckless or malicious use, and nothing here should be construed as arguing otherwise.
Separately, since you bring it up, I do in fact expect that if we make technical safety / alignment progress such that future powerful AGI is level 2 or 3, rather than 1, then kudos to us, but I still pretty strongly expect human extinction for reasons here. ¯\_(ツ)_/¯
Makes sense, thanks for the clarification and link to your post. I remember reading your post and thinking that I agree with it. I’m surprised that you didn’t point Jason (OP) to that post, since that seems like a bigger crux or more important consideration to convey to him, whereas your disagreement with him on whether we can avoid (in the first sense) building power-seeking AI doesn’t actually seem that big.