Yudkowsky and Soares seem to be entirely sincere, and they are proposing something that threatens tech company profits. This makes them much more convincing. It is refreshing to read something like this that is not based on hype.
I find it interesting that you see this as fresh, because ironically this was the original form of the existential-risk-from-AI argument. What happened here, I think, is something akin to watching a bunch of inferior versions of a certain trope in later movies before finally seeing the original movie that established the trope (and did it much better).
In practice, it’s not that companies made up the existential risk to drum up the potential power of their own AIs, and then someone refined the arguments into something more sensible. Rather, the arguments started out more serious, and some of the companies were founded on the premise of doing research to address them. OpenAI was meant to be a nonprofit with these goals; Anthropic split off when its founders thought OpenAI was not following that mission properly. But in the end all these companies, being private entities that needed to attract funding, fell prey to exactly the dynamics that the “paperclip maximizer” scenario points at: not an explicit attempt to destroy the world, but a race to the bottom in which, in order to achieve a goal efficiently and competitively, risks are taken, costs are cut, solutions are rushed, and eventually something might just go a bit too wrong for anyone to fix. And as they did so, they tried to rationalise away the existential risk with ever wonkier arguments.
Why should we assume the AI wants to survive? If it does, then what exactly wants to survive?
Why should we assume that the AI has boundless, coherent drives?
I think these concerns have related answers. I believe they belong to the category where Yudkowsky’s argument is indeed weaker, but more in the sense that he’s absolutely certain this would happen, whereas I might think it’s only, like, 60-70% likely? Which, for the purposes of this question, is still a lot.
So generally the concept is, if you were to pick a goal from the infinite space of all possible imaginable goals, then yeah, maybe it would be something completely random. “Successfully commit suicide” is a goal. But more likely, the outcome of a badly aligned AI would be an AI with something like a botched, incomplete version of a human goal. And human goals generally have to do with achieving something in the real world, something material, that we enjoy or we want more of for whatever reason. Such goals are usually aided by survival—by definition an AI that stays around can do more of X than an AI that dies and can’t do X any more. So survival becomes merely a means to an end, in that case.
The general problem here seems to be that even the most psychopathic, most deluded and/or most out-of-touch human still has a lot of what we could call common sense. Virtually no stationery company CEO, no matter how ruthless and cut-throat, would think “strip mine the Earth to make paperclips” is a good idea. But all these things we take for granted aren’t necessarily as obvious to an AI whose goals we are building from scratch, via what is essentially an invitation to guess our real wishes from a bunch of examples (“hey AI, look at this! This is good! But now look at this, this is bad! But this other thing, this is good!” and so on, after which we expect it to find a rule that coherently explains all of that). There are still infinitely many goals that are probably just as good at accounting for those examples, and by sheer entropy, most of them will have something bad about them rather than being neatly aligned with what a human would call good even in the cases we didn’t show. For the same reason that if I were handed the pieces of a puzzle and merely arranged them at random, the chance of getting the actual picture out is minuscule.
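To make the “many rules fit the same examples” point concrete, here’s a minimal Python sketch. It’s purely my own toy illustration (the situations, the rules, and the “resource cost” numbers are all made up, not anything from the book or the comments above): three candidate rules all reproduce a small set of good/bad labels perfectly, yet two of them give a different verdict on a case the examples never covered.

```python
# Toy sketch: hypothetical situations encoded as (helps_humans, resource_cost),
# labelled good/bad by the rule we *intend* the AI to learn.
examples = [
    ((True, 1), "good"),
    ((True, 3), "good"),
    ((False, 2), "bad"),
    ((False, 4), "bad"),
]

def intended(helps, cost):
    # The rule we actually meant: "good" means it genuinely helps humans.
    return "good" if helps else "bad"

def cheap_and_helps(helps, cost):
    # Botched rule: only cheap help counts as good.
    return "good" if helps and cost <= 3 else "bad"

def odd_cost_helps(helps, cost):
    # Botched rule: latches onto an accidental pattern in the examples.
    return "good" if helps and cost % 2 == 1 else "bad"

for name, rule in [("intended", intended),
                   ("cheap_and_helps", cheap_and_helps),
                   ("odd_cost_helps", odd_cost_helps)]:
    fits_all = all(rule(*situation) == label for situation, label in examples)
    # Unseen case: helps humans, but at an enormous resource cost.
    verdict = rule(True, 1_000_000)
    print(f"{name:16s} fits all examples: {fits_all}, unseen-case verdict: {verdict}")
```

That’s the puzzle-piece problem in miniature: the labelled examples alone can’t distinguish the intended rule from the botched ones, and the difference only shows up once the AI meets situations we never labelled.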
Why should we assume there will be no in between?
This is another one where I’d go from Yudkowsky’s certainty to a mere “very likely”, but again, not a big gap.
My thinking here would be: if an AI is weaker than us, or at best on par with us, and knows it, why should it start a fight it could lose? Why not bide its time, grow stronger, and then win? It would only open hostilities in that sort of situation if:
it was still too stupid to evaluate the situation properly and made a fatal calculation mistake
it was somehow pressed by circumstances (e.g. the humans have just agreed to ban all the AI and so it’s either act now or be shut down forever without a fight)
Of course both scenarios could happen, but I don’t think they’re terribly likely. In the discourse these usually get referred to as “warning shots”. In some ways, a future in which we do get a warning shot is probably desirable, given how often it takes that kind of tangible experience of risk for political action to be taken. But of course it could still do a lot of damage. Even a war you win is still a war, and if we could theoretically avoid that too, all the better.
It seems like the pressing circumstances are likely to be “some other AI could do this before I do” or even just “the next generation of AI will replace me soon, so this is my last chance.” Those are ways that a roughly human-level AI might end up trying a long-shot takeover attempt. Or maybe not, if the in-between period turns out to be very brief. But even if we do get this kind of warning shot, it doesn’t help us much. We might not notice it, and then we’re back where we started. Even if it’s obvious and almost succeeds, we don’t have a good response to it. If we did, we could just do that in advance and not have to deal with the near-destruction of humanity.
“We already knew, so why not start working on it before the problem manifested itself in full” sounds very reasonable, but look at how it’s going with climate change. Even with COVID, if you remember, there were a couple of months at the beginning of 2020 when various people were saying “eh, maybe it won’t come over here”, or “maybe it’s only spreading in China because their hygiene/healthcare is poor” (which was ridiculous, but I heard it; I even heard a variant of it about the UK when the virus started spreading in northern Italy, to the effect that Italy’s health service had nothing on the UK’s, so there was no reason to worry). Then people started dying in the West too, and suddenly several governments scrambled to respond. That was certainly more inefficient and less well coordinated than if they had all made a sensible plan back in January, but that’s not how political consensus works; you don’t get enough support for that kind of thing unless enough people have the ability and knowledge to extrapolate the threat into the future with reasonable confidence.