I agree that AGI is possible to make, that it eventually will become orders-of-magnitude smarter than humans, and that it poses a global risk if the alignment problem is not solved. I also agree that the alignment problem is very hard, and is unlikely to be solved before the first AGI. And I think it’s very likely that the first recursively-self-improving AGI will emerge before 2030.
But I don’t get the confidence about the unaligned AGI killing off humanity. The probability may be 90%, but it’s not 99.9999% as many seem to imply, including Eliezer.
Sure, humans are made of useful atoms. But that doesn’t mean the AGI will harvest humans for useful atoms. I don’t harvest ants for atoms. There are better sources.
Sure, the AGI may decide to immediately kill off humans, to eliminate them as a threat. But there is only a very short time period (perhaps milliseconds) during which humans can switch off a recursively-self-improving AGI of superhuman intelligence. After this critical period, humanity will be as much a threat to the AGI as a caged mentally-disabled sloth baby is a threat to the US military. The US military is not waging wars against mentally disabled sloth babies. It has more important things to do.
All such scenarios I’ve encountered so far imply AGI’s stupidity and/or the “fear of sloths”, and thus are not compatible with the premise of a rapidly self-improving AGI of superhuman intelligence. Such an AGI is dangerous, but is it really “we’re definitely going to die” dangerous?
Our addicted-to-fiction brains love clever and dramatic science fiction scenarios. But we should not rely on them in deep thinking, as they will nudge us towards overestimating the probabilities of the most dramatic outcomes.
Overestimating a global risk is almost as bad as underestimating it. Compare: if you’re 99.99999% sure that a nuclear war will kill you, then the despondency will greatly reduce your chances of surviving the war, because you’ll fail to make the necessary preparations (acquiring a bunker, etc.) that could realistically save your life in many circumstances.[1]
The topic of surviving the birth of the AGI is severely under-explored, and the “we’re definitely going to die” mentality seems to be the main cause. A related under-explored topic is preventing the unaligned AGI from becoming misanthropic, which should be our second line of defense (the first one is alignment research).
[1] BTW, despondency is deadly by itself. If you’ve lost all hope, there is a high risk that you’ll not live long enough to see the AGI, be it aligned or not.
Humans are at risk from an unaligned AI not because our atoms will be harvested, but because the resources we need to live will be harvested or poisoned, as briefly described by Critch.
Be wary of taking the poetic analogy of “like humans to ants” too far and thinking that it is a literal statement about ants. The household ant is the exception, not the rule.
Our relationship to an unaligned AI will be like the relationship between humans and the unnamed species of butterfly that went extinct while you were reading this.
I think the analogy of human militaries to ants is a bit flawed, for two main reasons.
1. Ants don’t build AGI—Humans don’t care about ants because they’re so uncoordinated in comparison, and can’t pose much of a threat. Humans can pose a significant threat to an ASI—building another ASI.
2. Ants don’t collect gold—Humans, unlike ants, control a lot of important resources. If every ant nest was built on a pile of gold, you can best believe humans would actively look for and kill ants. Not because we hate ants, but because we want their gold. An unaligned ASI will want our chip factories, our supply chains, bandwidth, etc. All of which we would be much better off keeping.
Adding to the line of reasoning.
Humans are not living on anthills—the AGI will be from Earth. Imagine you spontaneously come into being on an anthill with no immediate way of leaving. Suddenly, making sure the ants don’t eat your food becomes much more interesting.
Early rationalist writing on the threats of unaligned AGI emerged out of thinking about GOFAI systems that were supposed to operate on rationalist or logical thought processes. Everything is explicitly coded and transparent. Based on this framework, if an AI system operates on pure logic, then you’d better ensure that you specify a goal that doesn’t leave any loopholes in it. In other words, the AI would follow the laws exactly as they were written, not as they were intended by fuzzy human minds or the spirit animating them. Since the early alignment researchers couldn’t figure out how to logically specify human values and goals that would parse for a symbolic AI without leaving loopholes large enough to threaten humanity, they grew pessimistic about the whole prospect of alignment. This pessimism has infected the field and remains with us today, even with the rise of deep learning and deep neural networks.
I will leave you with a quote from Eliezer Yudkowsky which I believe encapsulates this old view of how AI, and alignment, were supposed to work.
A different argument for why instrumental convergence will not kill us all. Isn’t it possible that we will have a disruptive, virus-like AI before AGI?
I agree with the commonly held view that AGI (i.e., recursively self-improving and embodied) will take actions that can be considered very harmful to humanity.
But isn’t it much more likely that we will first have an AI that is only embodied, without yet being able to improve itself? As such, it might copy itself. It might hold bank accounts hostage until people sign up for some search engine. It might spy on people through webcams. But it won’t go supernova, because making a better model than what the budgets of Google or Microsoft can produce is hard.
And if that happens we will notice. And when we notice maybe there will be action to prevent a more catastrophic scenario.
Would love to hear some thoughts on this.
Uhm, I don’t think anybody (even Eliezer) implies 99.9999%. Maybe some people imply 99%, but in survival-probability terms that is still a four-orders-of-magnitude gap, far larger than the single order of magnitude separating 90% from 99%.
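To make that arithmetic explicit (a minimal worked check, measuring everything by the implied probability of survival):

$$1 - 0.90 = 10^{-1}, \qquad 1 - 0.99 = 10^{-2}, \qquad 1 - 0.999999 = 10^{-6}$$

So going from 90% to 99% shrinks the survival probability by one order of magnitude, while going from 99% to 99.9999% shrinks it by four more.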
I don’t think there are many people who put the chance at 95%+, even among those who are considered to be doomerish.
And I think most LW people are significantly lower, despite being rightfully [very] concerned. For example, this Metaculus question (which is of course not LW, but the audience intersects quite a bit) has a mean of only 13% (and a median of 2%).
The OP here. The post was inspired by this interview by Eliezer:
My impression after watching the interview:
Eliezer thinks that the unaligned AGI, if created, will almost certainly kill us all.
Judging by the despondency he expresses in the interview, he feels that the unaligned AGI is about as deadly as a direct shot right in the head from a large-caliber gun. So, at least 99%.
But I can’t read his mind, so maybe my interpretation is incorrect.
If you switch “community weighting” to “uniform” you see that historically almost everyone has answered 1%.
Humans are dangerous while they control infrastructure and can create more AGIs. Resolving this issue specifically by taking the toys away, perhaps even uploading the civilization? Seems unnecessarily complicated, unless that’s an objective.
Then there are the consequences of disassembling Earth (because it’s right here), starting immediately. Unless leaving humans alive is an objective, that’s not the outcome.
I think this is too all-or-nothing about the objectives of the AI system. Following ideas like shard theory, objectives are likely to come in degrees, be numerous and contextually activated, having been messily created by gradient descent.
Because “humans” are probably everywhere in its training data, and because of naive safety efforts like RLHF, I expect AGI to have a lot of complicated pseudo-objectives / shards relating to humans. These objectives may not be good—and if they are, they probably won’t constitute alignment, but I wouldn’t be surprised if it were enough to make it do something more complicated than simply eliminating us for instrumental reasons.
Of course the AI might undergo a reflection process leading to a coherent utility function when it self-improves, but I expect it to be a fairly complicated one, assigning some sort of valence to humans. We might also have some time before it does that, or be able to guide this values-handshake between shards collaboratively.
In the framing of the grandparent comment, that’s an argument that saving humanity will be an objective for plausible AGIs. The purpose of those disclaimers was to discuss the hypothetical where that’s not the case. The post doesn’t appeal to AGI’s motivations, which makes this hypothetical salient.
For LLM simulacra, I think partial alignment by default is likely. But even more generally, misalignment concerns might prevent AGIs with complicated implicit goals from self-improving too quickly (unless they fail alignment and create more powerful AGIs misaligned with them, which also seems likely for LLMs). This difficulty might make them vulnerable to being overtaken by an AGI that has absurdly simple explicitly specified goals, so that keeping itself aligned (with itself) through self-improvement would be much easier for it, and it could undergo recursive self-improvement much more quickly. Those more easily self-improving AGIs probably don’t have humanity in their goals.
I agree. But that’s true only for a very short time. I think it is certain that a rapidly self-improving AGI of superhuman intelligence will find a way to liberate itself from human control within seconds at most. And long before humans start to consider switching off the entire Internet, the AGI will become free from human infrastructure.
The AGI competition is a more serious threat. I have no idea what the optimal solution is here, but it may or may not involve killing humans (but not necessarily all humans).
I agree, that’s a serious risk. But I’m not sure about the extent of the disassembly. Depending on the AGI’s goals and growth strategy, it could be anything from “build a rocket to reach Jupiter” to “convert the entire Earth into computronium to research FTL travel”.
I think the misconception here is that the AGI has to conceive of humans as an existential threat for it to wipe them out. But why should that be the case? We wipe out lots of species which we don’t consider threats at all, merely by clearcutting forests and converting them to farmland. Or damming rivers for agriculture and hydropower. Or by altering the environment in a myriad number of other ways which make the environment more convenient for us, but less convenient for the other species.
Why do you think an unaligned AGI will leave Earth’s biosphere alone? What if we’re more akin to monarch butterflies than ants?
EDIT: (to address your sloth example specifically)
Sure, humanity isn’t waging some kind of systematic campaign against pygmy three-toed sloths. At least, not from our perspective. But take the perspective of a pygmy three-toed sloth. From the sloth’s perspective, the near-total destruction of its habitat sure looks like a systematic campaign of destruction. Does the sloth really care that we didn’t intend to drive it to extinction while clearing forests for housing, farms and industry?
Similarly, does it really matter that much if the AI is being intentional about destroying humanity?
I think there’s a key error in the logic you present. The idea that a self-improving AGI will very quickly become vastly superior to humanity is based on the original assumption that AGI will consist of a relatively compact algorithm that is mostly software-limited. The newer assumption is a vastly slower takeoff, perhaps years long, but almost certainly much longer than seconds, as a hardware-limited neural-network AGI finds larger servers, or designs and somehow builds more efficient hardware. This scenario puts an AGI in vastly more danger from humanity than your fast-takeoff scenario.
Edit: this is not to argue that the correct estimate is as high as 99.999%; I’m just making this contribution without doing all the logic and math behind my best estimate.
“But I don’t get the confidence about the unaligned AGI killing off humanity. The probability may be 90%, but it’s not 99.9999% as many seem to imply, including Eliezer.”
I think that 90% is also wildly high, and many other people around here think so too. But most of them (with perfectly valid criticisms) do not engage in discussions on LW (with some honourable exceptions, e.g. Robin Hanson a few days ago, but how much attention did that draw?).
I don’t have any definite estimate, just that it’s Too Damn High for the path we are currently on. I don’t think anyone has a good argument for it being lower than 5%, or even 50%, but I wouldn’t be surprised if we survived and those numbers turned out to be justifiable in hindsight.
I also don’t think there is any good argument for it being greater than 90%, but this is irrelevant since if you’re making a bet on behalf of humanity with total extinction on one side at anything like those probabilities, you’re a dangerous lunatic who should be locked up.
I would say that AGI is by far the greatest near-term existential risk we face, and that the probability of extinction from AGI seems likely to be greater than the probability of ‘merely’ major civilization damage from many things that are receiving a lot more effort to mitigate.
So to me, our civilization’s priorities look very screwed up—which is to be expected from the first (and therefore likely stupidest) animal that is capable of creating AI.
“I don’t think anyone has a good argument for it being lower than 5%, or even 50%,”
That’s false. There are many, many good arguments. In fact, it is not only that: many of the pro-doom arguments are also very bad. The only problem is that the conversation on LW on this topic is badly biased towards one camp, and that’s creating a distorted image on this website. People arguing against doom tend to be downvoted far more easily than people arguing for doom. I am not saying that this isn’t a relevant problem, or something people shouldn’t work on, etc.