I suspect a lot of the negative reactions to Yudkowsky’s article aren’t about norms, exactly, but rather about a disagreement over how far we should be willing to go to slow down AI.
Yudkowsky is on the extreme end of the spectrum, which views airstrikes leading to global nuclear warfare as okay if AI is slowed down.
Suffice it to say, if you don’t believe that doom is certain, then you will have massive issues with going this far for AI safety.
Despite my disagreements with your and TurnTrout’s models of how optimistic to be about alignment, I also have many issues with Eliezer’s position on AI safety.
My problem isn’t people disagreeing, it’s none of the people disagreeing actually pointing out what they think is the specific flaw in EY’s worries, and what are we doing to avoid them materialising. When so many of the people who are confident there’s no danger don’t seem to understand key points of the argument I just get early COVID vibes all over again.
I’ll explain the issues I have with Eliezer Yudkowsky’s position in a nutshell:
1. Alignment is almost certainly easier than Yudkowsky or the extremely pessimistic people think. In particular, alignment is progressing way more than the extremely pessimistic models predicted.
2. I don’t think that slowing AI down instead of accelerating alignment is the best choice, primarily because I think we should mostly try to improve our chances on the current path rather than overturn it.
3. Given that I am an optimist on AI safety, I don’t really agree with Eliezer’s suggestions on how AI should be dealt with.
No. 1 would convince me more if we were seeing good alignment in the existing, still subhuman models. I honestly think there are multiple problems with alignment as a concept, but I also expect that there would be a significant difficulty jump when dealing with superhuman AI (for example, RLHF becomes entirely useless).
No. 2 I don’t quite understand: we care about the relative speed of the two, so wouldn’t anything that says “let’s move people from capabilities to alignment research”, like the moratorium asks, do exactly what you say? You can’t arbitrarily speed up one field without affecting the other. Human resources are limited, and you won’t get much mileage out of suddenly funding thousands of CS graduates to work on alignment while the veterans all advance capabilities. There are trade-offs at work. You need to actively rebalance your resources to avoid alignment just playing catch-up. It’s kind of an essential part of the work anyway; imagine NASA having hundreds of engineers developing gigantic rocket engines and a team of like four guys working on control systems. You can’t go to the Moon with raw power alone.
No. 3 depends heavily on what we think the consequences of misaligned AGI are. How dangerous Eliezer’s proposal is also depends on how much countries want to develop AGI. Consider e.g. biological weapons, where it’s probably fairly easy to get a consensus on “let’s just not do that” once everyone has realised how expensive, complex and ultimately still likely to blow up in your face they are. Vice versa, if alignment is easy, there’s probably no reason why anyone would put such an agreement in place; but there needs to be some evidence of alignment being easy, and we need it soon. We can’t wait for the point where if we’re wrong the AI destroys the world to find out. That’s not called a plan, that’s just called being lucky, if it goes well.
No. 1 would convince me more if we were seeing good alignment in the existing, still subhuman models. I honestly think there are multiple problems with alignment as a concept, but I also expect that there would be a significant difficulty jump when dealing with superhuman AI (for example, RLHF becomes entirely useless).
The good news is we have better techniques than RLHF, which as you note is not particularly useful as an alignment technique.
On alignment not making sense as a concept, I agree somewhat. In the case of an AI and a human, I think that alignment is sensible, but as you scale up, it increasingly devolves into nonsense until you just force your own values.
No. 2 I don’t quite understand—we care about the relative speed of the two, wouldn’t anything that says “let’s move people from capabilities to alignment research” like the moratorium asks do exactly what you say?
Not exactly, though what I’m envisioning is that you can use a finetuned AI to do alignment research, and while there are capabilities externalities, they may be necessary depending on how much feedback we need in order to solve the alignment problem.
Also, I think part of the disagreement is that we are coming from different starting points on how much progress we have made on alignment.
No. 3 depends heavily on what we think the consequences of misaligned AGI are.
This is important, but there are other considerations.
For example, the most important thing to think about with Eliezer’s plan is which ethics/morals you use.
Next, you need to consider the consequences of both aligned and misaligned AGIs. And I suspect they net out to much smaller consequences for AGI once you sum up the positives and negatives, assuming a consequentialist ethical system.
The problem is Eliezer’s treaty would basically imply GPU production is enough to start a war. This is a much more severe consequence than almost any treaty ever done, and this has very negative impacts under a consequentialist ethical system.
evidence of alignment being easy, and we need it soon. We can’t wait for the point where if we’re wrong the AI destroys the world to find out. That’s not called a plan, that’s just called being lucky, if it goes well.
I think this is another point of disagreement. While I wouldn’t like to test the success without dignity hypothesis, also known as luck, I do think there’s a non-trivial probability of that happening, compared to other alignment people who think the chance is effectively epsilon.
Under international law, counterfeiting another nation’s currency is considered an act of war and you can “legally” go to war to stop it… if you can bomb a printing press, is it ridiculous to say you can’t have a treaty that says you can bomb a GPU foundry?
(The two most recent cases of a government actually counterfeiting another nation’s currency were Nazi Germany during World War II, which made counterfeit British pounds as part of its military strategy, and the “supernote” US dollar produced by North Korea.)
And in the end no one bombed North Korea, because saying something is an act of war doesn’t imply automatic war anyway, it’s subtler than that. Honestly in the hypothetical “no GPUs” world you’d probably have all the major States agreeing it’s a danger to them and begrudgingly cooperating on those lines, and the occasional pathetic attempt by some rogue actor with nothing to lose might be nipped in the bud via sanctions or threats. The big question really is how detectable such attempts would be compared to developing e.g. bacteriological weapons. But if tomorrow we found out that North Korea is developing Super Smallpox and plans to release it, what would we do? We are already in a similar world, we just don’t think much about it because we’ve gotten used to this being the precarious equilibrium we exist in.
Next, you need to consider the consequences of both aligned and misaligned AGIs. And I suspect they net out to much smaller consequences for AGI once you sum up the positives and negatives, assuming a consequentialist ethical system.
I find this sort of argument kinda nonsensical. Like, yes, it’s useful to conceptualise goods and harms as positives and negatives you balance, but in practice you can’t literally put numbers on them and run the sums, especially not with so many uncertainties at stake. It’s always possible to fudge the numbers and decide that some values are unimportant and some are super important and lo and behold, the calculation turns in your favour! In the end it’s no better than deontology or simply saying “I think this is good”; there is no point trying to vest it with a semblance of objectivity that just isn’t there. I am a consequentialist and I think that overall AGI is on the net probably bad for humanity, and I include also some possible outcomes from aligned AGI in there.
I do think there’s a non-trivial probability of that happening, compared to other alignment people who think the chance is effectively epsilon.
I don’t think it’s that improbable either, I just think it’s irresponsible either way when so much is at stake. I think the biggest possible points of failure of the doom argument are:
we just aren’t able to build AGI anytime soon (but in that case the whole affair turns out to be much ado about nothing), or
we are able to build AGI, but then AGI can’t really push past to ASI. This might be purely chance, or the result of us using approaches that merely “copy” human intelligence but aren’t able to transcend it (for example, if becoming superintelligent would require being trained on text written by superintelligent entities)
So, sure, we may luck out, though that leaves us “only” with already plenty disruptive human-level AGI. Regardless, this makes the world potentially a much more unstable powder keg. Even without going specifically down the road EY mentions, I think nuclear and MAD analogies do apply because the power in play is just that great (in fact I am writing a post on this, will go up tomorrow if I can finish it).
It’s always possible to fudge the numbers and decide that some values are unimportant and some are super important and lo and behold, the calculation turns in your favour! In the end it’s no better than deontology or simply saying “I think this is good”; there is no point trying to vest it with a semblance of objectivity that just isn’t there.
Is this not simply the fallacy of gray?
As the saying goes, it’s easy to lie with statistics, but even easier to lie without them. Certainly you can fudge the numbers to make the result say anything, but if you show your work then the fudging gets more obvious.
I agree that laying out your thinking at least forces you to specifically elucidate your values. That way people can criticise the precise assumptions they disagree with, and you can’t easily back out of them. I don’t think the “lying with statistics” saying applies in its original meaning because really this is entirely about subjective terminal values. “Because I like it this way” is essentially what it boils down to no matter how you slice it.
In the end it’s no better than deontology or simply saying “I think this is good”; there is no point trying to vest it with a semblance of objectivity that just isn’t there.
You’re right that it isn’t an objective calculation, and apparently it requires more subjective assumptions, so I’ll agree that we really shouldn’t be treating this as though it’s an objective calculation.
I don’t think it’s that improbable either, I just think it’s irresponsible either way when so much is at stake.
I agree that testing that hypothesis is dangerously irresponsible, given the stakes involved. That’s why I still support alignment work.
If success without dignity happens, I think it will be due to some of the following factors:
1. Alignment turns out to be really easy by default; that is, naive ideas like RLHF just work, or it turns out that value learning is almost trivial.
2. Corrigibility is really easy or trivial to achieve, such that alignment isn’t relevant, because humans can redirect the AI’s goals easily. In particular, it’s easy to get AIs to respect a shutdown order.
3. We can’t make AGI, or it’s too hard to progress AGI to ASI.
These are the major factors I view as likely in a success without dignity case, i.e. one where we survive AGI/ASI via luck.
I find 1 unlikely, 2 almost impossible (or rather, it would imply partial alignment, in which at least you managed to impress Asimov’s Second Law of Robotics into your AGI above all else), and 3 the most likely, but also unstable (what if your 10^8 instances of AGI engineers suddenly achieve a breakthrough after 20 years of work?). So this doesn’t seem particularly satisfying to me.
Responding to your #1, do you think we’re on track to handle the cluster of AGI Ruin scenarios pointed at in 16-19? I feel we are not making any progress here other than towards verifying some properties in 17.
16: outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.
17: on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.
18: There’s no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is ‘aligned’.
19: there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.
This is not something he said, and not something he thinks. If you read what he wrote carefully, through a pedantic decoupling lens, or alternatively with the context of some of his previous writing about deterrence, this should be pretty clear. He says that AI is bad enough to put a red line on; nuclear states put red lines on lots of things, most of which are nowhere near as bad as nuclear war is.
In response to the question,
“[Y]ou’ve gestured at nuclear risk. … How many people are allowed to die to prevent AGI?”,
he wrote:
“There should be enough survivors on Earth in close contact to form a viable reproductive population, with room to spare, and they should have a sustainable food supply. So long as that’s true, there’s still a chance of reaching the stars someday.”
He later deleted that tweet because he worried it would be interpreted by some as advocating a nuclear first strike.
I’ve seen no evidence that he is advocating a nuclear first strike, but it does seem to me to be a fair reading of that tweet that he would trade nuclear devastation for preventing AGI.
I’ve seen no evidence that he is advocating a nuclear first strike, but it does seem to me to be a fair reading of that tweet that he would trade nuclear devastation for preventing AGI.
Most nuclear powers are willing to trade nuclear devastation for preventing the other side’s victory. If you went by sheer “number of surviving humans”, your best reaction to seeing the ICBMs fly towards you should be to cross your arms, make your peace, and let them hit without lifting a finger. Less chance of a nuclear winter and extinction that way. But the way deterrence prevents that from happening is by pre-commitment to actually just blowing it all up if someone ever tries something funny. That is hardly less insane than what EY suggests, but it kinda makes sense in context (but still, with a God’s eye view on humanity, it’s insane, and just the best way we could solve our particular coordination problem).
There’s a big difference between pre-committing to X so you have a credible threat against Y, vs. just outright preferring X over Y. In the quoted comment, Eliezer seems to have been doing the latter.
“Most humans die in a nuclear war, but human extinction doesn’t happen” is presumably preferable to “all biological life on Earth is eaten by nanotechnology made by an unaligned AI that has worthless goals”. It should go without saying that both are absolutely terrible outcomes, but one actually is significantly more terrible than the other.
Note that this is literally one of the examples in the OP—discussion of axiology in philosophy.
Right, but of course the absolute, certain implication from “AGI is created” to “all biological life on Earth is eaten by nanotechnology made by an unaligned AI that has worthless goals” requires some amount of justification, and that justification for this level of certainty is completely missing.
In general such confidently made predictions about the technological future have a poor historical track record, and there are multiple holes in the Eliezer/MIRI story, and there is no formal, canonical write up of why they’re so confident in their apparently secret knowledge. There’s a lot of informal, non-canonical, nontechnical stuff like List of Lethalities, security mindset, etc. that’s kind of gesturing at ideas, but there are too many holes and potential objections to have their claimed level of confidence, and they haven’t published anything formal since 2021, and very little since 2017.
We need more than that if we’re going to confidently prefer nuclear devastation over AGI.
The trade-off you’re gesturing at is really risk of AGI vs. risk of nuclear devastation. So you don’t need absolute certainty on either side in order to be willing to make it.
If the former, then I don’t understand your comment and maybe a rewording would help me.
If the latter, then I’ll just reiterate that I’m referring to Eliezer’s explicitly stated willingness to trade off the actuality of (not just some risk of) nuclear devastation to prevent the creation of AGI (though again, to be clear, I am not claiming he advocated a nuclear first strike). The only potential uncertainty in that tradeoff is the consequences of AGI (though I think Eliezer’s been clear that he thinks it means certain doom), and I suppose what follows after nuclear devastation as well.
And how credible would your precommitment be if you made it clear that you actually prefer Y, you’re just saying you’d do X for game theoretical reasons, and you’d do it, swear? These are the murky cognitive waters in which sadly your beliefs (or at least, your performance of them) affect the outcome.
One’s credibility would be less of course, but Eliezer is not the one who would be implementing the hypothetical policy (that would be various governments), so it’s not his credibility that’s relevant here.
I don’t have much sense he’s holding back his real views on the matter.
But on the object level, if you do think that AGI means certain extinction, then that’s indeed the right call (consider also that a single strike on a data centre might mean a risk of nuclear war, but that doesn’t mean it’s a certainty. If one listened to Putin’s barking, every bit of help given to Ukraine is a risk of nuclear war, but in practice Russia just swallows it up and lets it go, because no one is actually very eager to push that button, and they still have way too much to lose from it).
The scenario in which Eliezer’s approach is just wrong is if he is vastly overestimating the risk of an AGI extinction event or takeover. This might be the case, or might become so in the future (for example imagine a society in which the habit is to still enforce the taboo, but alignment has actually advanced enough to make friendly AI feasible). It isn’t perfect, it isn’t necessarily always true, but it isn’t particularly scandalous. I bet you lots of hawkish pundits during the Cold War have said that nuclear annihilation would have been preferable to the worldwide victory of Communism, and that is a substantially more nonsensical view.
I agree that if you’re absolutely certain AGI means the death of everything, then nuclear devastation is preferable.
I think the absolute certainty that AGI does mean the death of everything is extremely far from called for, and is itself a bit scandalous.
(As to whether Eliezer’s policy proposal is likely to lead to nuclear devastation, my bottom line view is it’s too vague to have an opinion. But I think he should have consulted with actual AI policy experts and developed a detailed proposal with them, which he could then point to, before writing up an emotional appeal, with vague references to air strikes and nuclear conflict, for millions of lay people to read in TIME Magazine.)
I think the absolute certainty in general terms would not be warranted; the absolute certainty if AGI is being developed in a reckless manner is more reasonable. Compare someone researching smallpox in a BSL-4 lab versus someone juggling smallpox vials in a huge town square full of people, and what probability each of them makes you assign to a smallpox pandemic being imminent. I still don’t think AGI would necessarily mean doom, simply because I don’t fully buy that its ability to scale up to ASI is 100% guaranteed.
However, I also think in practice that would matter little, because states might still see even regular AGI as a major threat. Having infinite cognitive labour is such a broken hax tactic it basically makes you Ruler of the World by default if you have an exclusive over it. That alone might make it a source of tension.
We don’t know with confidence how hard alignment is, and whether something roughly like the current trajectory (even if reckless) leads to certain death if it reaches superintelligence.
There is a wide range of opinion on this subject from smart, well-informed people who have devoted themselves to studying it. We have a lot of blog posts and a small number of technical papers, all usually making important (and sometimes implicit and unexamined) theoretical assumptions which we don’t know are true, plus some empirical analysis of much weaker systems.
We do not have an established, well-tested scientific theory like we do with pathogens such as smallpox. We cannot say with confidence what is going to happen.
Yeah, at the very least it’s calling for billions dead across the world, because once we realize what Eliezer wants, this is the only realistic outcome.
I don’t agree billions dead is the only realistic outcome of his proposal. Plausibly it could just result in actually stopping large training runs. But I think he’s too willing to risk billions dead to achieve that.
I suspect a lot of the negative reactions to Yudkowsky’s article aren’t about norms, exactly, but rather about a disagreement over how far we should be willing to go to slow down AI.
Yudkowsky is on the extreme end of the spectrum, which views airstrikes leading to global nuclear warfare as okay if AI is slowed down.
Suffice it to say, if you don’t believe that doom is certain, then you will have massive issues with going this far for AI safety.
Yes, this is why I take issue with his position.
Despite my disagreements with your and TurnTrout’s models of how optimistic to be about alignment, I also have many issues with Eliezer’s position on AI safety.
My problem isn’t people disagreeing, it’s none of the people disagreeing actually pointing out what they think is the specific flaw in EY’s worries, and what are we doing to avoid them materialising. When so many of the people who are confident there’s no danger don’t seem to understand key points of the argument I just get early COVID vibes all over again.
Here’s a ~12,000 word post of me doing exactly that: My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”
I’ll explain the issues I have with Eliezer Yudkowsky’s position in a nutshell:
Alignment is almost certainly easier than Yudkowsky or the extremely pessimistic people think. In particular, alignment is progressing way more than the extremely pessimistic models predicted.
I don’t think that slowing AI down instead of accelerating alignment is the best choice, primarily because I think we should mostly try to improve our chances on the current path than overturn our current path.
Given that I am an optimist on AI safety, I don’t really agree with Eliezer’s suggestions on how AI should be dealt with.
No. 1 would convince me more if we were seeing good alignment in the existing, still subhuman models. I honestly think there are multiple problems with alignment as a concept, but I also expect that there would be a significant difficulty jump when dealing with superhuman AI (for example, RLHF becomes entirely useless).
No. 2 I don’t quite understand—we care about the relative speed of the two, wouldn’t anything that says “let’s move people from capabilities to alignment research” like the moratorium asks do exactly what you say? You can’t arbitrarily speed up one field without affecting the other, really, human resources are limited, and you won’t get much mileage out of suddenly funding thousands of CS graduates to work on alignment while the veterans all advance capabilities. There’s trade offs at work. You need to actively rebalance your resources to avoid alignment just playing catch up. It’s kind of an essential part of the work anyway; imagine NASA having hundreds of engineers developing gigantic rocket engines and a team of like four guys working on control systems. You can’t go to the Moon with raw power alone.
No. 3 depends heavily on what we think the consequences of misaligned AGI are. How dangerous Eliezer’s proposal is also depends on how much do countries want to develop AGI. Consider e.g. biological weapons, where it’s probably fairly easy to get a consensus on “let’s just not do that” once everyone has realised how expensive, complex and ultimately still likely to blow up in your face they are. Vice versa, if alignment is easy, there’s probably no reason why anyone would put such an agreement in place; but there needs to be some evidence of alignment being easy, and we need it soon. We can’t wait for the point where if we’re wrong the AI destroys the world to find out. That’s not called a plan, that’s just called being lucky, if it goes well.
The good news is we have better techniques than RLHF, which as you note is not particularly useful as an alignment technique.
On alignment not making sense as a concept, I agree somewhat. In the case of an AI and a human, I think that alignment is sensible, but as you scale up, it increasingly devolves into nonsense until you just force your own values.
Not exactly, though what I’m envisioning is that you can use a finetuned AI to do alignment research, and while there are capabilities externalities, they may be necessary depending on how much feedback we need in order to solve the alignment problem.
Also, I think part of the disagreement is we are coming from different starting points on how much progress we did on alignment.
This is important, but there are other considerations.
For example, the most important thing to think about with Eliezer’s plan is what ethics/morals do you use.
Next, you need to consider the consequences of both aligned and misaligned AGIs. And I suspect they net out to much smaller consequences for AGI once you sum up the positives and negatives, assuming a consequentialist ethical system.
The problem is Eliezer’s treaty would basically imply GPU production is enough to start a war. This is a much more severe consequence than almost any treaty ever done, and this has very negative impacts under a consequentialist ethical system.
I think this is another point of disagreement. While I wouldn’t like to test the success without dignity hypothesis, also known as luck, I do think there’s a non-trivial probability of that happening, compared to other alignment people who think the chance is effectively epsilon.
Under international law, counterfeiting another nation’s currency is considered an act of war and you can “legally” go to war to stop it… if you can bomb a printing press, is it ridiculous to say you can’t have a treaty that says you can bomb a GPU foundry?
(The two most recent cases of a government actually counterfeiting another nation’s currency were Nazi Germany during World War II which made counterfeit British pounds as part of its military strategy, and the “supernote” US dollar produced by North Korea.)
And in the end no one bombed North Korea, because saying something is an act of war doesn’t imply automatic war anyway, it’s subtler than that. Honestly in the hypothetical “no GPUs” world you’d probably have all the major States agreeing it’s a danger to them and begrudgingly cooperating on those lines, and the occasional pathetic attempt by some rogue actor with nothing to lose might be nipped in the bud via sanctions or threats. The big question really is how detectable such attempts would be compared to developing e.g. bacteriological weapons. But if tomorrow we found out that North Korea is developing Super Smallpox and plans to release it, what would we do? We are already in a similar world, we just don’t think much about it because we’ve gotten used to this being the precarious equilibrium we exist in.
I find this sort of argument kinda nonsensical. Like, yes, it’s useful to conceptualise goods and harms as positives and negatives you balance, but in practice you can’t literally put numbers on them and run the sums, especially not with so many uncertainties at stake. It’s always possible to fudge the numbers and decide that some values are unimportant and some are super important and lo and behold, the calculation turns in your favour! In the end it’s no better than deontology or simply saying “I think this is good”; there is no point trying to vest it with a semblance of objectivity that just isn’t there. I am a consequentialist and I think that overall AGI is on the net probably bad for humanity, and I include also some possible outcomes from aligned AGI in there.
I don’t think it’s that improbable either, I just think it’s irresponsible either way when so much is at stake. I think the biggest possible points of failure of the doom argument are:
we just aren’t able to build AGI any soon (but in that case the whole affair turns out to be much ado about nothing), or
we are able to build AGI, but then AGI can’t really push past to ASI. This might be purely chance, or the result of us using approaches that merely “copy” human intelligence but aren’t able to transcend it (for example, if becoming superintelligent would require being trained on text written by superintelligent entities)
So, sure, we may luck out, thought that leaves us “only” with already plenty disruptive human-level AGI. Regardless, this makes the world potentially a much more unstable powder keg. Even without going specifically down the road EY mentions, I think nuclear and MAD analogies do apply because the power in play is just that great (in fact am writing a post on this, will go up tomorrow if I can finish it).
Is this not simply the fallacy of gray?
As saying goes, it’s easy to lie with statistics, but even easier to lie without them. Certainly you can fudge the numbers to make the result say anything, but if you show your work then the fudging gets more obvious.
I agree that laying out your thinking at least forces you to specifically elucidate your values. That way people can criticise the precise assumptions they disagree with, and you can’t easily back out of them. I don’t think the “lying with statistics” saying applies in its original meaning because really this is entirely about subjective terminal values. “Because I like it this way” is essentially what it boils down to no matter how you slice it.
You’re right that it isn’t an objective calculation, and apparently it requires more subjective assumptions, so I’ll agree that we really shouldn’t be treating this as though it’s an objective calculation.
I agree that testing that hypothesis is dangerously irresponsible, given the stakes involved. That’s why I still support alignment work.
I think the biggest things if success without dignity happens, I think it will be due to some of the following factors:
Alignment turns out to be really easy by default, that is something like the naive ideas like RLHF just work, or it turns out that value learning is almost trivial.
Corrigibility is really easy or trivial to do, such that alignment isn’t relevant, because humans can redirect it’s goals easily. In particular, it’s easy to get AIs to respect a shutdown order.
We can’t make AGI, or it’s too hard to progress AGI to ASI.
These are the major factors I view as likely in a success without dignity case, or we survive AGI/ASI via luck.
I find 1 unlikely, 2 almost impossible (or rather, it would imply partial alignment, in which at least you managed to impress Asimov’s Second Law of Robotics into your AGI above all else), and 3 the most likely, but also unstable (what if your 10^8 instances of AGI engineers suddenly achieve a breakthrough after 20 years of work?). So this doesn’t seem particularly satisfying to me.
Responding to your #1, do you think we’re on track to handle the cluster of AGI Ruin scenarios pointed at in 16-19? I feel we are not making any progress here other than towards verifying some properties in 17.
This is not something he said, and not something he thinks. If you read what he wrote carefully, through a pedantic decoupling lens, or alternatively with the context of some of his previous writing about deterrence, this should be pretty clear. He says that AI is bad enough to put a red line on; nuclear states put red lines on lots of things, most of which are nowhere near as bad as nuclear war is.
In response to the question,
“[Y]ou’ve gestured at nuclear risk. … How many people are allowed to die to prevent AGI?”,
he wrote:
“There should be enough survivors on Earth in close contact to form a viable reproductive population, with room to spare, and they should have a sustainable food supply. So long as that’s true, there’s still a chance of reaching the stars someday.”
He later deleted that tweet because he worried it would be interpreted by some as advocating a nuclear first strike.
I’ve seen no evidence that he is advocating a nuclear first strike, but it does seem to me to be a fair reading of that tweet that he would trade nuclear devastation for preventing AGI.
Most nuclear powers are willing to trade nuclear devastation for preventing the other side’s victory. If you went by sheer “number of surviving humans”, your best reaction to seeing the ICBMs fly towards you should be to cross your arms, make your peace, and let them hit without lifting a finger. Less chance of a nuclear winter and extinction that way. But the way deterrence prevents that from happening is by pre-commitment to actually just blowing it all up if someone ever tries something funny. That is hardly less insane than what EY suggests, but it kinda makes sense in context (but still, with a God’s eye view on humanity, it’s insane, and just the best way we could solve our particular coordination problem).
There’s a big difference between pre-committing to X so you have a credible threat against Y, vs. just outright preferring X over Y. In the quoted comment, Eliezer seems to have been doing the latter.
“Most humans die in a nuclear war, but human extinction doesn’t happen” is presumably preferable to “all biological life on Earth is eaten by nanotechnology made by an unaligned AI that has worthless goals”. It should go without saying that both are absolutely terrible outcomes, but one actually is significantly more terrible than the other.
Note that this is literally one of the examples in the OP—discussion of axiology in philosophy.
Right, but of course the absolute, certain implication from “AGI is created” to “all biological life on Earth is eaten by nanotechnology made by an unaligned AI that has worthless goals” requires some amount of justification, and that justification for this level of certainty is completely missing.
In general, such confidently made predictions about the technological future have a poor historical track record. There are multiple holes in the Eliezer/MIRI story, and there is no formal, canonical write-up of why they’re so confident in their apparently secret knowledge. There’s a lot of informal, non-canonical, nontechnical stuff like List of Lethalities, security mindset, etc. that’s kind of gesturing at ideas, but there are too many holes and potential objections to justify their claimed level of confidence, and they haven’t published anything formal since 2021, and very little since 2017.
We need more than that if we’re going to confidently prefer nuclear devastation over AGI.
The trade-off you’re gesturing at is really risk of AGI vs. risk of nuclear devastation. So you don’t need absolute certainty on either side in order to be willing to make it.
Did you intend to say risk off, or risk of?
If the former, then I don’t understand your comment and maybe a rewording would help me.
If the latter, then I’ll just reiterate that I’m referring to Eliezer’s explicitly stated willingness to trade off the actuality of (not just some risk of) nuclear devastation to prevent the creation of AGI (though again, to be clear, I am not claiming he advocated a nuclear first strike). The only potential uncertainty in that tradeoff is the consequences of AGI (though I think Eliezer’s been clear that he thinks it means certain doom), and I suppose what follows after nuclear devastation as well.
And how credible would your precommitment be if you made it clear that you actually prefer Y, you’re just saying you’d do X for game theoretical reasons, and you’d do it, swear? These are the murky cognitive waters in which sadly your beliefs (or at least, your performance of them) affect the outcome.
One’s credibility would be less of course, but Eliezer is not the one who would be implementing the hypothetical policy (that would be various governments), so it’s not his credibility that’s relevant here.
I don’t have much sense he’s holding back his real views on the matter.
But on the object level, if you do think that AGI means certain extinction, then that’s indeed the right call. (Consider also that a single strike on a data centre might carry a risk of nuclear war, but that doesn’t make it a certainty. If one listened to Putin’s barking, every bit of help given to Ukraine is a risk of nuclear war, but in practice Russia just swallows it and lets it go, because no one is actually very eager to push that button, and they still have way too much to lose from it.)
The scenario in which Eliezer’s approach is just wrong is if he is vastly overestimating the risk of an AGI extinction event or takeover. This might be the case, or might become so in the future (for example imagine a society in which the habit is to still enforce the taboo, but alignment has actually advanced enough to make friendly AI feasible). It isn’t perfect, it isn’t necessarily always true, but it isn’t particularly scandalous. I bet you lots of hawkish pundits during the Cold War said that nuclear annihilation would be preferable to the worldwide victory of Communism, and that is a substantially more nonsensical view.
I agree that if you’re absolutely certain AGI means the death of everything, then nuclear devastation is preferable.
I think the absolute certainty that AGI does mean the death of everything is extremely far from called for, and is itself a bit scandalous.
(As to whether Eliezer’s policy proposal is likely to lead to nuclear devastation, my bottom line view is it’s too vague to have an opinion. But I think he should have consulted with actual AI policy experts and developed a detailed proposal with them, which he could then point to, before writing up an emotional appeal, with vague references to air strikes and nuclear conflict, for millions of lay people to read in TIME Magazine.)
I think the absolute certainty in general terms would not be warranted; the absolute certainty if AGI is being developed in a reckless manner is more reasonable. Compare someone researching smallpox in a BSL-4 lab versus someone juggling smallpox vials in a huge town square full of people, and what probability each of them makes you assign to a smallpox pandemic being imminent. I still don’t think AGI would necessarily mean doom, simply because I don’t fully buy that its ability to scale up to ASI is 100% guaranteed.
However, I also think in practice that would matter little, because states might still see even regular AGI as a major threat. Having infinite cognitive labour is such a broken hax tactic it basically makes you Ruler of the World by default if you have exclusive control over it. That alone might make it a source of tension.
We don’t know with confidence how hard alignment is, and whether something roughly like the current trajectory (even if reckless) leads to certain death if it reaches superintelligence.
There is a wide range of opinion on this subject from smart, well-informed people who have devoted themselves to studying it. We have a lot of blog posts and a small number of technical papers, all usually making important (and sometimes implicit and unexamined) theoretical assumptions which we don’t know are true, plus some empirical analysis of much weaker systems.
We do not have an established, well-tested scientific theory like we do with pathogens such as smallpox. We cannot say with confidence what is going to happen.
Yeah, at the very least it’s calling for billions dead across the world, because once we realize what Eliezer wants, this is the only realistic outcome.
I don’t agree billions dead is the only realistic outcome of his proposal. Plausibly it could just result in actually stopping large training runs. But I think he’s too willing to risk billions dead to achieve that.