Why We MUST Create an AGI that Disempowers Humanity. For Real.

This is the second iteration of a post I wrote earlier. Some people commented on the slightly click-baity title, but I chose to keep it for this second iteration because I don’t come to any other conclusion. All of your comments were welcome, though, and I have tried to address your input in this post.

In this post, I’m referring to Eliezer Yudkowsky’s List of Lethalities, also taking the responses by Paul Christiano and Zvi Mowshowitz into consideration. Specifically, I refer to the following statement by Yudkowsky:

6. We need to align the performance of some large task, a ‘pivotal act’ that prevents other people from building an unaligned AGI that destroys the world.

I’d like to explore what the ideal outcome of such a pivotal act looks like. Yudkowsky uses the “burn all GPUs” strategy as a placeholder, but that can hardly be considered ideal. I think the best-case scenario of a pivotal act could look something like this:

  • We create an AGI and sandbox it, for example by removing all network access from the machine the AGI is running on.

  • We try to align the AGI using the best methods we have available at that point.

  • Once we plug the network cable in, the AGI escapes, hacks the planet, and installs itself on all online devices.

  • The AGI supervises all other devices indefinitely and eradicates any competing AGIs.

What can we possibly do?

The question I’m currently pondering is: do we have any other choice? As far as I can see, we have four options for dealing with AGI risks:

A: Ensure that no AGI is ever built.

How far are we willing to go to achieve this outcome? Can anything short of burning all GPUs accomplish it? One of the comments under my original post pointed out that we could theoretically build an AI (not an AGI) that is capable of hacking devices and destroying their GPUs.

Is that even enough, or do we need to burn all CPUs in addition and go fully back to a pre-digital age?

Regulation of AI research can help us gain some valuable time, but not everyone adheres to regulation, so eventually somebody will build an AGI anyway.

B: Ensure that there is no AI apocalypse, even if a misaligned AGI is built.

Is that even possible?

C: Ensure that every AGI created is aligned.

Can we somehow ensure that there is no accident with misaligned AGIs?

What about bad actors that build a misaligned AGI on purpose?

D: Let an aligned AGI take over, as I described above.

Sure, the idea of an AGI taking over the majority of the world’s computing resources does sound scary and is definitely way outside the Overton window, but that is probably what an aligned AGI would do, since otherwise it would not be able to protect us from a misaligned AGI. In fact, we might even postulate that an AGI that does not instantly hack the planet is not properly aligned, since it fails to protect us from AGI risks.

How to act?

Answering the questions I ask under A, B, and C helps us figure out what we should do, as we can try to identify winning paths for each strategy. Let’s hypothetically assign a p(win) to each of the four paths.

My assumption is that p(win_A) > p(win_D) > p(win_C) > p(win_B).

The reason for this assumption is that it is harder to align an AGI than it is to build an AI “GPU nuke” that fries all GPUs, and it’s harder to align all AGIs than only one. I’m not sure about p(win_B), but as far as my understanding goes, it’s astronomically low.

My personal preference is C > B > D > A > lose.

Of course, anything is better than human extinction. Humanity would survive, but path A would lead us to a very undesirable future if it involved a “GPU nuke”. So much of our critical infrastructure and economy depends on GPUs that one has to ask whether there even is a winning path on A that satisfies Yudkowsky’s AI alignment criterion (“less than fifty percent chance of killing more than one billion people”). If path A requires nuking all CPUs as well, I’d definitely answer no.

Ideally, I’d prefer a future where multiple AGIs compete against each other in a free market and none of them is power-seeking. I’d also prefer a future with multiple misaligned-but-otherwise-harmless AGIs over a future with one extremely powerful aligned AGI. However, my preference for C and B over D is rather weak, and p(win_B) and p(win_C) are orders of magnitude smaller than p(win_D). Therefore, from my standpoint, I should focus on path D, although I don’t really like the implications.
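To make that trade-off explicit, here is a minimal sketch of the expected-value reasoning behind it. All numbers are made up purely for illustration: the p(win) values and the utilities attached to each path’s win state are assumptions, not estimates I would defend; only the orderings stated above are respected.

```python
# Toy expected-value comparison of the four paths.
# All numbers are hypothetical; only the orderings from the text are respected:
# p(win_A) > p(win_D) > p(win_C) > p(win_B) and preference C > B > D > A > lose.

p_win = {"A": 0.20, "B": 0.001, "C": 0.01, "D": 0.10}   # assumed win probabilities
utility = {"A": 0.3, "B": 0.8, "C": 1.0, "D": 0.7}      # assumed value of each win state (lose = 0)

# Expected value of betting on a path: p(win) * utility of its win state.
expected_value = {path: p_win[path] * utility[path] for path in p_win}

for path, ev in sorted(expected_value.items(), key=lambda kv: -kv[1]):
    print(f"Path {path}: p(win)={p_win[path]:.3f}  utility={utility[path]:.1f}  EV={ev:.4f}")

# With these numbers, D comes out on top even though I prefer the outcomes of
# C and B: their win probabilities are orders of magnitude smaller, and A's
# win state (fried digital infrastructure) is worth much less to me than D's.
```

The point is that the probability gap swamps the preference gap, which is exactly why I end up focusing on D despite not liking it.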

Such an AGI would need to control the vast majority of devices in order to be able to defend itself effectively against hostile AGIs. If properly aligned, the AGI would of course grant device users computing resources when they need them, and potentially even a private mode, for whatever you do in private mode. But that alone would be a significant disempowerment of humanity.

Some governments will probably hand some or all decision-making over to the AGI as well. At the very least, we have to expect that the AGI will influence decision-making through public opinion, as we’re already seeing with GPT-3. In any case, we need to ask ourselves how autonomous we still are at that point, or whether parliamentary decision-making is only a facade that gives us an illusion of autonomy. So why not let it go all the way? I believe that once AGI is there, a lot of people will prefer being governed by an aligned AGI over being governed by human factions.

The takeaway here is that everyone who experiments with AGI must be aware of these implications. A significant, but mostly welcome, disempowerment of humanity is the best-case scenario of everything that can realistically happen if the experiment succeeds. All that AI alignment researchers can hope for is that their work increases the probability that this best-case scenario ensues rather than a worse one.

When to act?

Imagine a hypothetical AGI doomsday clock, where the time to midnight is halved every time a breakthrough happens that is a prerequisite of AGI, reflecting the increasing speed at which we’re making progress in AI research. At one second to midnight, a misaligned AGI is switched on, and a second later, everything is over.
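To put rough numbers on the metaphor (the starting distance is an arbitrary assumption, not a forecast): if the clock starts at thirty years to midnight, only about thirty halvings separate us from the one-second mark.

```python
import math

# Hypothetical doomsday clock: time to midnight halves with every
# AGI-prerequisite breakthrough. The starting distance is an arbitrary assumption.
start_seconds = 30 * 365 * 24 * 3600   # thirty years, expressed in seconds
target_seconds = 1                     # "one second to midnight"

# Number of halvings needed to shrink the remaining time to one second.
breakthroughs = math.ceil(math.log2(start_seconds / target_seconds))
print(breakthroughs)  # 30
```

The exact numbers don’t matter; the point is that the remaining distance shrinks geometrically, so the gap between “comfortably far away” and “one second to midnight” is only a few dozen breakthroughs.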

If we want to get to a winning path on D, we need to have the alignment problem solved by two seconds before midnight and make sure that the AGI that ultimately gets switched on is aligned. But what if we’re not sure if it is aligned?

Suppose you have a sandboxed AGI that is ready to be released (say, by plugging in the network cable) and that is properly aligned with probability p(win_D). In that case, you’d have to assume that we’re either at two seconds before midnight or at least very close to it. After all, if you, or the group you are working with, are capable of building such an AGI, you can assume that other groups have figured it out as well.

You further assume that you have a higher p(win_D) than all of your competitors. In that hypothetical case, you have two options: wait a day longer to increase your p(win_D), or plug the network cable in. How high does p(win_D) need to be in order for you to plug the network cable in?
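One way to make that question precise, under assumptions of my own that go beyond the scenario above: suppose each extra day of alignment work raises p(win_D) by some increment, but each day also carries some probability that a competitor switches on a misaligned AGI first, which counts as a loss. Waiting is then only worth it if the expected gain outweighs the risk of being preempted. A minimal sketch:

```python
# Toy break-even test for "wait one more day vs. plug the cable in now".
# Both daily_gain and daily_preempt_risk are hypothetical assumptions.

def should_wait(p_win_d: float, daily_gain: float, daily_preempt_risk: float) -> bool:
    """Return True iff waiting one more day has a higher expected p(win).

    p_win_d            -- current probability that the sandboxed AGI is aligned
    daily_gain         -- how much one more day of work raises p_win_d
    daily_preempt_risk -- probability that a competitor launches a misaligned AGI today
    """
    ev_plug_now = p_win_d
    ev_wait = (1 - daily_preempt_risk) * min(p_win_d + daily_gain, 1.0)
    return ev_wait > ev_plug_now

print(should_wait(p_win_d=0.6, daily_gain=0.01, daily_preempt_risk=0.001))   # True: keep working
print(should_wait(p_win_d=0.6, daily_gain=0.0001, daily_preempt_risk=0.01))  # False: plug it in
```

Under this framing there is no absolute threshold for p(win_D); what matters is whether another day of work still buys more alignment probability than the preemption risk takes away.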

Let’s bring path A into the equation. Suppose you have a sandboxed AGI with p(win_D) and a sandboxed GPU nuke with p(win_A) = 1. You can plug the network cable into either the AGI or the GPU nuke, or you can wait one day to increase p(win_D). How high does p(win_D) need to be now? How long do you wait before plugging the network cable into the GPU nuke?
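Extending the same toy model, and reusing the made-up utilities from the earlier sketch (0.7 for a D-win, 0.3 for an A-win, 0 for losing): the GPU nuke wins for sure, but its win state is worth less to me, so the choice becomes a comparison of probability-weighted utilities.

```python
# Toy three-way comparison: plug in the AGI, plug in the GPU nuke, or keep waiting.
# Utilities and the daily preemption risk are hypothetical assumptions.

U_D, U_A, U_LOSE = 0.7, 0.3, 0.0     # assumed values of a D-win, an A-win, and losing

def ev_plug_agi(p_win_d: float) -> float:
    return p_win_d * U_D + (1 - p_win_d) * U_LOSE

def ev_plug_nuke() -> float:
    return 1.0 * U_A                 # p(win_A) = 1 by stipulation

# The AGI beats the nuke as soon as p_win_d * U_D > U_A,
# i.e. p_win_d > U_A / U_D (about 0.43 with these numbers).
print(ev_plug_agi(0.5) > ev_plug_nuke())   # True
print(ev_plug_agi(0.4) > ev_plug_nuke())   # False

def ev_wait_days(p_win_d: float, days: int, daily_gain: float, daily_risk: float) -> float:
    """Expected value of waiting `days` days and then plugging in the AGI."""
    survive = (1 - daily_risk) ** days                      # chance nobody preempts us
    return survive * ev_plug_agi(min(p_win_d + days * daily_gain, 1.0))

for days in (0, 50, 300):
    print(days, round(ev_wait_days(0.5, days, daily_gain=0.01, daily_risk=0.001), 3))
# The expected value rises while alignment gains outpace the preemption risk,
# peaks, and then falls again; the nuke only stays the better bet for as long
# as ev_plug_agi(p_win_d) is expected to remain below U_A.
```

In this framing, you would only ever fire the GPU nuke if you expected the preemption risk to accumulate faster than alignment work can push p(win_D) above U_A / U_D; otherwise, waiting and then plugging in the AGI dominates.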