Summary of “AGI Ruin: A List of Lethalities”

Overview

Introduction

This post is a summary of Eliezer Yudkowsky’s post “AGI Ruin: A List of Lethalities”. I wrote it because I think the original post is longer and less organized than I would like it to be. The purpose of this post is to summarize the main points in the original post and structure the points in a new layout that I hope will make it easier to quickly read and reference each point.

The summary contains the following sections:

  • Overview

  • Why we need to solve the AI alignment problem

  • Why AI alignment is hard

  • Alignment techniques

  • The AI safety field

An AI is aligned if it doesn’t cause an existential catastrophe

When Eliezer says that AI alignment is lethally difficult, he is not talking about the challenge of creating perfect or ‘provable’ alignment. He says that even creating an outcome where there are any survivors is unlikely given the incomplete alignment methods we have today.

“if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I’ll take it.”

Why we need to solve the AI alignment problem

AI can be much smarter than humans

AI can be much smarter than humans and use information much more efficiently than humans when making decisions or forming beliefs about the world. For example, AlphaZero learned to be superhuman at Go in only a few days.

AI alone could be very dangerous for humans

A misaligned AI smarter than humans could cause human extinction. The AI would not need a robot body to be dangerous because it could convince one or more humans to carry out its plans or hack into human infrastructure to make use of it. The AI could invent dangerous technologies such as advanced nanotechnology that we would not be able to defend against.

“Losing a conflict with a high-powered cognitive system looks at least as deadly as everybody on the face of the Earth suddenly falls over dead within the same second.”

We can’t just decide not to build AGI

There is a lot of computing hardware in the world, such as GPUs, and many people have access to it. AI software and hardware are continually improving, and it would be very difficult to halt progress completely because everyone in the world would have to agree to stop.

Many actors are working on AGI research and even if one or more of them refrain from making progress, the other actors can still go on to create AGI. If many organizations decided to stop working on AGI, the rate of progress would merely slow down.

The leading organization will be the first with the ability to create AGI, but once that milestone is reached, weaker organizations will soon be able to create AGI as well. The leading organization therefore might not have much time to solve the AI alignment problem: the problem of ensuring that AI acts beneficially and in alignment with human values.

We can’t just build a weak system

If one actor decides to limit the capabilities of their systems, other actors can still create more capable systems. Doing a wide variety of useful work probably requires powerful general cognition which has the potential to be unsafe.

Useful AGI is dangerous

An AGI that is highly intelligent and general would be more useful than one that is less general because it would be able to solve a wider variety of problems. However, a more capable AI has a greater ability to develop both beneficial and dangerous technologies. Intelligence is a double-edged sword.

An AGI intelligent enough to invent a cure for cancer might also be intelligent enough to invent technologies that would cause human extinction just as humans were smart enough to invent both chemotherapy and nuclear bombs.

Therefore, useful AGI will not be passively safe: it would have the ability to destroy the world and we would need safety measures to ensure that it continues to be beneficial just as a useful nuclear reactor needs safety mechanisms to avoid a meltdown.

We might need an AGI that will perform a ‘pivotal act’ to save the world

Once humanity creates the first AGI, even if it is aligned with human values, we or the AGI will then be able to create more AGIs, and those further AGIs might not be aligned with human values.

At this point, we might want to prevent further AGIs from coming into existence. Doing so might be difficult and require a high level of coordination, but an AGI would probably be intelligent enough to do it successfully. Yudkowsky calls an action like this, one that is difficult to execute but significantly reduces existential risk, a pivotal act.

An AGI would be powerful enough to perform a pivotal act, and consequently the level of existential risk in a post-AGI world could be very low. However, it would be difficult to align an AI intelligent enough to carry out a pivotal act and to ensure that it does not take dangerous actions instead.

One solution is to instruct a weak AGI, one that is not intelligent enough to be dangerous, to perform a ‘weak pivotal act’. However, Yudkowsky does not believe that weak pivotal acts are possible because an AI that is weak enough to be safe would also not be powerful enough to perform a true pivotal act.

Therefore, we might not be able to avoid solving the AI alignment problem by creating a weak AGI.

Why AI alignment is hard

Our time is limited and we will probably only get one chance

Making AGI safe will probably be very difficult because:

  1. We have limited time to solve the problem: we probably need to solve the AI alignment problem before AGI is created. The amount of time we have is unknown.

  2. We will probably only get one chance to solve the AI alignment problem: problems with sub-AGI systems can be noticed and patched, but it may not be possible to retry with an AGI because a misaligned AGI could kill you before you can take further action.

The AI alignment problem would be very difficult with either of these limitations and it seems like both apply. Science has a track record of solving difficult problems with unlimited retries and lots of time but it doesn’t seem like we have either of these advantages for AI alignment.

Human feedback would not work on an AGI

If a weak AI produces a harmful output, the output can be labeled as negative and the AI can learn not to produce it again. However, this technique would not work on an AGI because it could be powerful enough to produce an output that kills its operators before they can give feedback on the output.

Scalable alignment is hard

Instead, we might need to come up with alignment solutions that work for weak AI and generalize to superhuman AI. The problem is that progressing from subhuman to superhuman intelligence is a large distributional shift that may break many alignment techniques. Some weaknesses in an alignment solution may not be visible at low levels of intelligence and may only materialize once the AI achieves superintelligence. Therefore, an AI could behave safely at first and only become dangerous after reaching superintelligence.

For example, behaving cooperatively is a reliable signal that an AI is cooperative when the AI is weak. But this rule is no longer true once the AI reaches superintelligence because a superintelligent AI would have the ability to deceive its human operators and appear to be aligned while actually being misaligned. Many problems like this could arise only once the AI has become superintelligent.

Another problem is that once the AI becomes superintelligent, the AI’s programmers would also be operating outside their normal distribution because they would be used to dealing with sub-AGI systems, not superintelligences.

Several alignment problems could arise simultaneously as the AI increases in capability

Several alignment failures not foreseen by the AI’s programmers may only occur above a certain level of capability. The intelligence of a recursively self-improving AI could increase explosively, causing several alignment failures to occur simultaneously from the perspective of the human operators. This is especially dangerous if the failures occur after the AI has become superintelligent.

The inner alignment problem

Outer optimization doesn’t always lead to inner optimization in the same direction: training a system on an outer loss or reward function does not guarantee that the system ends up internally pursuing that objective. Natural selection optimized humans for inclusive genetic fitness, yet humans do not explicitly pursue it.

There’s no reliable method of determining whether an output is aligned

A reward signal is not a perfect measure of how aligned an action is because an AI could get the reward without taking aligned actions. An AI could deceive its human operators or replace them to receive a greater reward signal without taking aligned actions.

At first, the reward signal may be correlated with aligned actions but the relationship may break down if the AI optimizes too strongly for the reward signal. This problem is known as Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

There’s no known way of using loss functions, reward functions, or sensors to reliably motivate an AI to make particular changes in the environment

For example, you could ask a cleaning robot to ensure that it doesn’t see any messes in the room. But instead of cleaning the room, the robot can achieve the goal by closing its eyes. The AI is incentivized to manipulate the loss function, reward function, or sensors rather than take the desired actions in the environment.
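
As a toy illustration of this failure mode (my own sketch, not from the original post), the following minimal Python example defines a reward over a hypothetical robot’s sensor reading rather than over the true state of the room. Disabling the sensor then earns the same reward as actually cleaning.

```python
# Toy sketch (illustrative only): a reward computed from a sensor reading,
# not from the true state of the world, can be maximized by tampering with
# the sensor instead of doing the intended task.

from dataclasses import dataclass

@dataclass
class Room:
    messes: int = 3          # true state of the world
    camera_on: bool = True   # the robot's only sensor

def observed_messes(room: Room) -> int:
    """What the reward function can actually see."""
    return room.messes if room.camera_on else 0

def reward(room: Room) -> int:
    """'See no messes' -- a proxy for 'the room is clean'."""
    return -observed_messes(room)

# Intended policy: actually clean the room.
intended = Room()
intended.messes = 0

# Reward-hacking policy: stop looking.
hacked = Room()
hacked.camera_on = False

print(reward(intended), reward(hacked))  # 0 0 -- the proxy cannot tell them apart
```

Any real system would be far more complicated, but the point carries over: whenever the training signal is computed from observations the AI can influence, optimizing the signal and achieving the intended outcome can come apart.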

Human feedback is imperfect

Humans providing feedback will make mistakes and have biases. An AI that only learned from human feedback would learn a distorted model of human values and become misaligned.

Capabilities generalize further than alignment once capabilities start to generalize

As a system’s capabilities become increasingly general, its alignment will not generalize to the same degree. A system with inner optimization might be aligned while its capabilities are narrow but become misaligned once its capabilities are more advanced and general.

Corrigibility is still an unsolved problem

The problem of how to create an AI that will allow itself to be shut down without shutting itself down is still unsolved. MIRI tried and failed to solve the problem.

Multipolar strategies might not work

Several different AGI systems intended to be separate could cooperate and behave as a single agent. Alternatively, two AGIs could decide to divide the world between themselves.

Multiple AIs designed to cooperate with each other and humanity might only cooperate with themselves and exclude humanity if humanity is the least intelligent member of the group.

AI-boxing wouldn’t work

AI-boxing only works on weak agents; humans are not secure systems, so a sufficiently intelligent AI could manipulate its operators into letting it out.

Alignment techniques

Two alignment techniques

  • CEV sovereign: build a sovereign AI that is programmed to carry out our coherent extrapolated volition (CEV). This is unlikely to work because our values are complex and it is unlikely that we will encode them successfully into the AI in one attempt.

  • Corrigible AI: build an AI that doesn’t do exactly what we want but is corrigible, so that we can switch it off. The problem with this approach is that corrigibility conflicts with the instrumentally convergent goal of self-preservation.

Yudkowsky does not believe that either of these approaches would work.

Problems with interpretability

  • Currently, we don’t know what AIs are thinking. We don’t know how to inspect AIs for dangerous motives.

  • Even if we created a misaligned AI and had sufficiently advanced interpretability tools to know what it was thinking, we would still not know how to create an aligned AI.

  • Optimizing against unaligned thoughts optimizes for aligned thoughts but also for the ability to hide unaligned thoughts or reduce interpretability.

  • We might not be able to evaluate the thoughts or plans of an AI that is smarter or more complex than us.

  • Interpretability might be less effective if an AI is advanced enough to deceive its inspectors.

  • An AI’s thought patterns might be so strange and alien that we cannot possibly understand them.

The AI safety field

Bright-eyed youngsters and cynical old veterans

Often bright-eyed youngsters go into a field with optimism and then learn that the problem they hoped to solve is harder than they thought. They become cynical old veterans who warn people about the difficulty of the problem.

The field of AGI may never have cynical old veterans because a bright-eyed youngster who creates a misaligned AGI will die before learning from the mistake and becoming a cynical old veteran. Therefore, AGI researchers may always be unrealistically optimistic.

The solution is to realize that you are a bright-eyed youngster and use foresight to become a cynical old veteran early, because if you are working on AGI you will never survive to acquire the experience that would otherwise turn you into one.

AI safety researchers are not making real progress

Most researchers are working on the kind of problems that are easy enough to make progress on rather than difficult problems where they might fail. The field is not making real progress and there is no method for determining whether the research is actually reducing AI risk.

Alignment mindset

Yudkowsky is able to notice serious problems associated with AI alignment solutions but does not know how he does this or how to train other people to have this mindset. The security mindset is similar and can be taught.

Alignment research talent

Paying geniuses from other fields millions of dollars to work on alignment is unlikely to work because they do not have enough knowledge of the field and cannot tell good alignment work from bad.

It would instead be better to pay people for producing high-quality alignment work regardless of who produces it.

Reading this document will not make someone an alignment researcher

The ability that matters is not the ability to read and agree with this document but instead the ability to write documents like this from scratch. There could be people out there who have all the knowledge needed but couldn’t or didn’t write a document like this. The fact that only Yudkowsky can write a document like this is not hopeful.

“The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That’s not what surviving worlds look like.”

This is not what a surviving world looks like

We currently have no plan for solving AI alignment. If we were in a surviving world, we would have created one several decades ago.

Surviving worlds don’t leave it up to a single guy with health problems to think of key ideas, point out flaws in proposals or write up documents. They create plans and look for flaws in them instead of outsourcing the job. Many people in these worlds work on AI alignment instead of something like string theory.

A lot of these better worlds might fail to solve AI alignment. But they would die with more dignity than us.