AI Safety Endgame Stories

Ivan Vendrov, 28 Sep 2022 16:58 UTC

Assume you are in the set of possible worlds where AI takeover happens by default. If you do nothing, then at some point in the 21st century the AI lab Magma develops a transformative AI system. Magma employees perform a number of safety checks, conclude the system is safe enough, and deploy it. They deploy it slowly and incrementally, with careful monitoring. But despite their efforts, the system turns out to be unsafe and the monitoring insufficient, triggering a cascade of events eventually leading to an existential catastrophe.^[1]

I’ll refer to this sequence of events as the “baseline story” going forward.

Assume further that you’re in the narrower set of worlds where this AI catastrophe is contingent on your actions. In other words, there exists a sequence of actions you (or your organization) can take that averts catastrophe: a decisive intervention. This need not be a pivotal act, meaning an intervention that averts all existential risk from AI. It only needs to prevent this specific Magma catastrophe, giving humanity some breathing room, perhaps only a few months or years.^[2]

Let’s try to understand what this decisive sequence of actions could look like. It’s tempting to start at the beginning of the sequence and think about what the first few actions look like. Unfortunately, the most probable starting actions are “meta” actions like thinking really hard, talking to experts, or recruiting more people to work on the problem. These are the same kinds of actions that any successful project starts with! So it doesn’t help us constrain the space of decisive interventions.

Instead, it’s more helpful to start with the endgame: how, in the end, did your actions change the baseline story and avert catastrophe? And what were the last nodes in the causal chain leading up to the change?

At the most abstract level, the baseline story has the following structure. A social process (Magma) instantiates a technological artifact (unsafe AI) which destroys the world. There are three objects here, and any change to the story requires changing the properties of at least one of them.

This leads naturally to a 3-way categorization of endgame stories, though the real endgame story will probably involve some combination of the three:

Changing the technology. You changed something about the technology that Magma had access to, which delayed the deployment or made it safe.
Changing the decision maker. You changed something about Magma, or more broadly the social decision process that led to the unsafe deployment.
Changing the broader world. You changed something about the broader world that made it resilient to Magma’s deployment decision.

In the rest of this post I’ll use this categorization to walk through a diverse array of endgame stories.

Changing the Technology

Differential Development of Safety

Let’s start with the broad endgame story that most technical alignment research is aimed at, differential development of safety technology (a special case of differential technological development):

You develop technology that makes AI safer, with mild competitiveness penalties. Because of your direct work, the technology is developed N years faster than it would have been by default. Magma’s deployment falls within that N-year window, so they use your technology, so their deployment does not lead to catastrophe.

Can we make this story more concrete? How exactly does the new technology prevent catastrophe? The simplest path involves finding a decisive safety augmentation, something like “add an adversarial robustness term to the training objective”: a conceptual breakthrough that Magma adopts immediately because it is relatively easy to implement and aligned with its business objectives.

Another simple path involves decisive monitoring technology, like in Chris Olah’s “interpretability gives you a mulligan” story: differentially advanced interpretability tech allows Magma to detect that the AI is unsafe and abort the deployment.

Other stories involve deeper changes in the technological landscape. The endgame of proposals like Iterated Amplification is to differentially advance a safer way to build ML systems for the same tasks. Perhaps in the baseline story, Magma uses model-based RL to train its model; but a concerted research effort manages to advance amplification capability so much that Magma changes their technology stack and uses amplification instead. The Microscope AI endgame involves an even deeper change, where differential interpretability progress leads Magma to use humans with AI-enhanced understanding instead of deploying an agentic AI system at all. Perhaps the most extreme differential-development endgame is the old MIRI strategy of building a safe-by-construction AGI from first principles, routing around modern ML altogether.

There are also differential development endgame stories that don’t involve AI technology at all—you could instead advance a technology that is an economic substitute for AI such as brain emulation. More speculatively, you could slow down AI development by advancing technologies that profitably redirect key inputs to AI such as compute or software engineering talent—the 2021 crypto boom may have had this effect by accident.

Stretching the definition of technology to include conceptual understanding, another differential development endgame story involves improving our understanding of AI systems in some way, e.g. discovering that large generative models exhibit unpredictable capability jumps as they scale. Knowledge of such a phenomenon could help Magma take the decisive safety precaution that averts catastrophe.

Note that in most cases, just developing the technology is not enough; Magma also needs to know about the technology, and needs to have the ability and incentive to implement or integrate it. Any knowledge you share with Magma is likely to be dual-use; many safety improvements depend on insights that could be used by Magma to advance capabilities instead.

Differential development endgame stories can seem implausible, especially if you’re thinking about interventions on the scale of an individual or a small team. There are, or soon will be, millions of AI researchers and engineers worldwide. How can you possibly reshape the technological landscape enough to get Magma to deploy a substantially different system? One answer is to leverage technological attractor states.

Technological Attractor States

There are strong incentives for researchers and engineers to work on systems that are state-of-the-art. When a new technology becomes state-of-the-art, everyone starts using it and developing techniques to improve it, quickly amplifying what may have been a small performance difference into an insurmountable gulf. Because of this feedback loop, technological development can fall into one of several different self-reinforcing paths, or attractor states. And a very small push at the right time—perhaps just a single compelling prototype or research paper—could change the attractor the world falls into.

To illustrate the key dynamic with a stylized endgame story:

It turns out there are two different ways to build transformative AI, one of which is safe and the other isn’t. Safe AI requires 2x the compute for the same downstream task performance. There are 10 tricky algorithmic improvements like dropout to be discovered, each of which improves compute efficiency 2x. But, critically, the improvements for the two approaches are totally distinct: insight doesn’t transfer between them, just as knowing about dropout doesn’t help you train better SVMs. By default, unsafe AI will win out, because it’s more competitive. But a well-timed burst of research could discover 2 efficiency improvements for safe AI, making it state-of-the-art. Nearly all researchers and corporate labs switch to safe AI. Because the field’s attention is on the safe AI approach, more and more improvements get discovered, and the unsafe AI path falls further and further behind. Eventually, maybe many decades later, Magma trains a transformative AI system, but because of that well-timed burst of research it is safe.
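The arithmetic of this stylized story can be checked with a toy simulation. What follows is only a minimal sketch under assumptions that go beyond the story itself: each approach has its own pool of 10 improvements, the whole field works on whichever approach is strictly more compute-efficient, ties go to the incumbent unsafe approach, and exactly one improvement is found per round. The names `Approach` and `simulate` are hypothetical, introduced purely for illustration.

```python
from dataclasses import dataclass

MAX_IMPROVEMENTS = 10  # assumed size of each approach's pool of improvements
GAIN = 2.0             # each improvement doubles compute efficiency (from the story)
SAFE_PENALTY = 2.0     # safe AI needs 2x the compute at equal progress (from the story)


@dataclass
class Approach:
    name: str
    penalty: float  # compute multiplier relative to the unsafe baseline
    found: int = 0  # improvements discovered so far

    @property
    def efficiency(self) -> float:
        # Task performance per unit of compute, relative to unimproved unsafe AI.
        return GAIN ** self.found / self.penalty


def simulate(burst: int) -> tuple[str, float, float]:
    """Return (winning approach, safe efficiency, unsafe efficiency).

    `burst` is the number of safe-AI improvements found by the
    well-timed research effort before the field picks a direction.
    """
    safe = Approach("safe", SAFE_PENALTY, found=burst)
    unsafe = Approach("unsafe", 1.0)
    while True:
        # Assumption: the field follows the strictly more efficient approach,
        # and a tie leaves everyone on the incumbent (unsafe) path.
        leader = safe if safe.efficiency > unsafe.efficiency else unsafe
        if leader.found >= MAX_IMPROVEMENTS:
            return leader.name, safe.efficiency, unsafe.efficiency
        leader.found += 1  # all research attention goes to the leader


if __name__ == "__main__":
    for burst in range(4):
        print(burst, simulate(burst))
```

Under these assumptions a burst of one improvement only brings safe AI level with unsafe AI, so the field stays on the unsafe path. A burst of two tips it over, after which the gap compounds in the other direction. The sharp threshold comes from the winner-take-all attention rule, not from anything in the real world; softening that rule would blur the threshold.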

A promising concrete endgame story along these lines is Ought’s plan to avoid the dangerous attractor state of AI systems that are optimized end-to-end (“outcome-based systems”) by differentially advancing the capabilities of process-based systems. Process-based systems (i.e. systems that use human-understandable processes to make decisions) may be an attractor state because they are more composable: if most economically valuable tasks can be solved by composing together a few existing systems in a human-understandable way, the incentive for end-to-end optimization is much lower, just as today you wouldn’t train a neural net if you could write a few lines of Python code instead.

The related endgame story of Comprehensive AI Services is that we might be able to navigate to a benign attractor state where safe, narrow tool AIs can do everything a general agent-like AI might do. It’s less clear what a decisive intervention would look like here; one candidate is designing a broadly adopted protocol that interfaces between narrow AI systems.

Changing the Decision Maker

Let’s now assume the technological landscape is fixed, and investigate how we might change the social decision process that causes the catastrophe. We’ll use a broad definition of what counts as the decision process, one that includes not only Magma employees but also Magma’s investors, regulators, cultural influencers, and competitors.

Defusing Races

A key driver of AI risk is what Ajeya Cotra calls the “racing forward” assumption, that at least one powerful organization will be trying to train the most powerful models possible. Others have called this the “AGI race dynamic”. What does a story for defusing this race dynamic look like?

The global AI treaty story involves dramatically increasing global coordination on AI:

You engage in some political process and your intervention leads to a global AI control agreement, analogous to nuclear weapons non-proliferation treaties. The agreement specifies the kinds of AI systems that should not be built, or a set of safeguards that any deployer of advanced AI systems must implement. It has enough enforcement power to actually affect the behavior of the leading AI lab. Because of its compliance with the treaty, Magma doesn’t train or deploy the dangerous AI system.

For those cynical about global political processes, remember that the goal is not to write a treaty that permanently averts AI x-risk: delaying catastrophe by only a few months with some simple safety measures may give enough breathing room for one of the “changing technology” stories to bring us more durable security.

Plausible modifications to the story include aiming for regulations in specific countries (especially the US and China) instead of global coordination, or developing agreements between the leading AI labs that bypass the political process altogether. More locally, individual AI labs can make commitments like the Windfall clause and OpenAI’s “join the leader”^[3] clause, which may help defuse race dynamics.

Because the impacts and risks from AI are so uncertain, it may be that a static treaty is insufficient. As an alternative path to defusing race dynamics, you could create an organization that helps dynamically coordinate safety efforts across the leading AI labs, such as Holden’s hypothetical “IAIA”. The work of the International Atomic Energy Agency is analogous here, since it also deals with a powerful dual-use technology and seeks to promote its positive uses while preventing negative effects from military use and civilian accidents.

Changing Magma’s Culture

If Magma looks anything like existing tech companies, its employees have a great deal of power. They are not simple interchangeable cogs in a profit-maximizing machine; their beliefs and habits strongly influence Magma’s behavior. Hence many plausible endgame stories go through influencing Magma’s employees, for instance:

Because of your work translating AI safety ideas for an ML research audience, ML researchers strongly prefer to work for companies that strongly commit to safe deployment practices. Because ML research talent is the scarcest resource in AI development, Magma is forced to make a strong enforceable commitment to safety, which averts catastrophe.

Alternatives to this story involve the creation of strong norms among ML professionals analogous to extant norms for geneticists, doctors, safety engineers, and cybersecurity experts. You could also influence Magma employees through the general public: popularizing AI risks widely makes unsafe AI companies unfashionable to work for, like cigarette or oil companies today.

The simplest endgame story that leverages employee power is almost trivial:

You become the key decision maker in Magma—perhaps the CEO, or the swing vote if it’s a committee decision. You decide not to train or not to deploy the AI system, averting the catastrophe.

This is an endgame only relevant for a very small set of actors, but a critically important one. There are also many promising meta strategies that indirectly lead to this endgame: you could help someone else to become this key decision maker, or influence the key decision maker by giving them relevant information.

Replacing Magma

A simple replacement story only changes the identity of the organization:

Because of your intervention—perhaps funding it, starting it, or joining it—a different organization leads the AI race, changing the decision maker in the story to be “SafeOrg” rather than Magma. SafeOrg is more risk-averse or better-informed than Magma, so it doesn’t deploy the dangerous AI system that Magma would have.

Notice that this is not a true success story; by default, Magma will still make its unsafe deployment, at roughly the same time as in the baseline story. Maybe even earlier, if there’s any knowledge transfer from SafeOrg to Magma. SafeOrg must use its capability lead to stop Magma. But how?

It could use the lead time to perform safety and monitoring work, essentially implementing one of the “changing technology” strategies above. It could use its capabilities lead as leverage somehow to influence Magma to delay deployment. It could use its capabilities lead to make the world generally safer when Magma does deploy its AI. At the most extreme end, it could use its capabilities lead to perform a pivotal act that leads to permanent existential security.

But wait—none of these stories necessarily require a capabilities lead! Capabilities here are used as just another form of power, mostly fungible with money, cultural or political influence. So the “replace Magma” story is not really an endgame, but rather a meta strategy to amplify philanthropic capital. You started out with $1B and turned it into $100B by investing it in an AI company; now you can use your $100B to prevent AI x-risk. It is not a replacement for a direct endgame strategy like differential development, but a meta strategy that can be compared to other amplifiers like community-building, investing in financial markets, and political lobbying.

The exception is stories where a capabilities lead is not fungible with other forms of power. For example, executing one of the “change the technology” strategies may require access to very high levels of capability: OpenAI and Anthropic’s alignment strategies are both predicated on this. In theory you could just pay or influence Magma to give you access to their technology, but transaction costs like lack of trust could make such an agreement unworkable in practice. Being the capabilities leader also gives you outsized influence on changing culture and setting norms, like when OpenAI’s decision to not open source GPT-2 helped set a norm of delaying the release of state-of-the-art language models.

Changing the Broader World

Let’s now assume you can’t affect Magma or its deployment process at all. The deployment will happen regardless; how could you change the broader world to be more resilient?

This is the hardest endgame to think about in the abstract, because the type of resilience needed depends on the details of the specific failure story. If the failure story involves hacking, you might patch critical information security vulnerabilities. If it involves superhuman persuasion or propaganda, you might harden social media, isolate key decision makers from the Internet, or develop ways to inoculate people against “memetic plagues”. If it involves biorisk, you might regulate on-demand DNA synthesis companies or increase pandemic preparedness.

You could improve social decision making by raising the sanity waterline, developing better research assistants, or improving institutions. Such broad interventions are not precisely targeted at mitigating x-risk, and could even increase it in the wrong hands, so they may be best deployed strategically.

A broadly applicable intervention that is targeted precisely at x-risk is building shelters and refuges, from nuclear submarines to space colonies. Shelters will not save us from the deadliest version of AI x-risk (a recursively self-improving superintelligence expanding at the speed of light), but could potentially avert other scenarios like AI-engineered pandemics or AI-triggered nuclear winter.

Counterfactual Impact and Power-Seeking

It worries me that many of the most promising theories of impact for alignment end up with the structure “acquire power, then use it for good”.

This seems to be a result of the counterfactual impact framing and a bias towards simple plans. You are a tiny agent in an unfathomably large world, trying to intervene on what may be the biggest event in human history. If you try to generate stories where you have a clear, simple counterfactual impact, most of them will involve power-seeking for the usual instrumental convergence reasons. Power-seeking might be necessary sometimes, but it seems extremely dangerous as a general attitude; ironically, human power-seeking is one of the key drivers of AI x-risk to begin with. Benjamin Ross Hoffman writes beautifully about this problem in “Against responsibility”.

I don’t have any good solutions, other than a general bias away from power-seeking strategies and towards strategies involving cooperation, dealism, and reducing transaction costs. I think the pivotal act framing is particularly dangerous, and aiming to delay existential catastrophe rather than preventing it completely is a better policy for most actors.

Thinking about meta strategies is also a useful antidote. For any endgame story where you perform decisive intervention X, you can generate a modified story in which you “assist someone in performing X” or “research possible consequences of X” or “create a social context in which more people are trying to do X” or “build a for-profit company that is incentivized to do more X” or “use AI to do X better”. Or just give someone doing X a hug, a smile, a word of encouragement. Any specific story like this is unlikely to prove decisive; but summing up over all the possible stories, the majority of your expected impact will come from such indirect actions.

A final note of epistemic caution. This post illustrates the breadth of possible interventions that could avert AI x-risk, but it is very far from exhaustive. The world is much bigger and weirder than our minds can comprehend. There are decisive interventions lurking in all sorts of unexpected places. The real history of AI risk in the 21st century, if and when it is written, will be far stranger than any story.

Thanks to Jungwon Byun, Andreas Stuhlmuller, Todor Markov, and Anna Wang for feedback on a draft.

^[1] The story is most directly inspired by Ajeya’s takeover post, but meant to cover most AI x-risk stories including What failure looks like, AGI Ruin: A List of Lethalities, and most multipolar failures. It’s also mostly agnostic to timelines and takeoff speeds.
^[2] I revisit this assumption later in the essay, but I think it is analytically useful for two reasons. First, any plan that leads to true existential security will need to have an answer for how to avert this specific Magma catastrophe, so much of the analysis will transfer over. Second, achieving existential security or building friendly AGI may simply not be possible, and all we can do is tread water and delay catastrophe a few years at a time. Cryptography is like this: we haven’t found any perfect ways to do encryption and may never, but we can chain together enough kludges that extremely secure communication is possible most of the time.
^[3] From the OpenAI charter: “if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be ‘a better-than-even chance of success in the next two years.’”
