Why AI Safety is Hard

This post is meant to summarize the difficulties of AI Safety in a way that my former self from a few months ago would have found helpful. I expect that it might be a useful framing for others, although I’m not saying anything new. I thank David Lindner for helpful comments. All views are my own.

AI systems are getting increasingly powerful and general. It seems plausible that we will get extremely general and powerful systems within a decade, and likely within a few decades[1]. In this post I’ll use the term AGI to mean a system that can perform any mental task at least as well as the best human (but I think similar arguments hold for Transformative AI or PASTA).

We want to use AGI to create a “flourishing future”. There are at least three questions that we need to answer:

  1. How do we align the AGI to its operator?

  2. What do we tell it to do?

  3. How do we get to that point safely?

How do we align AGI to its operator?

Problem: Nobody currently knows how to build AGI, and even if we could, nobody knows how to make it safe to use. How hard this problem is remains uncertain.

Ensuring that an AGI actually does what you want poses a significant technical challenge: the classical alignment problem. Often, people are drawn to this field because they are interested in tackling this technical challenge.

We’ve already seen instances where large language models exhibit behaviors their creators don’t endorse. However, this is just the tip of the iceberg, as current models are still relatively limited in their capabilities.

Two primary challenges in aligning AGI with its operator are convergent instrumental goals and deceptive alignment:

  • Convergent instrumental goals: Certain goals, such as self-preservation, goal preservation, resource acquisition, and self-improvement, are useful for nearly any objective. An AGI would recognize this and act accordingly. It will be really hard to constrain it in the “right” way.

  • Deceptive alignment: The standard process for developing an AI involves training it and then assessing its performance in a controlled environment before gradually deploying it. During this training phase (which includes pre-training, fine-tuning, and red-teaming), we evaluate the model’s alignment. If it appears aligned, it’s either genuinely aligned or deceptively pretending to be aligned because it knows it’s being trained. It’s unclear whether genuine alignment is more likely than deception. Convergent instrumental goals and random “bad luck” can both obstruct the process of producing genuine alignment.

Deceptive alignment complicates matters, as the conventional methods of developing ML models (benchmarks, testing, and “deploy trial-and-error”) might prove insufficient. We could soon enter a regime where tools must work on the first try, because failure could lead to disastrous consequences. If we keep following the current paradigm, ending up in that regime seems likely.

For a more in-depth, yet accessible discussion, check out Holden Karnofsky’s insightful post.

This is the really hard, unsolved problem that initially captured my attention. However, it has become increasingly evident to me that it’s only the first step in a series of challenges.

What do we tell the AGI to do?

Problem: We don’t yet know which tasks to assign an AGI to ensure a positive future outcome.

Assume we have an AGI that reliably carries out our instructions. What should we tell it to do? If it can perform virtually any physically possible task, we could ask it to “eliminate child starvation,” and it might achieve this goal within a month (without resorting to harmful solutions). However, this raises several questions: e.g. should we aim for a thriving future for humanity or for all sentient beings? And even if we agree on that, what does “thriving” mean precisely?

Addressing this issue requires a robust framework for defining the goals and values that will guide the AGI’s actions. Potential approaches include public deliberation or AI-assisted simulations of public deliberation (yeah, really). Another option is to provide the AGI with a set of “reasonable guardrails” (however we define them) and allow it to determine the optimal path toward a desirable future. Shard theory is another, more technical approach, but I haven’t read much about it yet.

If we, as humanity, don’t have a great plan for the AGI, someone will resort to a less-than-great plan, and that could have long-lasting consequences. At present, it’s unclear to me how many people are actively exploring this question; I haven’t seen much discussion of it overall.

How do we get to that point safely?

Problem: Even if a single actor knows how to build safe AGI and what its goal should be, that doesn’t prevent others from causing catastrophes.

Ideally, we would have a single safety-conscious, well-funded research lab working patiently on problems 1 and 2. Competition in this space is detrimental, as building safe AGI and figuring out what to do with it might be substantially more challenging than merely constructing AGI. People in the AI Safety community have long warned about an AI arms race and anticipated that it might happen anyway. We now have empirical evidence of how it’s unfolding.

There are at least two forces that are pushing towards an AI arms race:

  • Lack of trust between actors.

  • Economic incentives.

The lack of trust is especially critical among safety-conscious actors. For example, Elon Musk’s mistrust of DeepMind led to OpenAI’s creation, and Dario Amodei’s mistrust of OpenAI contributed to Anthropic’s founding. Exploring the reasons behind this mistrust would provide valuable context for understanding the complexities of the AI landscape.

The mission of each of these labs relies on building state-of-the-art models, which requires significant funding, mainly for compute resources. Consequently, economic incentives come into play: state-of-the-art models are highly valuable, and both OpenAI and Anthropic capitalize on them extensively.

OpenAI’s first significant partnership was with Microsoft. When Microsoft launched products powered by OpenAI’s advanced models, it threatened Google, prompting Google to join the race. The real-world result of these competitive forces is an open race among at least five players (with some alliances) striving to stay ahead of each other. Beyond those are a host of others, such as Meta, Amazon, Adept AI, Stability AI, and Baidu, that are probably not far behind.

Racing to build and deploy powerful systems can compromise safety. For instance, Microsoft recently laid off one of its responsible AI teams:

Members of the ethics and society team said they generally tried to be supportive of product development. But they said that as Microsoft became focused on shipping AI tools more quickly than its rivals, the company’s leadership became less interested in the kind of long-term thinking that the team specialized in.

Although current models do not yet pose an existential threat, race dynamics establish a culture of hasty releases, which could become difficult to reverse once caution is needed. Even for Anthropic (a public-benefit corporation) and OpenAI, which have organizational structures in place to prioritize safety, reversing that culture could be really hard. For massive, for-profit companies like Google and Microsoft, it’s even worse.

While the AI race’s acceleration is empirically evident, this trend may not continue indefinitely. Katja Grace argued that race dynamics could change if all involved parties perceive substantial risks of destroying humanity – in such a scenario, trying to beat others to the finish line would not make sense. Regulatory intervention or public pressure may also prompt a shift. Recent mainstream media coverage, such as an opinion piece by Ezra Klein in the NYT, indicates that this might be happening. However, these changes may need to happen soon.

Holden suggests that standards and monitoring could be a promising direction. Developing tools to show that a system is genuinely dangerous could slow deployment, as labs are inherently interested in deploying safe systems. Agreements between labs not to deploy such systems would further mitigate risks by reducing race dynamics. ARC Evals is already collaborating with Anthropic and OpenAI on this.

I want to be clear that the biggest risk is not that a “bad actor” gets to AGI first and uses it for a bad purpose (though that’s a big risk, too), but rather that haste leads to someone deploying an unsafe AGI accidentally. Holden frames this as the competition vs. the caution frame. Eliezer visualizes it as fighting over which monkey gets the poisoned banana.

Holden has written about this whole issue more eloquently than I could and sums it up like this:

This seems hard. If we end up in the future envisioned in this piece, I imagine this being extremely stressful and difficult. I’m picturing a world in which many companies, and even governments, can see the huge power and profit they might reap from deploying powerful AI systems before others—but we’re hoping that they instead move with caution (but not too much caution!), take the kinds of actions described above, and that ultimately cautious actors “win the race” against less cautious ones.

So where does this leave us?

You might want to help. I certainly want to help. But it sure seems damn hard.

Zvi also recently summarized the issue and ended his list like this:

A few examples of cases where people potentially made things worse:

  • Despite good intentions, Elon Musk founded OpenAI on the idea that making AI open and available to all would make it safer. This (a) started OpenAI, which might have been pretty bad (see the next point), and (b) contributed to the meme that “AI for all is good”, which is still a common sentiment.

  • OpenAI has since changed its stance on this, but again, (probably) despite good intentions, it seems very plausible (some have more confidence) that OpenAI was overall harmful by advancing capabilities too quickly and generating hype.

  • Paul Christiano invented Reinforcement Learning from Human Feedback (RLHF) to make it possible to have AIs learn complex reward functions that cannot be easily described or even demonstrated (a rough sketch of the idea follows this list). This might really help with alignment, but it has also helped ship impressive products (like ChatGPT), which added to race dynamics. There is debate as to whether this was overall good or bad.

  • The hope with research on mechanistic interpretability is that we can actually understand what is going on inside a model. That would help detect dangerous behaviors, but it can also help build more capable models.

  • OpenAI’s alignment plan is to build a powerful research assistant to help with alignment research. Such a tool could also be used to build more powerful AI systems in general.

  • Anthropic wants to build safe systems and do world class research, but they are also selling a product, adding to AI hype and race dynamics.
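
To make the RLHF point above a bit more concrete, here is a minimal sketch of the reward-modeling step: instead of hand-writing a reward function, a small model is trained on pairwise human preferences. This assumes PyTorch, and the architecture, dimensions, and data are purely illustrative placeholders, not how any production system is implemented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for a language-model backbone with a scalar reward head."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar "reward" per example

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(100):
    # Toy stand-ins for embeddings of a human-preferred and a rejected response.
    chosen = torch.randn(8, 16)
    rejected = torch.randn(8, 16)
    # Bradley-Terry preference loss: push the preferred response to score higher.
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# In full RLHF, the learned reward model is then used to fine-tune a policy with
# reinforcement learning (e.g. PPO), rather than specifying the reward by hand.
```

The point of the sketch is only that preferences ("this response is better than that one") stand in for a reward function we could not write down explicitly; the same property that makes this useful for alignment also makes it useful for shipping products.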

Generally, it seems like you either make something happen and risk it being net-negative, or you don’t do much at all; taking action is challenging and potentially harmful, but doing nothing feels unsatisfying. I’m still unsure what to do. If in doubt, refer to Holden again and perhaps just try to talk about these issues in accessible ways so that more people understand them (in-person conversations seem particularly safe). That seems robustly good and might buy us all more time, although it won’t actually solve the problem.

  1. ^

    Timelines are hard, but see e.g. this Metaculus question.

    Other sources include:

    https://www.cold-takes.com/forecasting-transformative-ai-the-biological-anchors-method-in-a-nutshell/

    https://forum.effectivealtruism.org/posts/ByBBqwRXWqX5m9erL/update-to-samotsvety-agi-timelines
