Countering arguments against working on AI safety

This is a crosspost from my personal website.

One of the questions I’ve grappled with the most throughout the time I’ve been considering working on AI safety is whether we know enough about what a superintelligent system will look like to be able to do any useful research on aligning it. This concern has been expressed by several public intellectuals I respect. For instance, here’s an excerpt from Scott Aaronson’s superb essay “The Ghost in the Quantum Turing Machine”:

“For example, suppose we conclude—as many Singularitarians have—that the greatest problem facing humanity today is how to ensure that, when superhuman AIs are finally built, those AIs will be “friendly” to human concerns. The difficulty is: given our current ignorance about AI, how on earth should we act on that conclusion? Indeed, how could we have any confidence that whatever steps we did take wouldn’t backfire, and increase the probability of an unfriendly AI?”

And here’s Ben Garfinkel on the 80,000 Hours podcast:

“So just because something has some long run significance doesn’t necessarily mean that you can do much to affect the trajectory. So just a quick analogy; it’s quite clear that the industrial revolution was very important. [...] At the same time though, if you’re living in 1750 or something and you’re trying to think, “How do I make the industrial revolution go well? How do I make the world better… let’s say in the year 2000, or even after that”, knowing that industrialization is going to be very important… it’s not really clear what you do to make things go different in a foreseeably positive way.”

Note that Aaronson probably doesn’t endorse this argument anymore, at least not in the same form: he recently joined OpenAI and wrote in the post announcing his decision that he has become significantly more optimistic over time about our ability to attack this problem. However, as I’ve seen a similar argument thrown around many times elsewhere, I still feel the need to address it.

So, the argument that I’m trying to counter is “Okay, I buy that AI is possibly the biggest problem we’re facing, but so what? If I don’t have the slightest idea about how to tackle this problem, I might as well work on some other important thing – power-seeking AI is not the only existential risk in the world.” I believe that even if one assumes we know very little about what a future AGI system will look like, there’s still a strong case for working on AI safety. Without further ado, here’s a list of arguments that have convinced me to keep looking into the field of AI safety.

Counterargument 1: We don’t know how long it will take to build AGI

Provided that there’s even a very small tail risk that a generally capable AI system, one that would kill us all unless aligned, will be built in the next 20-30 years, we don’t have the option of waiting – it’s better to do something based on the limited information we have than to sit on the couch and do nothing. Every bit of preparation, even if largely clueless, is sorely needed.

This rests on the assumption that AGI within the next 20-30 years is within the realm of possibility, but at this point, it seems almost impossible to argue otherwise. In 2014, the median prediction of the field’s 100 top experts was a 50% chance of AGI by 2050, and my feeling is that many experts have since updated towards shorter timelines. You can find a huge list of claims made about AGI timelines here. One might argue that AI researchers suffer from the planning fallacy and that predictions made 30 years ago look awfully similar to predictions made today, but the experts’ predictions seem reasonable when one looks at biological anchors and extrapolates computing trends. Even if one is significantly more skeptical about AI timelines than the median researcher, it seems incredibly difficult to justify assigning virtually zero chance to AGI arriving in the next 20-30 years.

We may perhaps be better able to do AI safety research at some point in the future, but exactly how close to AGI do we need to get before deciding that the problem is urgent enough? Stuart Russell has articulated this point well in Human Compatible: “[I]magine what would happen if we received notice from a superior alien civilization that they would arrive on Earth in thirty to fifty years. The word pandemonium doesn’t begin to describe it. Yet our response to the anticipated arrival of superintelligent AI has been . . . well, underwhelming begins to describe it.”

I’d further note that if AGI is only 20-30 years away, which we cannot rule out, then the AI systems that we have now are much more likely to be at least somewhat similar to the system that eventually becomes our first AGI. In that case, we’re probably not that clueless.

Counterargument 2: On the margin, AI safety needs additional people more than many other fields

Suppose, for the sake of argument, that we were indeed sure AGI is more than 50 years away. We’ll never be sure of that, but assume it anyway. Even in this case, I think we need significantly more AI safety researchers on the margin.

Although we’re no longer in 2014, when there probably weren’t even 10 full-time researchers focusing on AI safety, it still seems that there are up to a hundred similarly qualified AI capabilities researchers for every AI safety researcher. The field is still searching for a common approach to attack the problem, and there’s a lot of disagreement about which paradigms are worth pursuing. Contrast this with a field like particle physics. Although it, too, is still in search of its ultimate, flawless theory, there’s broad agreement that the Standard Model gives predictions good enough for nearly every practical problem, and there’s no looming deadline by which we must either find the Theory of Everything or go extinct.

Even 20 years is a fairly long time, so I don’t think it’s time to give up hope of succeeding at alignment. Rather, anyone who can contribute original ways of tackling the problem, along with good arguments for preferring one paradigm over the others, has an enormous opportunity to make a huge contribution to solving one of the most impactful problems of our time.

Counterargument 3: There’s a gap between capabilities and safety that must be bridged

Continuing in a similar vein to the previous argument, there’s currently a huge gap between our ability to build AI systems and our ability to understand them. Nevertheless, only about 2% of the papers at NeurIPS 2021 were safety-related, and, as argued above, safety researchers make up just a small fraction of the number of capabilities researchers.

If these proportions don’t change, it’s difficult to imagine safety research catching up with our ability to build sophisticated systems in time. Of course, it’s possible that sufficiently good safety research can be done without ever being funded as generously as capabilities research, but even then, the current proportions look far too heavily tilted in favor of capabilities. After all, safety has so far looked like anything but an easy problem, solvable over a summer on a small budget by ten graduate students with a reliable supply of Red Bull.

Counterargument 4: Research on less capable systems might generalize to more capable systems, even if the systems are not very similar

Even if the safety tools we’re building right now turn out to be useless for aligning a generally capable agent, we’ll still have a huge stock of tools and past mistakes to learn from once we get to aligning a true AGI. There is at least some evidence that tools built for one safety-related task help us solve other safety problems more quickly. For example, my impression is that, having already done similar research on computer vision models, Chris Olah’s team at Anthropic was able to make progress on interpreting transformer language models faster than they would have otherwise (I’m not affiliated with Anthropic in any way, though, so please correct me if I’m wrong on this). In the same way, it seems possible that the ability to interpret less capable systems really well could help us build interpretability tools for more capable ones.

I would also expect highly theoretical work, such as MIRI’s research on the foundations of agency and John Wentworth’s theory of natural abstractions, to be useful for aligning an arbitrarily powerful system, provided that these theories have become sufficiently refined by the time we need them. In general, if one buys the argument of Rich Sutton’s “The Bitter Lesson”, it seems preferable to do safety research on simpler AI models and methods that look like they might generalize better (again, conditional on it being possible to make progress on the problem quickly enough). Here, Dan Hendrycks and Thomas Woodside give a related overview of the dynamic of creative destruction of old methods in machine learning.

Counterargument 5: AI safety must be an established field by the time AGI is near

Hendrycks and Woodside write in their Pragmatic AI Safety sequence:

“Imagine you’re in the late 2000s and care about AI safety. It is very difficult to imagine that you could have developed any techniques or algorithms which would transfer to the present day. However, it might have been possible to develop datasets that would be used far into the future or amass safety researchers which could enable more safety research in the future. For instance, if more people had been focused on safety in 2009, we would likely have many more professors working on it in 2022, which would allow more students to be recruited to work on safety. In general, research ecosystems, safety culture, and datasets survive tsunamis.”

Even if we’re pretty clueless about the best research directions right now, it’s essential that we have a fully functioning field of AI safety, and an industry with a safety-first culture, in place by the time we get dangerously close to building an AGI. Since we don’t know when we’ll hit that point of extreme danger, it’s better to start getting ready now. Currently, we don’t have enough people working on AI safety, as argued in Counterargument 2, and even though commercial incentives might make it impossible for safety to become the field’s principal concern, safety culture can certainly still be improved.

Counterargument 6: Some less capable systems also require safety work

A system doesn’t need to have human-level intelligence to require alignment work. Even if building an AGI system takes a lot longer than expected, safety research can still be useful for improving narrower models. For example, the more we delegate important decisions about policy or our everyday lives to narrow AI systems, the greater our need to interpret how those systems make decisions. Interpretability can both help us spot algorithmic bias and give us a better overview of the considerations behind a system’s decision. In this way, interpretability research can help both with AGI alignment and with making narrower systems safer and more useful.

I also think that a lot of policy work is required to make narrow AIs safe. For example, given that autonomous weapons can lower the cost of going to war, good policies around their use are probably vital for sustaining the long-term decline in the number of wars. And there are probably many other ways alignment work can contribute to making narrow AI systems better that don’t come to mind at the moment. Sure, none of the problems mentioned under this argument are existential risks, but preventing a single war through successful policy would already save thousands of lives.