Safety-capabilities tradeoff dials are inevitable in AGI

A safety-capabilities tradeoff is when you have something like a dial on your AGI, and one end of the dial says “more safe but less capable”, and the other end of the dial says “less safe but more capable”.

Make no mistake: safety-capabilities tradeoff dials stink. But I want to argue that they are inevitable, and we’d better get used to them. I will argue that the discussion should be framed as “Just how problematic is this dial? How do we minimize its negative impact?”, not “This particular approach has a dial, so it’s automatically doomed. Let’s throw it out and talk about something else instead.”

(Recent examples of the latter attitude, at least arguably: here, here.)

1. Background (if it’s not obvious): why do safety-capabilities tradeoff dials stink?

The biggest problem is that the world has ambitious people and competitive people and optimistic people and people desperate to solve problems (like climate change, or the fact that their child is dying of cancer, or the fact that their company is about to go bust and they’ll have to lay off thousands of employees who are depending on them, etc. etc.). Also, we don’t have research liability insurance that internalizes the societal cost of catastrophic accidents. And even if we did, I don’t think that would work for catastrophes so bad that no humans would be left to sue the insurer, or so weird and new that nobody knows how to price them, or if the insurance is so expensive that AGI research shifts to a country where insurance is not required, or shifts underground, etc.

So I expect that, by default, some subset of people will set their dial to “less safe but more capable”, and then eventually we get catastrophic accidents, with out-of-control AGIs self-replicating around the internet and all that.

2. Why do I say that these dials are inevitable?

Here are a few examples.

  • Pre-deployment testing. Will pre-deployment testing help with safety? Yeah, duh! (It won’t catch every problem, but I think it’s undeniable that it would help more than zero.) And will pre-deployment testing hurt capabilities? You bet! The time and money and compute and personnel on the Sandbox Testing Team are time and money and compute and personnel that are not training the next generation of more powerful AGIs. Well, that’s an oversimplification: some amount of sandbox testing helps capabilities, by helping the team better understand how things are going. But there’s an optimal amount of sandbox testing for capabilities, and any testing beyond that point is a safety-capabilities tradeoff.

  • Humans in the loop. If we run the AGI without ever pausing to study and test it, and allow it to execute plans that we humans do not understand (or even cannot understand), it will be able to accomplish things faster, but less safely.

  • Resources. Giving an AGI things like full unfiltered internet access, money, extra computers to use, and so on will make it more capable, but also less safe (since an alignment failure would more quickly and reliably turn into an unrecoverable catastrophe).

  • Following human norms and laws. Presumably we want people to try to make AGIs that follow the letter and spirit of all laws and social customs (cf. “Following Human Norms”). There’s a sliding scale of how conservative the AGI might be in those respects. Higher conservatism reduces the risk of dangerous and undesired behavior, but also prevents the AGI from taking some actions that would have been very good and effective. So this is yet another safety-capabilities tradeoff.

3. A better way to think about it: “alignment tax”

I like the “alignment tax” framing (the terminology comes from Eliezer Yudkowsky via Paul Christiano). (Maybe it would have been slightly better to call it “safety tax” in the context of this post, but whatever.)

  • If there’s no tradeoff whatsoever between safety and capabilities, that would be an “alignment tax” of 0%—the best case.

  • If we know how to make unsafe AGIs but we don’t know how to make safe AGIs, that would be an “alignment tax” of infinity%—the worst case.
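
To put an illustrative number on this (my own formalization, just to pin down the two endpoints above; the symbols are placeholders, not standard notation): let $C_{\text{unsafe}}$ be the capability level attainable with the dial at “less safe but more capable”, and $C_{\text{safe}}$ the level attainable with the dial at “safe”. Then one way to define the tax is

$$\text{alignment tax} \;=\; \frac{C_{\text{unsafe}} - C_{\text{safe}}}{C_{\text{safe}}} \times 100\%,$$

which is 0% when there is no tradeoff at all ($C_{\text{safe}} = C_{\text{unsafe}}$), and goes to infinity% when we know how to make unsafe AGIs but not safe ones ($C_{\text{safe}} \to 0$).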

So if we’re talking about a proposal that entails a safety-capabilities tradeoff dial, the first questions I would ask are:

  • Does this safety proposal entail an alignment tax of infinity%? That is, are there plans / ideas that an unsafe AI could come up with and execute, but a safe AI can’t, not even with extra time / money / compute / whatever?

  • If so, would the capability ceiling for “safe AI” be so low that the safe AI is not even up to the task of doing superhuman AGI safety research? (i.e., is the “safe AI” inadequate for use in a “bootstrapping” approach to safe powerful AGI?)

If yes to both, that’s VERY BAD. If any safety proposal is like that, it goes straight into the garbage!

Then the next question to ask is: given that the alignment tax is less than infinity (a good start), just how high or low is it? There’s no longer a single right answer here; it’s just “the lower, the better”.
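
For concreteness, here’s a minimal sketch (in Python) of how those questions combine into a triage rule. The function, its inputs, and the returned strings are all hypothetical, my own illustration of the logic above rather than any real evaluation procedure:

```python
import math

def triage_proposal(alignment_tax: float, safe_ai_can_bootstrap: bool) -> str:
    """Toy triage of a safety proposal, following the two questions above.

    alignment_tax: the tax as a fraction, e.g. 0.0 for "no tradeoff at all"
        (best case) or math.inf for "an unsafe AI can do things the safe AI
        cannot do at any price in time / money / compute" (worst case).
    safe_ai_can_bootstrap: whether the capped "safe AI" is still capable
        enough to do superhuman AGI safety research.
    """
    if math.isinf(alignment_tax):
        if not safe_ai_can_bootstrap:
            # Yes to both questions: straight into the garbage.
            return "discard"
        # Infinite tax, but the safe AI might still bootstrap something better.
        return "capability-capped, but possibly usable for bootstrapping"
    # Finite tax: no pass/fail threshold, just "the lower the better".
    return f"viable in principle; push the tax down (currently {alignment_tax:.0%})"
```

Real proposals obviously won’t come with a single scalar tax attached; the point is just that “infinite tax and no bootstrapping” is the only outright disqualifier, and everything else is a matter of degree.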

After all, we would like a world of perfect compliance, where all relevant actors are always paying the alignment tax in order to set their dial to “safe”. I don’t think this is utterly impossible. There are three things that can work together to make it happen:

  • As-low-as-possible alignment tax

  • People’s selfish desire for their own AGIs to stay safe and under control.

  • _____________ [this bullet point is deliberately left blank, to be filled in by some coordination mechanism / legal framework / something-or-other that the AGI strategy / governance folks will hopefully eventually come up with!!]

  • Updated to add: For that last one, it would also help if the settings of the important safety-capabilities tradeoff dials were somehow externally legible / auditable by third parties. (Thanks to Koen Holtman in the comments.)

(Thanks Alex Turner & Miranda Dixon-Luinenburg for comments on a draft.)