AI development incentive gradients are not uniformly terrible

Much of the work for this post was done together with Nuño Sempere.

Perhaps you think that your values will be best served if the AGI you (or your team, company or nation) are developing is deployed first. Would you decide that it’s worth cutting a few corners, reducing your safety budget, and pushing ahead to try and get your AI out the door first?

It seems plausible, and worrying, that you might. And if your competitors reason symmetrically, we would get a “safety race to the bottom”.

On the other hand, perhaps you think your values will be better served if your enemy wins than if either of you accidentally produces an unfriendly AI. Would you decide the safety costs to improving your chances aren’t worth it?

In a simple two-player model, you should only shift funds from safety to capabilities if (the relative₁ decrease in chance of friendliness) / (the relative₁ increase in the chance of winning) < (expected relative₂ loss of value if your enemy wins rather than you). Here, the relative₁ increases and decreases are relative to the current values. The relative₂ loss of value is relative to the expected value if you win.

The plan of this post is as follows:

1. Consider a very simple model that leads to a safety race. Identify unrealistic assumptions which are driving its results.
2. Remove some of the unrealistic assumptions and generate a different model. Derive the inequality expressed above.
3. Look at some specific example cases, and see how they affect safety considerations.

A partly discontinuous model

Let’s consider a model with two players with the same amount of resources. Each player’s choice is what fraction of their resources to devote to safety, rather than capabilities. Whichever player contributes more to capabilities wins the race. If you win the race, you either get a good outcome or a bad outcome. Your chance of getting a good outcome increases continuously with the amount you spent on safety. If the other player wins, you get a bad outcome. A good outcome is a finite amount better than any bad outcome.

We can envisage each player as choosing a point along the line going from 0 to 1: this is the fraction of their resources they’re devoting to capabilities. Therefore, the possible pairs of choices can be visualised as the square with sides of length 1. The payoffs for each player are a surface over this square, with a cliff along the line x = y—the player switches from definitely getting a bad outcome to possibly getting a good outcome. I’ve attempted to illustrate one such surface below.

A sample payoff surface

In this picture, all bad outcomes are equally bad (0 payoff). If a player is currently spending enough on capabilities to win, the other player should spend just a little more on capabilities, and so “jump up the cliff” and capture the expected value.

At first, one might think that this behaviour is driven by indifference between unfriendly AI and the other player’s friendly AI. Though I won’t show it here, this is not the case. If we assume that unfriendly AI is equally bad no matter which player deploys it, and the chance of an AI being unfriendly is just a function of the proportion spent on capabilities vs safety, the same result is found.

If not indifference, what is driving this result? The real driving force is that finite always beats infinitesimal. Because of the discontinuity—the “cliff”—an infinitesimal increase in the chance of a finitely worse outcome (unfriendly AI) is more than paid for by a finite increase in the chance of a finitely better outcome (AI aligned with my goals rather than yours).
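As a rough illustration of this ratchet, here is a minimal sketch on a discrete grid. The grid size `N`, the linear chance of a good outcome, and the rule that ties count as a loss are all assumptions made for the example, not part of the model above.

```python
N = 100  # assumed grid: allocation i means a fraction i/N spent on capabilities


def payoff(mine, theirs):
    """Expected payoff in units of 1/N.

    Assumptions: you win only if you spend strictly more on capabilities,
    and your chance of a good outcome after winning is (N - mine) / N.
    Every bad outcome pays 0.
    """
    return N - mine if mine > theirs else 0


def best_response(theirs):
    """Allocation that maximises payoff against a fixed opponent allocation."""
    return max(range(N + 1), key=lambda mine: payoff(mine, theirs))


def ratchet(start=0):
    """Let the players take turns best-responding until no move pays anything."""
    history = [start]
    while payoff(best_response(history[-1]), history[-1]) > 0:
        history.append(best_response(history[-1]))
    return history


history = ratchet()
print(history[:5], "...", history[-1])
```

Each best response sits exactly one grid step above the opponent's last allocation, so the sequence climbs one step at a time until almost nothing is left for safety.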

Letting whichever player spends more on capabilities win is clearly too simple a model: there’s a chance that a less capabilities-focused player would get lucky and deploy AI first. It’s also unrealistic that one can continuously vary the amount one spends on capabilities vs safety. Could you really have a researcher move 0.0000000120000900007% of their effort from safety to capabilities (even granting that they’re equally productive in both fields)?

Below I focus on removing the discontinuity in expected payoff, rather than adding discontinuity in strategy. That is, for any given amount of resources that each player is putting towards capabilities, there is some probability of each deploying first.

Chances & resources

This is the section with maths in, if you want to skip it

Again, focus on a two player game, with players 1 and 2. Assume that each has some fixed amount of resources and can choose their allocation of the resources to capabilities or safety research. Assume that the expected value of unfriendly AI (if it happens) does not depend on these resources or on their allocation, and nor does the expected payoff of friendly AI (if it is achieved). So the AI is either in a fully good scenario or a fully bad scenario, regardless of the resources and allocations. These change the probability of a good outcome or bad outcome, but not their nature.

Let kᵢ be the total resources of player i. Let rᵢ be the fraction of resources player i devotes to capabilities.

Let f(kᵢ, rᵢ, kⱼ, rⱼ) be the probability that a player with resources kᵢ and allocation rᵢ deploys their AI before a player with resources kⱼ and allocation rⱼ. Note we have f(kᵢ, rᵢ, kⱼ, rⱼ) = 1 - f(kⱼ, rⱼ, kᵢ, rᵢ).

Let s(k, r) be the probability that a player with resources k and allocation r deploys a friendly AI if it deploys an AI.

Let’s set the payoff scale so that unfriendly AI is 0. Let the payoff for deploying friendly AI yourself be a. Let the payoff for the other player deploying friendly AI be b. I assume 0 < b < a.

For brevity, I will write f for f(k₁, r₁, k₂, r₂). I will write sᵢ for s(kᵢ, rᵢ).

Primed quantities (e.g. f’) will denote partial derivatives with respect to r₁.

The expected payoff, v, is f * a * s₁ + (1 - f) * b * s₂.

Player 1 should shift resources towards capabilities if the expected payoff increases with change in r₁. That is, if v’ > 0.

Since s₂ does not depend on r₁, and the derivative of (1 - f) is -f’:

v’ = f’ * a * s₁ + f * a * s₁′ - f’ * b * s₂

Rearranging:

a * f * s₁′ > f’ * (s₂ * b - s₁ * a)

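Choosing constants α and β such that b = α * a and s₂ = β * s₁, and dividing both sides by a * s₁ * f’ (which is positive, assuming that spending more on capabilities raises your chance of deploying first), we get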

(f / f’) * (s₁′ / s₁) > α * β - 1

or, multiplying both sides by -1,

-(f / f’) * (s₁′ / s₁) < 1 - α * β

For any function g, call g’/g the relative (instantaneous) increase in g. Also, call -g’/g the relative decrease in g.

α * β is (the expected value gained by player 1 if player 2 wins) divided by (the expected value gained by player 1 if player 1 wins). So call 1 - α * β the relative expected loss if player 2 wins.

Rewriting with our new definitions, we find that player 1 should move resources from safety to capabilities just if

(the relative decrease in chance of friendliness)/(the relative increase in chance of winning) < (the relative expected loss if player 2 wins).
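To see the inequality at work, here is a minimal numerical sketch. The post leaves f and s general, so the specific forms below (a ratio-form contest function for f, and a linearly falling s) are assumptions chosen only for illustration, as are the helper names. The sketch compares the sign of v’ with the inequality at two allocations.

```python
def f(k1, r1, k2, r2):
    """P(player 1 deploys first). Assumed ratio-form contest function."""
    return k1 * r1 / (k1 * r1 + k2 * r2)


def s(k, r):
    """P(friendly | deployed). Assumed to fall linearly in r; ignores k."""
    return 1.0 - r


def derivative(g, x, h=1e-6):
    """Central finite-difference estimate of g'(x)."""
    return (g(x + h) - g(x - h)) / (2 * h)


def payoff_slope(k1, r1, k2, r2, a, b):
    """v' for player 1, where v = f * a * s1 + (1 - f) * b * s2."""
    def v(x):
        win = f(k1, x, k2, r2)
        return win * a * s(k1, x) + (1 - win) * b * s(k2, r2)
    return derivative(v, r1)


def inequality_holds(k1, r1, k2, r2, a, b):
    """(relative decrease in s1) / (relative increase in f) < 1 - alpha * beta."""
    s1 = lambda x: s(k1, x)
    win = lambda x: f(k1, x, k2, r2)
    rel_decrease_friendly = -derivative(s1, r1) / s1(r1)
    rel_increase_win = derivative(win, r1) / win(r1)
    alpha = b / a
    beta = s(k2, r2) / s(k1, r1)
    return rel_decrease_friendly / rel_increase_win < 1 - alpha * beta


# Equal resources, opponent at r2 = 0.5, and b worth half of a (all assumed).
for r1 in (0.2, 0.8):
    args = (1.0, r1, 1.0, 0.5, 1.0, 0.5)
    print(r1, payoff_slope(*args) > 0, inequality_holds(*args))
```

Under these assumed forms the two tests agree at both points: at r₁ = 0.2 the payoff is still rising in r₁ and the inequality holds, while at r₁ = 0.8 neither is true.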

Concrete cases

Against failure

We can model a single player trying to develop AI by some fixed time as playing against nature or failure. If they do not succeed, reality certainly deploys a “friendly AI” that is just “no change”. Our situation continues unaltered. So β is 1, and α is the expected value of our situation continuing as a fraction of the goodness of a positive singularity. Note that this expected value includes the chance of a positive singularity later, just not by the given time.

What drives the value of α is mostly how much more or less likely we think the future is to be able to deploy positive AI. If we think it is likely to be wiser and smarter, then α may even be larger than one, and a single player should only do safety research. If we think there is a chance of instability and unrest in the future, and this is our most stable chance, then a single player may wish to press on.
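A small sketch of how α moves the threshold in this single-player case, with β fixed at 1. The ratio of 0.5 and the three values of α are made-up numbers for illustration only.

```python
def shift_threshold(alpha, beta=1.0):
    """Right-hand side of the inequality: the relative expected loss 1 - alpha * beta."""
    return 1.0 - alpha * beta


def should_shift(rel_decrease_friendly, rel_increase_win, alpha, beta=1.0):
    """True if moving resources from safety to capabilities raises expected payoff."""
    return rel_decrease_friendly / rel_increase_win < shift_threshold(alpha, beta)


# Hypothetical trade-off: each 1% relative gain in the chance of succeeding
# by the deadline costs a 0.5% relative drop in the chance of friendliness.
for alpha in (0.2, 0.8, 1.1):
    print(alpha, should_shift(0.5, 1.0, alpha))
```

With a pessimistic view of the future (α = 0.2) the shift is worth it. With α = 0.8 it is not, and once α exceeds 1 the threshold is negative, so no positive trade-off ratio can justify moving resources to capabilities.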

Of course, in reality, a multipolar scenario is more likely.

If unfriendliness is much, much worse than your enemy winning

If the default outcome is something like an S-Risk, your enemy winning may be essentially indistinguishable from you winning. That is, α is approximately 1, and the right-hand side of the inequality reduces to 1 - β. So you should only move resources to capabilities if (the relative decrease in chance of friendliness)/(the relative increase in chance of winning) < (the relative expected increase in unfriendliness risk if player 2 wins).

Against a cautious player

Suppose the other player is playing it very safe. In fact, they are playing it many times safer than you are. So β is in fact larger than 1. If it is large enough that α * β > 1, the right-hand side of the inequality is negative. The only way for the inequality to hold is then for you to increase your chance of friendliness or decrease your chance of winning: you should shift resources away from capabilities.

Reckless rivals

Suppose the other player is pursuing a no-holds-barred strategy that is very likely to produce an AGI before you and is very likely to produce an unaligned AI. The details here are sensitive to the exact figures and forms of the equations. Below is a simple case.

Say that the chance that you win the race is p, the chance that you produce a friendly AI is q and that the chance they produce a friendly AI if they win is r*q (r < 1).

As you shift resources to capabilities, say that your chance of winning the race climbs linearly from p to 1/2, which it reaches when you have the same allocation as the other player. Say the chance of producing a safe AI drops linearly from q to r * q over the same range.

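You should shift resources to capabilities if (1 - r) * q / (1/2 - p) < (1 - r). Dividing both sides by (1 - r) and multiplying by (1/2 - p), both of which are positive, this becomes q < 1/2 - p. That is, if (p + q) < 1/2. Interestingly, in this case there is no dependence on r.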
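As a quick numerical check of the inequality stated above, the sketch below evaluates it directly for a few made-up values of p, q and r. Cancelling the common factor (1 - r) and clearing the denominator (1/2 - p) leaves q < 1/2 - p, and the loop confirms that the two forms agree whatever r is. It assumes p < 1/2 and r < 1, as in the setup.

```python
def stated_condition(p, q, r):
    """The inequality as stated in the text; assumes p < 1/2 and r < 1."""
    return (1 - r) * q / (0.5 - p) < (1 - r)


def simplified_condition(p, q):
    """The same condition after cancelling (1 - r) and clearing (1/2 - p)."""
    return p + q < 0.5


# Hypothetical (p, q) pairs, each tried against several values of r.
for p, q in ((0.1, 0.3), (0.3, 0.3), (0.2, 0.1)):
    for r in (0.1, 0.5, 0.9):
        assert stated_condition(p, q, r) == simplified_condition(p, q)
    print(p, q, simplified_condition(p, q))
```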

Armstrong, Bostrom & Shulman 2013

After having written this up, I looked at the paper Racing to the Precipice: a model of artificial intelligence development by Armstrong, Bostrom & Shulman.

Their model can be obtained from this one by substituting in the specific values of f and α they consider, generalising to a multipolar scenario, considering a concrete distribution of capabilities and adding in partial observability of others’ capabilities.

Importantly, their f is discontinuous (very similar to that in the first model considered above). When there is uncertainty about others’ capabilities, it plays a somewhat similar role to fuzzing the probabilities of winning.