AI development incentive gradients are not uniformly terrible

Much of the work for this post was done together with Nuño Sempere

Perhaps you think that your values will be best served if the AGI you (or your team, company or nation) are developing is deployed first. Would you decide that it's worth cutting a few corners, reducing your safety budget, and pushing ahead to try and get your AI out the door first?

It seems plausible, and worrying, that you might. And if your competitors reason symmetrically, we would get a “safety race to the bottom”.

On the other hand, perhaps you think your values will be better served if your enemy wins than if either of you accidentally produces an unfriendly AI. Would you decide that the safety costs of improving your chances aren't worth it?

In a simple two-player model, you should only shift funds from safety to capabilities if (the relative₁ decrease in the chance of friendliness) / (the relative₁ increase in the chance of winning) < (the expected relative₂ loss of value if your enemy wins rather than you). Here, the relative₁ increases and decreases are relative to the current values. The relative₂ loss of value is relative to the expected value if you win.
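
To make the rule concrete, here is a tiny worked example in Python with made-up numbers (these are illustrative assumptions, not estimates from anywhere):

```python
# A quick numeric illustration of the decision rule, with made-up numbers.
rel_decrease_in_friendliness = 0.10  # shifting funds cuts P(friendly | you win) by 10% of its value
rel_increase_in_winning = 0.20       # ...and raises P(you win) by 20% of its value
rel_loss_if_enemy_wins = 0.60        # the expected value if your enemy wins is 40% of the
                                     # expected value if you win, so the relative loss is 60%

ratio = rel_decrease_in_friendliness / rel_increase_in_winning  # 0.5
if ratio < rel_loss_if_enemy_wins:
    print("Shift funds from safety to capabilities")  # fires here, since 0.5 < 0.6
else:
    print("Keep funds in safety")
```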

The plan of this post is as follows:

1. Consider a very simple model that leads to a safety race. Identify unrealistic assumptions which are driving its results.
2. Remove some of the unrealistic assumptions and generate a different model. Derive the inequality expressed above.
3. Look at some specific example cases, and see how they affect safety considerations.

A partly discontinuous model

Let's consider a model with two players with the same amount of resources. Each player's choice is what fraction of their resources to devote to safety rather than capabilities. Whichever player contributes more to capabilities wins the race. If you win the race, you get either a good outcome or a bad outcome. Your chance of getting a good outcome increases continuously with the amount you spent on safety. If the other player wins, you get a bad outcome. A good outcome is a finite amount better than any bad outcome.

We can envisage each player as choosing a point along the line going from 0 to 1: this is the fraction of their resources they're devoting to capabilities. Therefore, the possible pairs of choices can be visualised as the square with sides of length 1. The payoffs for each player form a surface over this square, with a cliff along the line x = y, where the player switches from definitely getting a bad outcome to possibly getting a good outcome. I've attempted to illustrate one such surface below.

A sample payoff surface

In this picture, all bad outcomes are equally bad (0 payoff). If a player is currently spending enough on capabilities to win, the other player should spend just a little more on capabilities, and so “jump up the cliff” and capture the expected value.
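
Here is a minimal sketch of that cliff. The specific numbers are illustrative assumptions (the winner's AI is friendly with probability equal to their safety fraction, and a friendly AI of your own pays 1), not part of the model itself:

```python
# Toy version of the discontinuous model: whoever devotes more to capabilities wins,
# losing (or an unfriendly AI) pays 0, and a friendly AI of your own pays 1.
def expected_payoff(x: float, y: float) -> float:
    """Player 1's expected payoff when spending fraction x on capabilities vs the opponent's y."""
    if x > y:                 # player 1 wins the race
        return 1.0 * (1 - x)  # P(friendly) assumed equal to the safety fraction 1 - x
    return 0.0                # opponent wins: a bad outcome for player 1 in this picture

y = 0.5  # opponent's capabilities fraction
print(expected_payoff(0.49, y))  # 0.0   -- just below the cliff
print(expected_payoff(0.51, y))  # 0.49  -- nudging past the opponent jumps up the cliff
```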

At first, one might think that this behaviour is driven by indifference between unfriendly AI and the other player's friendly AI. Though I won't show it here, this is not the case. If we assume that unfriendly AI is equally bad no matter which player deploys it, and the chance of an AI being unfriendly is just a function of the proportion spent on capabilities vs safety, the same result is found.

If not indifference, what is driving this result? The real driving force is that finite always beats infinitesimal. Because of the discontinuity (the “cliff”), an infinitesimal increase in the chance of a finitely worse outcome (unfriendly AI) is more than paid for by a finite increase in the chance of a finitely better outcome (AI aligned with my goals rather than yours).

Letting whichever player spends more on capabilities win is clearly too simple a model: there's a chance that a less capabilities-focused player would get lucky and deploy AI first. It's also unrealistic that one can continuously vary the amount one spends on capabilities vs safety. Could you really have a researcher move 0.0000000120000900007% of their effort from safety to capabilities (even granting that they're equally productive in both fields)?

Below I focus on removing the discontinuity in expected payoff, rather than adding discontinuity in strategy. That is, for any given amount of resources that each player is putting towards capabilities, there is some probability of each deploying first.

Chances & resources

This is the section with maths in, if you want to skip it

Again, focus on a two-player game, with players 1 and 2. Assume that each has some fixed amount of resources and can choose their allocation of those resources to capabilities or safety research. Assume that the expected value of unfriendly AI (if it happens) does not depend on these resources or on their allocation, and nor does the expected payoff of friendly AI (if it is achieved). So we end up in either a fully good scenario or a fully bad scenario, regardless of the resources and allocations. These change the probability of a good or bad outcome, but not their nature.

Let kᵢ be the total resources of player i. Let rᵢ be the fraction of resources player i devotes to capabilities.

Let f(kᵢ, rᵢ, kⱼ, rⱼ) be the probability that a player with resources kᵢ and allocation rᵢ deploys their AI before a player with resources kⱼ and allocation rⱼ. Note that we have f(kᵢ, rᵢ, kⱼ, rⱼ) = 1 - f(kⱼ, rⱼ, kᵢ, rᵢ).

Let s(k, r) be the probability that a player with resources k and allocation r deploys a friendly AI if it deploys an AI.

Let's set the payoff scale so that unfriendly AI is 0. Let the payoff for deploying friendly AI yourself be a. Let the payoff for the other player deploying friendly AI be b. I assume 0 < b < a.

For brevity, I will write f for f(k₁, r₁, k₂, r₂). I will write sᵢ for s(kᵢ, rᵢ).

Primed quantities (e.g. f') will denote partial derivatives with respect to r₁.

The expected payoff for player 1, v, is f * a * s₁ + (1 - f) * b * s₂.
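
As a sketch of this setup, here it is in code. The specific functional forms for f and s below are illustrative assumptions only (winning chance proportional to capability spending, friendliness chance equal to the safety fraction); the model above does not commit to them:

```python
def f(k1, r1, k2, r2):
    """P(the player with resources k1 and allocation r1 deploys before the k2, r2 player).
    Illustrative contest form: chance of deploying first proportional to capability spending."""
    return (k1 * r1) / (k1 * r1 + k2 * r2)

def s(k, r):
    """P(the deployed AI is friendly) -- here simply the safety fraction, for illustration."""
    return 1 - r

def v(r1, r2, k1=1.0, k2=1.0, a=1.0, b=0.5):
    """Player 1's expected payoff: f * a * s1 + (1 - f) * b * s2 (unfriendly AI pays 0)."""
    win = f(k1, r1, k2, r2)
    return win * a * s(k1, r1) + (1 - win) * b * s(k2, r2)

print(v(r1=0.3, r2=0.5))  # player 1's expected payoff at these example allocations
```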

Player 1 should shift resources towards capabilities if the expected payoff increases as r₁ increases. That is, if v' > 0.

v' = f' * a * s₁ + f * a * s₁' - f' * b * s₂ (there is no s₂' term, since s₂ does not depend on r₁)

Rearranging, the condition v' > 0 becomes:

a * f * s₁' > f' * (s₂ * b - s₁ * a)

Choosing constants α and β such that b = α * a and s₂ = β * s₁, and dividing both sides by a * s₁ * f' (assuming f' > 0, i.e. spending more on capabilities increases your chance of winning, so the direction of the inequality is preserved), we get

(f / f') * (s₁' / s₁) > α * β − 1

or, multiplying both sides by −1,

-(f / f') * (s₁' / s₁) < 1 - α * β
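
For readers who want to double-check the algebra, here is a quick symbolic verification (using sympy) that these rearrangements really are equivalent to v' > 0, given that a, f, f', s₁ and s₂ are positive:

```python
import sympy as sp

# a, b, f, f', s1, s2 are taken positive, so dividing by a*s1*f' preserves the
# direction of the inequality.  s1' can have either sign.
a, b, f, fp, s1, s2 = sp.symbols("a b f fp s1 s2", positive=True)
s1p = sp.Symbol("s1p", real=True)
alpha, beta = b / a, s2 / s1

vp = fp * a * s1 + f * a * s1p - fp * b * s2      # v' as written above

# v' is identically (a*f*s1') - f'*(s2*b - s1*a), so v' > 0 is the same
# statement as the first rearranged inequality.
print(sp.simplify(vp - (a * f * s1p - fp * (s2 * b - s1 * a))))          # 0

# Dividing that difference of sides by the positive quantity a*s1*f' gives
# exactly (f/f')*(s1'/s1) - (alpha*beta - 1), i.e. the next inequality.
print(sp.simplify((a * f * s1p - fp * (s2 * b - s1 * a)) / (a * s1 * fp)
                  - ((f / fp) * (s1p / s1) - (alpha * beta - 1))))       # 0
```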

For any function g, call g'/g the relative (instantaneous) increase in g. Also, call -g'/g the relative decrease in g.

α * β is (the expected value gained by player 1 if player 2 wins) divided by (the expected value gained by player 1 if player 1 wins), i.e. (b * s₂) / (a * s₁). So call 1 - α * β the relative expected loss if player 2 wins.

Rewriting with our new definitions, we have that player 1 should move resources from safety to capabilities just if

(the relative decrease in the chance of friendliness) / (the relative increase in the chance of winning) < (the relative expected loss if player 2 wins).
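
As a sanity check, here is a small numeric sketch confirming that this criterion agrees with the sign of v'. The functional forms for f and s are again illustrative assumptions, similar to the sketch above:

```python
def f(r1, r2):   # chance player 1 deploys first (illustrative form)
    return r1 / (r1 + r2)

def s(r):        # chance the deployed AI is friendly (illustrative form)
    return 1 - r

def v(r1, r2, a=1.0, b=0.5):
    return f(r1, r2) * a * s(r1) + (1 - f(r1, r2)) * b * s(r2)

def d(g, x, h=1e-6):   # central finite difference
    return (g(x + h) - g(x - h)) / (2 * h)

a, b, r2 = 1.0, 0.5, 0.5
for r1 in (0.1, 0.3):
    f_prime, s_prime = d(lambda r: f(r, r2), r1), d(s, r1)
    lhs = (-s_prime / s(r1)) / (f_prime / f(r1, r2))  # relative decrease / relative increase
    rhs = 1 - (b / a) * (s(r2) / s(r1))               # relative expected loss if player 2 wins
    v_prime = d(lambda r: v(r, r2, a, b), r1)
    print(lhs < rhs, v_prime > 0)                     # the two booleans should match
```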

Concrete cases

Against failure

We can model a single player trying to develop AI by some fixed time as playing against nature, or against failure. If they do not succeed, reality certainly deploys a “friendly AI” that is just “no change”: our situation continues unaltered. So s₂ is 1 (and hence β is 1/s₁), and α is the expected value of our situation continuing, as a fraction of the goodness of a positive singularity. Note that this expected value includes the chance of a positive singularity later, just not by the given time.

What drives the value of α is mostly how much more or less likely we think the future is to be able to deploy positive AI. If we think it is likely to be wiser and smarter, then α may even be larger than one, and a single player should only do safety research. If we think there is a chance of instability and unrest in the future, and this is our most stable chance, then a single player may wish to press on.

Of course, in reality, a multipolar scenario is more likely.

If unfriendliness is much, much worse than your enemy winning

If the unfriendly outcome is something like an S-risk, your enemy winning may be essentially indistinguishable from you winning. That is, α is approximately 1. So you should only move resources to capabilities if (the relative decrease in the chance of friendliness) / (the relative increase in the chance of winning) < (the relative expected increase in unfriendliness risk if player 2 wins, which here is just 1 - β).

Against a cautious player

Suppose the other player is playing it very safe. In fact, they are playing it so much safer than you that β exceeds 1/α (equivalently, α * β > 1), and the right-hand side of the inequality is negative. The only way for the inequality to hold would be if shifting towards capabilities somehow increased your chance of friendliness or decreased your chance of winning: you should shift resources away from capabilities.

Reckless Rivals

Suppose the other player is pursuing a no-holds-barred strategy that is very likely to produce an AGI before you and is very likely to produce an unaligned AI. The details here are sensitive to the exact figures and forms of the equations. Below is a simple case.

Say that the chance that you win the race is p, the chance that you produce a friendly AI (if you win) is q, and the chance they produce a friendly AI if they win is r * q (with r < 1).

As you shift resources to capabilities, say that your chance of winning the race climbs linearly from p to 1/2 at the point where you have the same allocation as them. Say the chance of producing a safe AI drops linearly from q to r * q over the same range.

You should shift resources to capabilities if (1 - r) * q / (1/2 - p) < (1 - r) (taking α ≈ 1, as in the S-risk case above, so that the right-hand side is 1 - α * β ≈ 1 - r). That is, if q < 1/2 - p. Interestingly, in this case there is no dependence on r.
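
A quick numeric sketch of this case, evaluating the condition exactly as stated, with made-up values of p, q and r:

```python
# Made-up example numbers for the reckless-rival case.
p, q, r = 0.2, 0.4, 0.5   # your win chance, your friendliness chance, their relative friendliness

lhs = (1 - r) * q / (0.5 - p)   # the left-hand side as stated above
rhs = 1 - r                     # the right-hand side as stated above
print(lhs < rhs)                # should you shift resources to capabilities?
print(q < 0.5 - p)              # the simplified condition; agrees with the line above
```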

Armstrong, Bostrom & Shulman 2013

After having written this up, I looked at the paper Racing to the Precipice: a model of artificial intelligence development by Armstrong, Bostrom & Shulman.

Their model can be obtained from this one by substituting in the specific values of f and α they consider, generalising to a multipolar scenario, considering a concrete distribution of capabilities, and adding in partial observability of others' capabilities.

Importantly, their f is discontinuous (very similar to that in the first model considered above). When there is uncertainty about others' capabilities, that uncertainty plays a somewhat similar role to fuzzing the probabilities of winning.