What Failure Looks Like: Distilling the Discussion

The comments under a post often contain valuable insights and additions. They are also often very long and involved, and harder to cite than the posts themselves. Given this, I was motivated to try to distill some comment sections on LessWrong, in part to start exploring whether we can build some norms and some features to help facilitate this kind of intellectual work more regularly. So this is my attempt to summarise the post and discussion around What Failure Looks Like by Paul Christiano.

Epistemic status: I think I did an okay job. I think I probably made the most errors in places where I try to emphasise concrete details more than the original post did. I think the summary of the discussion is much more concise than the original.

What Failure Looks Like (Summary)

On its default course, our civilization will build very useful and powerful AI systems, and use such systems to run significant parts of society (such as healthcare, legal systems, companies, the military, and more). Just as we are dependent on other novel technologies such as money and the internet, we will be dependent on AI.

The stereotypical AI catastrophe involves a powerful and malicious AI that seems good but suddenly becomes evil and quickly takes over humanity. Such descriptions are often stylised for good story-telling, or emphasise unimportant variables.

The post below will concretely lay out two ways that building powerful AI systems may cause an existential catastrophe, if the problem of intent alignment is not solved. This is solely an attempt to describe what failure looks like, not to assign probabilities to such failure or to propose a plan to avoid these failures.

There are two failure modes to be discussed. First, we may increasingly fail to understand how our AI systems work, and consequently fail to understand what is happening in society. Second, we may eventually give these AI systems massive amounts of power despite not understanding their internal reasoning and decision-making algorithms; because we will be searching through a massive space of designs without understanding them, some of the resulting AIs will be more power-seeking than we expected, and will take adversarial action and seize control.

Failure by loss of control

There is a gap between what we want and what objective functions we can write down. Nobody has yet created a function which, when maximised, perfectly describes what we want, but increasingly powerful machine learning will optimise very hard for whatever function we encode. This will greatly widen the gap between what we can optimise for and what we want. (This is a classic Goodharting scenario.)
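
To make the dynamic concrete, here is a minimal toy sketch of Goodharting (my own illustration, not part of the original post; every name and number in it is an arbitrary assumption): we score candidates with an imperfect proxy of a hidden “true” objective, and the harder we optimise the proxy, the further the outcome we actually get falls behind the outcome the same amount of search could have found.

```python
import numpy as np

# Toy Goodharting sketch (illustrative only, not from the original post).
# We "want" true_w . x, but we can only write down and optimise the proxy
# proxy_p . x, where proxy_p is a noisy, imperfect copy of true_w.
rng = np.random.default_rng(0)
dim = 50
true_w = rng.normal(size=dim)                       # what we actually want
proxy_p = true_w + rng.normal(scale=0.5, size=dim)  # the metric we can encode

for samples in (10, 1_000, 100_000):
    # More samples = more optimisation pressure applied to the proxy.
    candidates = rng.normal(size=(samples, dim))
    chosen = candidates[np.argmax(candidates @ proxy_p)]  # best by the proxy
    ideal = candidates[np.argmax(candidates @ true_w)]    # best we wanted
    print(f"{samples:>7} candidates: got {chosen @ true_w:6.2f} "
          f"of a possible {ideal @ true_w:6.2f} true value")
```

In this toy setup, the proxy-optimal choice tends to fall further behind the best available outcome as the search gets stronger, which is the shape of the divergence described below.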

Concretely, we will gradually use ML to perform more and more key functions in society, but will largely not understand how these systems work or what exactly they're doing. The information we can gather will seem strongly positive: GDP will be rising quickly, crime will be down, life-satisfaction ratings will be up, Congress's approval will be up, and so on.

However, the underlying reality will increasingly diverge from what we think these metrics are measuring, and we may no longer have the ability to independently figure this out. In fact, new things won't be built, crime will continue, people's lives will be miserable, and Congress will not be effective at improving governance; we'll believe otherwise only because the ML systems will be improving our metrics, and we will have a hard time understanding what's going on beyond what they report.

Gradually, our civilization will lose its ability to understand what is happening in the world, as our systems and infrastructure show us success on all of our available metrics (GDP, wealth, crime, health, self-reported happiness, etc.), and in the end we will no longer be in control at all. Giving up this new technology would be analogous to living like a Quaker today (but more extreme), which the vast majority of people do not even consider.

In this world, the future will be managed by a system that we do not understand, that is designed to show us that everything is going well, and that is integral to all parts of society. This is not a new problem (the economy and the stock market are already quite confusing), but it is one that can be massively exacerbated by machine learning.

This gradually moves toward a world where civilization is run using tools that seem positive, but whose effects we don't really understand, with no way out of this state of affairs.

Failure by enemy action

Machine learning systems not only run computations we don't understand, but entire agents too. The focus in this failure mode is on agents that seek to gain influence in the world – agents that we are not fully aware we have built.

As above, we will gradually use machine learning to perform more and more key functions in society, and will largely not understand how these systems work or what algorithms they're running.

Despite this lack of understanding, we will put very competent systems with goals different from ours (encoded in many different objective functions) in charge of much of the economy and civilization (law, health, governance, industry, etc.).

We generally don't understand much of what happens inside ML systems; all we know is that massive numbers of types of cognition and policies are being built and tested as we train such systems. We know that such systems may create various subsystems optimising for different goals (analogous to how evolution created humans, who do not primarily care about self-replication but about many other things, like art and adventure), but we do not know of a way to ensure that deceptive, influence-seeking agents are not among them. As long as this is true – as long as we don't understand how to ensure they are not instantiated in our ML systems – they will eventually be built 'by default'.

Concretely, this looks like us building systems that are very competent and useful – they inform policy, provide healthcare, serve as military units, give people life advice, and so on – even though we don't understand their internals. But they will sometimes blatantly fail: an automated corporation may just take the money and run, and a law-enforcement system may start seizing resources and try to defend itself from being discovered or decommissioned. There are many adversarial forces already present in the world (feral animals, corrupt bureaucrats, companies obsessed with growth, psychopaths, etc.), and to these we will add certain classes of powerful ML systems.

We'll likely solve small versions of this problem, preventing any medium-sized failures. But as automation grows, we will reach a point where we could not recover from a correlated automation failure, and at that point the incentives on the automated agents will be very different. During a period of heightened vulnerability – a conflict between states, a natural disaster, a serious cyberattack, etc. – we will see a cascading series of automation failures, where we suddenly and critically rely on systems behaving off-distribution and have to allow them substantial influence over the world, without retaining the ability to course-correct afterwards.

After the disaster, we will realise that we have a number of powerful influence-seeking systems that are sophisticated enough that we probably cannot get rid of them.

This is the classic Bostromian scenario of an AI taking a treacherous turn into explicit power-grabbing behaviour: the AI takes control of civilization and its trajectory, with no way for humanity to retake control. While the increase in capabilities will be gradual, the turn itself will be fairly sudden and binary, with no clear signalling in the lead-up.

The key distinction between this and the prior failure mode is that here the systems take clear adversarial action to seize control, as opposed to just trying to do their jobs very well. At the point where the ML systems are in a position of sufficient power and control to stop obeying us, they will take overt power-grabbing action, taking power and control directly out of the hands of human civilization.

This doesn't depend on a very specific story of how a given AI system is built, only on the fact that it is searching a very large space of algorithms that includes adversarial agents.

Further Discussion (Summary)

This is not a comprehensive list of AI risks

The scenarios above are failures that arise from not solving the intent alignment problem, the problem of ensuring that the AIs we build are “trying to do what their creators want”.

But this is not a comprehensive list of the existential risks to which AI may contribute. Two other key problems associated with AI are:

  • We may build powerful AI weaponry that we use to directly end civilization, such as nations corrupting each other's news sources to the point of collapse, or nations engaging in war using advanced weapons such as drones to the point of extinction.

  • We may not use powerful AI weaponry to end civilization directly, but instead use it to directly instantiate end states for humanity that, on reflection, are very limited in value. This could look like building a universe of orgasmium, or something else that reflects a confused ethics and is consequently pretty meaningless.

This was discussed in comments by Wei Dai and Paul Christiano.

Parts of the stereotypical AI movie story may still apply

Not all parts of the stereotypical movie story are necessarily mistaken. For example, it is an open question whether failure involving robot armies committing genocide is a likely outcome of the two failure modes discussed above.

Failure from loss of control. It is unclear whether the failures from loss of control involve mass military action – that is, whether avoiding it is one of the simple metrics that we can optimise for whilst still losing control, or whether our simple metrics will themselves be optimised around. For example, if the simple metric is “our security cameras show no death or murder”, it is unclear whether we genuinely succeed on this basic metric yet still lose, or whether the AI fabricates a fake feed for the video cameras and then does whatever it wants with the people in real life.

Failure from enemy action. Advanced AI may be used to help run legal and military systems, which could be used by an adversarial actor to commit genocide. There are many ways to take over the world that don't involve straightforward mass murder, so this is not a necessary part of the initial point of failure, but either way a treacherous turn likely results in the swift end of humanity soon after, and this is a plausible mechanism.

This was discussed in comments by Carl Shulman and Paul Christiano.

Multipolar outcomes are possible

In a failure by enemy action, the precise details of the interactions between the different AI systems running society are unclear. They may attempt to cooperate or enter into conflict, potentially leading to a multipolar outcome, and this can change the dynamics by which power-grabbing occurs.

This was discussed in comments by Richard Ngo and Wei Dai.

These failure modes are very general and may already be occurring

The two scenarios above do not depend on a detailed story of how AI systems work, and are problems that apply generally in the world. It is an interesting and open question to what extent civilization is currently gradually collapsing due to the problems stated above.

This was discussed in a comment by Zvi Mowshowitz.

A faster takeoff may emphasise different failures

This is specifically a description of failure in a world where there is a slow takeoff, whereby AI capabilities rise gradually – more specifically, where there is not a “one-year doubling of the world economy before there's been a four-year doubling”. A faster takeoff may see other failures.

This was discussed in comments by Buck Shlegeris and Ben Pace.

We can get more evidence and data from history

Further evidence about these scenarios can come from analysing how previous technological progress has affected these variables. How much has loss of control been a problem for technology historically? How much has ML faced such problems so far? Answering such questions would help gather together the evidence on these topics, to help figure out the likelihood of these scenarios.

This was discussed in a post by Grue_Slinky.

Discussion not yet summarised

If people comment with summaries that seem good to me, I will add them to the post. I may also accept edits to the above, as I didn't spend as much time on it as I'd like, so there will be improvements to be made (though I don't promise to spend a lot of time negotiating over edits).