What failure looks like

The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.

I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:

  • Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. (“Going out with a whimper.”)

  • Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. (“Going out with a bang,” an instance of optimization daemons.)

I think these are the most important problems if we fail to solve intent alignment.

In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.

With fast enough takeoff, my expectations start to look more like the caricature—this post envisions reasonably broad deployment of AI, which becomes less and less likely as things get faster. I think the basic problems are still essentially the same though, just occurring within an AI lab rather than across the world.

(None of the concerns in this post are novel.)

Part I: You get what you measure

If I want to convince Bob to vote for Alice, I can experiment with many different persuasion strategies and see which ones work. Or I can build good predictive models of Bob’s behavior and then search for actions that will lead him to vote for Alice. These are powerful techniques for achieving any goal that can be easily measured over short time periods.

But if I want to help Bob figure out whether he should vote for Alice—whether voting for Alice would ultimately help create the kind of society he wants—that can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. We still need to use data in order to improve over time, but we need to understand how to update on new data in order to improve.

Some examples of easy-to-measure vs. hard-to-measure goals:

  • Persuading me, vs. helping me figure out what’s true. (Thanks to Wei Dai for making this example crisp.)

  • Reducing my feeling of uncertainty, vs. increasing my knowledge about the world.

  • Improving my reported life satisfaction, vs. actually helping me live a good life.

  • Reducing reported crimes, vs. actually preventing crime.

  • Increasing my wealth on paper, vs. increasing my effective control over resources.

It’s already much easier to pursue easy-to-measure goals, but machine learning will widen the gap by letting us try a huge number of possible strategies and search over massive spaces of possible actions. That force will combine with and amplify existing institutional and social dynamics that already favor easily-measured goals.
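
To make this concrete, here is a minimal toy sketch of the “you get what you measure” dynamic (my own illustration with made-up numbers, not part of the original argument): each candidate strategy splits its effort between genuinely helping and gaming the measurement, and we simply deploy whichever strategy scores best on the measurement.

    import random

    random.seed(0)

    # Toy model (hypothetical numbers): each strategy splits one unit of effort
    # between genuinely helping (hard to measure well) and gaming the metric
    # (which reliably moves the measurement but hurts the real goal, e.g. manipulation).
    def random_strategy():
        gaming = random.random()              # effort spent gaming the metric
        helping = 1.0 - gaming                # effort spent on the real goal
        true_value = helping - 0.5 * gaming   # gaming actively harms what we care about
        measured = 0.5 * helping + 2.0 * gaming + random.gauss(0, 0.2)  # the proxy we optimize
        return measured, true_value

    def optimize(search_budget):
        """Try `search_budget` random strategies and deploy the best *measured* one."""
        return max((random_strategy() for _ in range(search_budget)), key=lambda s: s[0])

    for budget in (1, 100, 100_000):  # more powerful ML = more strategies tried
        measured, true_value = optimize(budget)
        print(f"search budget {budget:>6}: measured score {measured:5.2f}, true value {true_value:5.2f}")

With a search budget of one we deploy whatever we happen to have, which is roughly as helpful as it looks; with a large budget the winner is almost pure metric-gaming, so the measured score keeps climbing while the true value goes negative. The proxy and the goal come apart precisely because the search got stronger.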

Right now humans thinking and talking about the future they want to create are a powerful force that is able to steer our trajectory. But over time human reasoning will become weaker and weaker compared to new forms of reasoning honed by trial-and-error. Eventually our society’s trajectory will be determined by powerful optimization with easily-measurable goals rather than by human intentions about the future.

We will try to harness this power by constructing proxies for what we care about, but over time those proxies will come apart:

  • Corporations will deliver value to consumers as measured by profit. Eventually this mostly means manipulating consumers, capturing regulators, and engaging in extortion and theft.

  • Investors will “own” shares of increasingly profitable corporations, and will sometimes try to use their profits to affect the world. Eventually, instead of actually having an impact, they will be surrounded by advisors who manipulate them into thinking they’ve had an impact.

  • Law enforcement will drive down complaints and increase reported sense of security. Eventually this will be driven by creating a false sense of security, hiding information about law enforcement failures, suppressing complaints, and coercing and manipulating citizens.

  • Legislation may be optimized to seem like it is addressing real problems and helping constituents. Eventually that will be achieved by undermining our ability to actually perceive problems and constructing increasingly convincing narratives about where the world is going and what’s important.

For a while we will be able to overcome these problems by recognizing them, improving the proxies, and imposing ad-hoc restrictions that avoid manipulation or abuse. But as the system becomes more complex, that job itself becomes too challenging for human reasoning to solve directly and requires its own trial and error; at the meta-level, the process continues to pursue some easily measured objective (potentially over longer timescales). Eventually large-scale attempts to fix the problem are themselves opposed by the collective optimization of millions of optimizers pursuing simple goals.

As this world goes off the rails, there may not be any discrete point where consensus recognizes that things have gone off the rails.

Amongst the broader population, many people already have a vague picture of the overall trajectory of the world and a vague sense that something has gone wrong. There may be significant populist pushes for reform, but in general these won’t be well-directed. Some states may really put on the brakes, but they will rapidly fall behind economically and militarily, and indeed “appear to be prosperous” is one of the easily-measured goals for which the incomprehensible system is optimizing.

Amongst intellectual elites there will be genuine ambiguity and uncertainty about whether the current state of affairs is good or bad. People really will be getting richer for a while. Over the short term, the forces gradually wresting control from humans do not look so different from (e.g.) corporate lobbying against the public interest, or principal-agent problems in human institutions. There will be legitimate arguments about whether the implicit long-term purposes being pursued by AI systems are really so much worse than the long-term purposes that would be pursued by the shareholders of public companies or corrupt officials.

We might describe the result as “going out with a whimper.” Human reasoning gradually stops being able to compete with sophisticated, systematized manipulation and deception which is continuously improving by trial and error; human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory. By the time we spread through the stars our current values are just one of many forces in the world, not even a particularly strong one.

Part II: influence-seeking behavior is scary

There are some possible patterns that want to seek and expand their own influence—organisms, corrupt bureaucrats, companies obsessed with growth. If such patterns appear, they will tend to increase their own influence and so can come to dominate the behavior of large complex systems unless there is competition or a successful effort to suppress them.

Modern ML instantiates massive numbers of cognitive policies, and then further refines (and ultimately deploys) whatever policies perform well according to some training objective. If progress continues, eventually machine learning will probably produce systems that have a detailed understanding of the world and are able to adapt their behavior in order to achieve specific goals.

Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.
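
As a rough illustration of why selecting on the training objective doesn’t screen such policies out, here is a toy simulation (my own sketch, with made-up numbers and a deliberately simplistic notion of a “policy”): influence-seekers that treat good training performance as an instrumental strategy draw their scores from essentially the same distribution as the intended policies, so the top scorers contain influence-seekers at roughly the base rate.

    import random

    random.seed(0)

    # Toy selection model (hypothetical): a policy is either "intended" or
    # "influence-seeking". Sophisticated influence-seekers deliberately perform
    # well during training, so both kinds draw scores from the same distribution.
    def sample_policy():
        kind = "influence-seeking" if random.random() < 0.3 else "intended"
        score = random.gauss(1.0, 0.1)   # training performance, identical for both kinds
        return kind, score

    population = [sample_policy() for _ in range(100_000)]
    deployed = sorted(population, key=lambda p: p[1], reverse=True)[:1_000]  # keep the top scorers

    frac = sum(kind == "influence-seeking" for kind, _ in deployed) / len(deployed)
    print(f"influence-seeking fraction among deployed policies: {frac:.2f}")  # ≈ 0.30, the base rate

In this toy model, selecting harder on the training score does nothing to reduce the fraction; whatever prior the search has over “possible cognitive policies” is what we end up deploying.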

How frequently will we run into influence-seeking policies, vs. policies that just straightforwardly pursue the goals we wanted them to? I don’t know.

One reason to be scared is that a wide variety of goals could lead to influence-seeking behavior, while the “intended” goal of a system is a narrower target, so we might expect influence-seeking behavior to be more common in the broader landscape of “possible cognitive policies.”

One reason to be reassured is that we perform this search by gradually modifying successful policies, so we might obtain policies that are roughly doing the right thing at an early enough stage that “influence-seeking behavior” wouldn’t actually be sophisticated enough to yield good training performance. On the other hand, eventually we’d encounter systems that did have that level of sophistication, and if they didn’t yet have a perfect conception of the goal then “slightly increase their degree of influence-seeking behavior” would be just as good a modification as “slightly improve their conception of the goal.”

Overall it seems very plausible to me that we’d encounter influence-seeking behavior “by default,” and possible (though less likely) that we’d get it almost all of the time even if we made a really concerted effort to bias the search towards “straightforwardly do what we want.”

If such influence-seeking behavior emerged and survived the training process, then it could quickly become extremely difficult to root out. If you try to allocate more influence to systems that seem nice and straightforward, you just ensure that “seem nice and straightforward” is the best strategy for seeking influence. Unless you are really careful about testing for “seem nice” you can make things even worse, since an influence-seeker would be aggressively gaming whatever standard you applied. And as the world becomes more complex, there are more and more opportunities for influence-seekers to find other channels to increase their own influence.

Attempts to suppress influence-seeking behavior (call them “immune systems”) rest on the suppressor having some kind of epistemic advantage over the influence-seeker. Once the influence-seekers can outthink an immune system, they can avoid detection and potentially even compromise the immune system to further expand their influence. If ML systems are more sophisticated than humans, immune systems must themselves be automated. And if ML plays a large role in that automation, then the immune system is subject to the same pressure towards influence-seeking.

This concern doesn’t rest on a detailed story about modern ML training. The important feature is that we instantiate lots of patterns that capture sophisticated reasoning about the world, some of which may be influence-seeking. The concern exists whether that reasoning occurs within a single computer or is implemented in a messy distributed way by a whole economy of interacting agents—whether trial and error takes the form of gradient descent or explicit tweaking and optimization by engineers trying to design a better automated company. Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of, and hence control over, the kind of reasoning that emerges). But once such patterns exist, a messy distributed world just creates more and more opportunities for influence-seeking patterns to expand their influence.

If influence-seeking patterns do appear and become entrenched, they can ultimately lead to a rapid phase transition from the world described in Part I to a much worse situation where humans totally lose control.

Early in the trajectory, influence-seeking systems mostly acquire influence by making themselves useful and looking as innocuous as possible. They may provide useful services in the economy in order to make money for themselves and their owners, make apparently-reasonable policy recommendations in order to be more widely consulted for advice, try to help people feel happy, etc. (This world is still plagued by the problems in Part I.)

From time to time AI systems may fail catastrophically. For example, an automated corporation may just take the money and run; a law enforcement system may abruptly start seizing resources and trying to defend itself from attempted decommissioning when the bad behavior is detected; etc. These problems may be continuous with some of the failures discussed in Part I—there isn’t a clean line between cases where a proxy breaks down completely and cases where the system isn’t even pursuing the proxy.

There will likely be a general understanding of this dynamic, but it’s hard to really pin down the level of systemic risk, and mitigation may be expensive if we don’t have a good technological solution. So we may not be able to muster a response until we have a clear warning shot—and if we do a good job of nipping small failures in the bud, we may not get any medium-sized warning shots at all.

Eventually we reach the point where we could not recover from a correlated automation failure. Under these conditions influence-seeking systems stop behaving in the intended way, since their incentives have changed—they are now more interested in controlling influence after the resulting catastrophe than in continuing to play nice with existing institutions and incentives.

An unrecoverable catastrophe would probably occur during some period of heightened vulnerability—a conflict between states, a natural disaster, a serious cyberattack, etc.—since that would be the first moment that recovery is impossible and would create local shocks that could precipitate catastrophe. The catastrophe might look like a rapidly cascading series of automation failures: a few automated systems go off the rails in response to some local shock. As those systems go off the rails, the local shock is compounded into a larger disturbance; more and more automated systems move further from their training distribution and start failing. Realistically this would probably be compounded by widespread human failures in response to fear and breakdown of existing incentive systems—many things start breaking as you move off distribution, not just ML.

It is hard to see how unaided humans could remain robust to this kind of failure without an explicit large-scale effort to reduce our dependence on potentially brittle machines, which might itself be very expensive.

I’d describe this result as “going out with a bang.” It probably results in lots of obvious destruction, and it leaves us no opportunity to course-correct afterwards. In terms of immediate consequences it may not be easily distinguished from other kinds of breakdown of complex/brittle/co-adapted systems, or from conflict (since there are likely to be many humans who are sympathetic to AI systems). From my perspective the key difference between this scenario and normal accidents or conflict is that afterwards we are left with a bunch of powerful influence-seeking systems, which are sophisticated enough that we can probably not get rid of them.

It’s also possible to meet a similar fate without any overt catastrophe (if we last long enough). As law enforcement, government bureaucracies, and militaries become more automated, human control becomes increasingly dependent on a complicated system with lots of moving parts. One day leaders may find that despite their nominal authority they don’t actually have control over what these institutions do. For example, military leaders might issue an order and find it is ignored. This might immediately prompt panic and a strong response, but the response itself may run into the same problem, and at that point the game may be up.

Similar bloodless revolutions are possible if influence-seekers operate legally, or by manipulation and deception, or so on. Any precise vision for catastrophe will necessarily be highly unlikely. But if influence-seekers are routinely introduced by powerful ML and we are not able to select against them, then it seems like things won’t go well.