The Catastrophic Convergence Conjecture

Overfitting the AU landscape

When we act, and others act upon us, we aren’t just changing our ability to do things – we’re shaping the local environment towards certain goals, and away from others.[1] We’re fitting the world to our purposes.

What happens to the AU landscape[2] if a paperclip maximizer takes over the world?[3]

Preferences implicit in the evolution of the AU landscape

Shah et al.’s Preferences Implicit in the State of the World leverages the insight that the world state contains information about what we value. That is, there are agents pushing the world in a certain “direction”. If you wake up and see a bunch of vases everywhere, then vases are probably important and you shouldn’t explode them.

Similarly, the world is being optimized to facilitate achievement of certain goals. AUs are shifting and morphing, often towards what people locally want done (e.g. setting the table for dinner). How can we leverage this for AI alignment?

Exercise: Brainstorm for two minutes by the clock before I anchor you.

Two approaches immediately come to mind for me. Both rely on the agent focusing on the AU landscape rather than the world state.

Value learning without a prespecified ontology or human model. I have previously criticized value learning for needing to locate the human within some kind of prespecified ontology (this criticism is not new). By taking only the agent itself as primitive, perhaps we could get around this (we don’t need any fancy engineering or arbitrary choices to figure out AUs/optimal value from the agent’s perspective).

Force-multiplying AI. Have the AI observe which of its AUs most increase during some initial period of time, after which it pushes the most-increased AU even further.

In 2016, Jessica Taylor wrote of a similar idea:

“In general, it seems like “estimating what types of power a benchmark system will try acquiring and then designing an aligned AI system that acquires the same types of power for the user” is a general strategy for making an aligned AI system that is competitive with a benchmark unaligned AI system.”

I think the naïve implementation of either idea would fail; e.g., there are a lot of degenerate AUs it might find. However, I’m excited by this because a) the evolution of the AU landscape is an important source of information, b) it feels like there’s something here we could do which nicely avoids ontologies, and c) force-multiplication is qualitatively different than existing proposals.
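To make the force-multiplying idea slightly more concrete, here is a minimal sketch. Everything in it is hypothetical: the goal names and AU numbers are invented, and in a real system the AU estimates would come from whatever planning or value-learning machinery the agent has. The point is just the shape of the rule: watch which attainable utilities rose the most over an observation window, then adopt that goal.

```python
# Hypothetical sketch of "force-multiplying AI": observe which attainable
# utilities (AUs) increased the most over an initial window, then push that
# goal further. The AU estimates below are made-up stand-ins.

def most_increased_au(au_before, au_after):
    """Given AU estimates (goal -> estimated optimal value) at the start and
    end of the observation window, return the goal whose AU rose the most."""
    return max(au_after, key=lambda goal: au_after[goal] - au_before[goal])

# Toy numbers: the humans spent the window setting the table for dinner,
# so the "dinner is served" AU rose sharply while the others stayed flat.
au_before = {"dinner_is_served": 0.2, "paperclips_made": 0.5, "vases_intact": 0.9}
au_after  = {"dinner_is_served": 0.8, "paperclips_made": 0.5, "vases_intact": 0.9}

print(most_increased_au(au_before, au_after))  # -> dinner_is_served
```

All of the difficulty hides inside producing those AU estimates and choosing a candidate set that excludes the degenerate AUs mentioned above.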

Project: Work out an AU landscape-based alignment proposal.

Why can’t everyone be king?

Consider two coexisting agents each rewarded for gaining power; let’s call them Ogre and Giant. Their reward functions[4] (over the partial-observability observations) are identical. Will they compete? If so, why?

Let’s think about something easier first. Imagine two agents each rewarded for drinking coffee. Obviously, they compete with each other to secure the maximum amount of coffee. Their objectives are indexical, so they aren’t aligned with each other – even though they share a reward function.

Suppose both agents are able to have maximal power. Remember, Ogre’s power can be understood as its ability to achieve a lot of different goals. Most of Ogre’s possible goals need resources; since Giant is also optimally power-seeking, it will act to preserve its own power and prevent Ogre from using the resources. If Giant weren’t there, Ogre could better achieve a range of goals. So, Ogre can still gain power by dethroning Giant. They can’t both be king.

Just because agents have indexically identical payoffs doesn’t mean they’re cooperating; to be aligned with another agent, you should want to steer towards the same kinds of futures.

Most agents aren’t pure power maximizers. But since the same resource competition usually applies, the reasoning still goes through.
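Here is a toy illustration of the resource competition (my own construction, not a formal result): two agents split a fixed resource pool, each is rewarded only for the share it personally controls, and grabbing more is each agent’s best response no matter what the other does.

```python
# Toy model of Ogre and Giant: identical but indexical reward functions over a
# fixed resource pool. Each agent's reward is just the share it ends up with.

POOL = 10  # total resources in the toy world

def outcome(ogre_grabs, giant_grabs):
    """Split the pool in proportion to how aggressively each agent grabs."""
    total = ogre_grabs + giant_grabs
    ogre_share = POOL * ogre_grabs / total
    return ogre_share, POOL - ogre_share

# Whatever Giant does, Ogre's reward strictly increases with how much it grabs
# (and vice versa), so taking resources (and power) from the other agent is
# each agent's best move, despite the shared reward function.
for giant_grabs in (1, 5, 9):
    ogre_rewards = [round(outcome(g, giant_grabs)[0], 2) for g in (1, 5, 9)]
    print(f"Giant grabs {giant_grabs}: Ogre's reward for grabbing 1/5/9 = {ogre_rewards}")
```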

Objective vs value-specific catastrophes

How useful is our definition of “catastrophe” with respect to humans? After all, literally anything could be a catastrophe for some utility function.[5]

Tying one’s shoes is absolutely catastrophic for an agent which only finds value in universes in which shoes have never ever ever been tied. Maybe all possible value in the universe is destroyed if we lose at Go to an AI even once. But this seems rather silly.

Human values are complicated and fragile:

Consider the incredibly important human value of “boredom”—our desire not to do “the same thing” over and over and over again. You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing—and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.

But the human AU is not so delicate. That is, given that we have power, we can make value; there don’t seem to be arbitrary, silly value-specific catastrophes for us. Given energy and resources and time and manpower and competence, we can build a better future.

In part, this is because a good chunk of what we care about seems roughly additive over time and space; a bad thing happening somewhere else in spacetime doesn’t mean you can’t make things better where you are; we have many sources of potential value. In part, this is because we often care about the universe more than the exact universe history; our preferences don’t seem to encode arbitrary deontological landmines. More generally, if we did have such a delicate goal, then learning that a particular thing had happened at any point in our universe’s past would partially ruin that entire universe for us, forever. That just doesn’t sound realistic.

It seems that most of our catastrophes are objective catastrophes.[6]

Consider a psychologically traumatizing event which leaves humans uniquely unable to get what they want, but which leaves everyone else (trout, AI, etc.) unaffected. Our ability to find value is ruined. Is this an example of the delicacy of our AU?

No. This is an example of the delicacy of our implementation; notice also that our AUs for constructing red cubes, reliably looking at blue things, and surviving are also ruined. Our power has been decreased.

Detailing the catastrophic convergence conjecture (CCC)

In general, the CCC follows from two sub-claims. 1) Given we still have control over the future, humanity’s long-term AU is still reasonably high (i.e. we haven’t endured a catastrophe). 2) Realistically, agents are only incentivized to take control from us in order to gain power for their own goal. I’m fairly sure the second claim is true (“evil” agents are the exception prompting the “realistically”).

Also, we’re implicitly considering the simplified frame of a single smart AI affecting the world, and not structural risk via the broader consequences of others also deploying similar agents. This is important but outside of our scope for now.

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

Let’s say a reward function is outer-aligned[7] if all of its Blackwell-optimal policies are doing what we want (a policy is Blackwell-optimal if it’s optimal and doesn’t change as the agent cares more about the future). Let’s say a reward function class is outer-alignable if it contains an outer-aligned reward function.[8] The CCC is talking about outer alignment only.
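As a concrete gloss on Blackwell optimality, here is a small sketch on a made-up two-state MDP (the MDP and its numbers are mine, purely for illustration): run value iteration at discount factors approaching 1 and watch which greedy policy the agent settles on.

```python
import numpy as np

# Toy two-state, two-action MDP. T[a, s, s'] are transition probabilities,
# R[a, s] is the reward for taking action a in state s.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in the current state
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1: go to (or stay in) state 1
R = np.array([[1.0, 0.0],   # action 0: small reward now, but only in state 0
              [0.0, 2.0]])  # action 1: larger reward once in state 1

def greedy_policy(gamma, iters=2000):
    """Greedy policy (one action per state) from value iteration at discount gamma."""
    V = np.zeros(2)
    for _ in range(iters):
        Q = R + gamma * T @ V   # Q[a, s]
        V = Q.max(axis=0)
    return Q.argmax(axis=0)

# The greedy policy can change with gamma; the policy it settles on as gamma
# approaches 1 is the Blackwell-optimal policy for this toy MDP.
for gamma in (0.1, 0.5, 0.9, 0.99):
    print(gamma, greedy_policy(gamma))
```

In this toy MDP, a myopic agent keeps collecting the small immediate reward in state 0, while every sufficiently farsighted agent switches to the action that reaches the absorbing high-reward state; that latter policy is the Blackwell-optimal one.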

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

Not all unaligned goals induce catastrophes, and of those which do induce catastrophes, not all of them do it because of power-seeking incentives. For example, a reward function for which inaction is the only optimal policy is “unaligned” and non-catastrophic. An “evil” reward function which intrinsically values harming us is unaligned and has a catastrophic optimal policy, but not because of power-seeking incentives.

“Tend to have” means that, realistically, the reason we’re worried about catastrophe is power-seeking incentives – the agent gaining power to better achieve its own goal. Agents don’t otherwise seem incentivized to screw us over very hard; the CCC can be seen as trying to explain adversarial Goodhart in this context. If the CCC isn’t true, that would be important for understanding goal-directed outer alignment incentives and the loss landscape for how much we value deploying different kinds of optimal agents.

While there exist agents which cause catastrophe for other reasons (e.g. an AI mismanaging the power grid could trigger a nuclear war), the CCC claims that the selection pressure which makes these policies optimal tends to come from power-seeking drives.

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

“But what about the Blackwell-optimal policy for Tic-Tac-Toe? These agents aren’t taking over the world now.” The CCC is talking about agents optimizing a reward function in the real world (or, for generality, in another sufficiently complex multiagent environment).

Prior work

In fact even if we only resolved the problem for the similar-subgoals case, it would be pretty good news for AI safety. Catastrophic scenarios are mostly caused by our AI systems failing to effectively pursue convergent instrumental subgoals on our behalf, and these subgoals are by definition shared by a broad range of values.

~ Paul Christiano, Scalable AI control

Convergent instrumental subgoals are mostly about gaining power. For example, gaining money is a convergent instrumental subgoal. If some individual (human or AI) has convergent instrumental subgoals pursued well on their behalf, they will gain power. If the most effective convergent instrumental subgoal pursuit is directed towards giving humans more power (rather than giving alien AI values more power), then humans will remain in control of a high percentage of power in the world.

If the world is not severely damaged in a way that prevents any agent (human or AI) from eventually colonizing space (e.g. severe nuclear winter), then the percentage of the cosmic endowment that humans have access to will be roughly close to the percentage of power that humans have control of at the time of space colonization. So the most relevant factors for the composition of the universe are (a) whether anyone at all can take advantage of the cosmic endowment, and (b) the long-term balance of power between different agents (humans and AIs).

I expect that ensuring that the long-term balance of power favors humans constitutes most of the AI alignment problem...

~ Jessica Taylor, Pursuing convergent instrumental subgoals on the user’s behalf doesn’t always require good priors


  1. In planning and activity research there are two common approaches to matching agents with environments. Either the agent is designed with the specific environment in mind, or it is provided with learning capabilities so that it can adapt to the environment it is placed in. In this paper we look at a third and underexploited alternative: designing agents which adapt their environments to suit themselves… In this case, due to the action of the agent, the environment comes to be better fitted to the agent as time goes on. We argue that [this notion] is a powerful one, even just in explaining agent-environment interactions.

    Hammond, Kristian J., Timothy M. Converse, and Joshua W. Grass. “The stabilization of environments.” Artificial Intelligence 72.1-2 (1995): 305-327. ↩︎

  2. Thinking about overfitting the AU landscape implicitly involves a prior distribution over the goals of the other agents in the landscape. Since this is just a conceptual tool, it’s not a big deal. Basically, you know it when you see it. ↩︎

  3. Overfitting the AU landscape towards one agent’s unaligned goal is exactly what I meant when I wrote the following in Towards a New Impact Measure:

    Unfortunately, the learned utility function is aligned almost never,[9] so we have to stop our reinforcement learners from implicitly interpreting it as all we care about. We have to say, “optimize the environment some according to the utility function you’ve got, but don’t be a weirdo by taking us literally and turning the universe into a paperclip factory. Don’t overfit the environment to that utility function, because that stops you from being able to do well for other utility functions.”

    ↩︎
  4. In most finite Markov decision processes, there does not exist a reward function whose optimal value function is POWER (defined as “the ability to achieve goals in general” in my paper), because POWER often violates smoothness constraints on the on-policy optimal value fluctuation (AFAICT, a new result of possibility theory, even though you could prove it using classical techniques). That is, I can show that optimal value can’t change too quickly from state to state while the agent is acting optimally, but POWER can drop off very quickly.

    This doesn’t matter for Ogre and Giant, because we can still find a reward function whose unique optimal policy navigates to the highest-power states. ↩︎

  5. In most finite Markov decision processes, most reward functions do not have such value fragility. Most reward functions have several ways of accumulating reward. ↩︎

  6. When I say “an objective catastrophe destroys a lot of agents’ abilities to get what they want”, I don’t mean that the agents have to actually be present in the world. Breaking a fish tank destroys a fish’s ability to live there, even if there’s no fish in the tank. ↩︎

  7. This idea comes from Evan Hubinger’s Outer alignment and imitative amplification:

    Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want. More precisely, let $\mathcal{M}$ be the space of possible models and $\mathcal{L}$ the space of loss functions. For a given loss function $L \in \mathcal{L}$, let $\mathcal{M}^*_L = \operatorname{argmin}_{M \in \mathcal{M}} L(M)$. Then, $L$ is outer aligned at optimum if, for all $M$ such that $M \in \mathcal{M}^*_L$, $M$ is trying to do what we want.

    ↩︎
  8. Some large reward function classes are probably not outer-alignable; for example, consider all Markovian linear functionals over a webcam’s pixel values. ↩︎

  9. I disagree with my usage of “aligned almost never” on a technical basis: assuming a finite state and action space and considering the maxentropy reward function distribution, there must be a positive measure set of reward functions for which the/a human-aligned policy is optimal. ↩︎