Non-Adversarial Goodhart and AI Risks

In a recent paper by Scott Garrabrant and myself, we formalized and extended the categories Scott proposed for Goodhart-like phenomena. (If you haven’t read either his post or the new paper, it’s important background for most of this post.)

Here, I lay out my further intuitions about how and where the non-adversarial categories matter for AI safety. Specifically, I view these categories as critical for preventing accidental superhuman AI, or near-term paperclipping. This makes them particularly crucial in the short term.

I do not think that most of the issues highlighted are new, but I think the framing is useful, and hopefully it clearly presents why causal mistakes by agentic AI are harder problems than is normally appreciated.

Epistemic Status: Provisional and open to revision based on new arguments, but arrived at after significant consideration. I believe conclusions 1-4 are restatements of well-understood claims in AI safety. I believe conclusions 5 and 6 are less well appreciated.

Side Note: I am deferring discussion of adversarial Goodhart to the other paper and a later post; it is arguably more important, but in very different ways. The deferred topics include most issues with multiple agentic AIs that interact, and issues with pre-specifying a control scheme for a superhuman AI.

Goodhart Effects Review—Read the paper for details!

Regressional Goodhart—When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
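This selection effect can be seen in a toy simulation (an illustrative sketch, not anything from the paper): score many worlds on a proxy that equals the true goal plus independent noise, and keep the top slice by proxy.

```python
import random

random.seed(0)

# Each world has a true goal value; the proxy is the goal plus independent noise.
goals = [random.gauss(0, 1) for _ in range(100_000)]
samples = [(g, g + random.gauss(0, 1)) for g in goals]

# Select the top 1% of worlds by the proxy.
selected = sorted(samples, key=lambda gp: gp[1], reverse=True)[:1000]

mean_proxy = sum(p for _, p in selected) / len(selected)
mean_goal = sum(g for g, _ in selected) / len(selected)

# The selected worlds look excellent on the proxy, but the goal lags:
# part of the selection pressure fell on the noise term.
print(f"mean proxy among selected: {mean_proxy:.2f}")
print(f"mean goal among selected:  {mean_goal:.2f}")
```

With equal goal and noise variances, the selected worlds' average goal value is roughly half their average proxy value: the harder you select on the proxy, the more of what you select is noise.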

Extremal Goodhart—Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the relationship between the proxy and the goal was observed. This occurs in the form of Model Insufficiency, or Change in Regime.
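A minimal sketch of the regime-change version (all numbers are invented for illustration): a model fit only on ordinary worlds sees a clean linear proxy–goal relationship and extrapolates it into a regime where the relationship has reversed.

```python
# True goal as a function of the proxy: in the ordinary regime they track
# each other, but past a threshold the regime changes and pushing the
# proxy further actively hurts the goal.
def true_goal(proxy):
    if proxy <= 10:
        return proxy              # ordinary worlds: proxy tracks the goal
    return 10 - (proxy - 10)      # extreme worlds: the relationship reverses

# A model fit only on ordinary worlds (proxy in [0, 10]) sees a clean
# linear relationship and extrapolates it without noticing the change.
def modeled_goal(proxy):
    return proxy

# The model and reality agree in-sample, then diverge badly out-of-sample.
for proxy in [5, 10, 15, 20]:
    print(proxy, modeled_goal(proxy), true_goal(proxy))
```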

Causal Goodhart—When the causal path between the proxy and the goal is indirect, intervening can change the relationship between the proxy and the goal, and optimizing can then cause perverse effects.
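One way to see the problem, as an illustrative sketch: when the proxy and the goal share a hidden common cause rather than lying on one causal path, observing a high proxy predicts a high goal, but intervening to set the proxy does nothing to the goal.

```python
import random

random.seed(1)

def sample(do_proxy=None):
    """One draw from a world where a hidden cause drives both proxy and goal."""
    cause = random.gauss(0, 1)
    if do_proxy is None:
        proxy = cause + random.gauss(0, 0.1)   # observed proxy tracks the cause
    else:
        proxy = do_proxy                       # intervention severs that link
    goal = cause + random.gauss(0, 0.1)        # goal depends on the cause only
    return proxy, goal

n = 50_000
observed = [sample() for _ in range(n)]
high = [g for p, g in observed if p > 1]
obs_goal = sum(high) / len(high)               # goal when we *observe* a high proxy

intervened = [sample(do_proxy=2.0) for _ in range(n)]
do_goal = sum(g for _, g in intervened) / n    # goal when we *set* the proxy high

print(f"goal given observed high proxy: {obs_goal:.2f}")
print(f"goal given intervention on proxy: {do_goal:.2f}")
```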

Adversarial Goodhart will not be discussed in this post. It occurs in two ways. Misalignment—The agent applies selection pressure knowing the regulator will apply different selection pressure on the basis of the metric. This allows the agent to hijack the regulator’s optimization. Cobra Effect—The regulator modifies the agent’s goal, usually via an incentive, to correlate it with the regulator’s metric. The agent then either 1) uses selection pressure to create extremal Goodhart effects or to make regressional Goodhart effects more severe, or 2) acts to change the causal structure, due to incompletely aligned goals, in a way that creates a Goodhart effect.

Regressional and Extremal Goodhart

The first two categories of Goodhart-like phenomena, regressional and extremal, are over-optimization mistakes, and in my view these mistakes should be avoidable. This is not to say we don’t need to treat AI as a cryptographic rocket probe, or that we don’t need to be worried about it—just that we already know what to be concerned about. This seems vaguely related to what Scott Alexander calls “Mistake Theory”—the risks would be technically solvable if we could convince people not to do the stupid things that make them happen.

Regressional Goodhart, as Scott Garrabrant correctly noted, is an unavoidable phenomenon when doing unconstrained optimization using a fixed metric which is imperfectly correlated with a goal. To avoid the problems of overoptimization despite this unavoidable phenomenon, a safe AI system must 1) have limited optimization power to allow robustness to misalignment, perhaps via satisficing, low-impact agents, or suspendable/stoppable agents, and/or 2) involve a metric which is adaptive, using techniques like oversight or reinforcement learning. This allows humans to realign the AI, and safe approaches should ensure there are other ways to enforce robustness to scaling up.

Conclusion 1 - Don’t allow unconstrained optimization using a fixed metric.

Extremal Goodhart effects are less unavoidable mistakes of overoptimization, and my intuition is that they should be addressable in similar ways. We need to be able to detect regime changes or increasing misalignment of the metric, but the strategies that address regressional effects should be closely related, or useful in the same cases. Again, it’s not easy, but it’s a well-defined problem.

Conclusion 2 - When exploring the fitness landscape, don’t jump down the optimization slope too quickly before double-checking externally. This is especially true when moving to out-of-sample areas.

Despite the challenges, I think that the divergences between goals and metrics in the first two Goodhart-like effects can be understood and addressed beforehand, and these techniques are being actively explored. In fact, I think this describes at least a large plurality of the current work being done on AI alignment.

The Additional Challenge of Causality

Causal Goodhart, like the earlier two categories, is always a mistake of understanding. Unlike the first two, it seems less easily avoidable by being cautious. The difficulty of inferring causality correctly means that it’s potentially easy to accidentally screw up an agentic AI’s world model in a way that allows causal mistakes to be made. I’m unsure the approaches being considered for AI safety are properly careful about this fact. (I am not very familiar with the various threads of AI safety research, so I may be mistaken on that count.)

Accounting for uncertainty about causal models is critical, but given the multiplicity of possible models, we run into the problems of computation seen in AIXI. (And even AIXI doesn’t guarantee safety!)

So inferring causal structure is NP-hard. Scott’s earlier post claims that “you can try to infer the causal structure of the variables using statistical methods, and check that the proxy actually causes the goal before you intervene on the proxy.” The problem is that we can’t actually infer causal structure well, even given RCTs, without simultaneously testing the full factorial set of cases. (And even then, statistics is hard, and can be screwed up accidentally in complex and unanticipated ways.) Humans infer causality partly intuitively, but in more complex systems, badly. They can be taught to do it better (PDF), but only in narrow domains.
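The limits of purely statistical inference can be made concrete with a sketch (parameters invented for illustration): the two models below, "X causes Y" and "Y causes X", are constructed to produce the same joint distribution, so no amount of observational data distinguishes them; only an intervention can.

```python
import random

random.seed(2)

n = 100_000

# Model A: X causes Y.
xs_a = [random.gauss(0, 1) for _ in range(n)]
ys_a = [0.8 * x + random.gauss(0, 0.6) for x in xs_a]

# Model B: Y causes X, with parameters chosen so that the joint
# distribution of (X, Y) matches Model A's exactly.
ys_b = [random.gauss(0, 1) for _ in range(n)]
xs_b = [0.8 * y + random.gauss(0, 0.6) for y in ys_b]

def corr(xs, ys):
    """Pearson correlation, computed directly."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    vx = sum((x - mx) ** 2 for x in xs) / len(xs)
    vy = sum((y - my) ** 2 for y in ys) / len(ys)
    return cov / (vx * vy) ** 0.5

# Every observational statistic agrees; the causal structures do not.
print(f"corr under 'X causes Y': {corr(xs_a, ys_a):.3f}")
print(f"corr under 'Y causes X': {corr(xs_b, ys_b):.3f}")
```

Intervening on the proxy gives opposite results in the two models, which is exactly the information the statistics cannot supply.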

Conclusion 3 - Getting causality right is an intrinsically computationally hard and sample-inefficient problem, and building AI won’t fix that.

As Pearl notes, policy is hard in part because knowing exactly these complex causal factors is hard. This isn’t restricted to AI, and it also happens in [insert basically any public policy that you think we should stop already here]. (Politics is the mind-killer, and policy debates often center around claims about causality. No, I won’t give contemporary examples.)

We don’t even get causality right in the relatively simpler policy systems we already construct—hence Chesterton’s fence, Boustead’s Iron Law of Intervention, and the fact that intellectuals throughout history routinely start advocating strongly for things that turn out to be bad when actually applied. They never actually apologize for accidentally starving and killing 5% of their population. This, of course, is because their actual idea was good, it was just done badly. Obviously, real killing of birds to reduce pests in China has never been tried.

Conclusion 4 - Sometimes the perverse effects of getting it a little bit wrong are really, really bad, especially because perverse effects may only be obvious after long delays.

There are two parts to this issue, the first of which is that mistaken causal structure can lead to regressional or extremal Goodhart. This is not causal Goodhart, and isn’t more worrisome than those issues, since the earlier-mentioned solutions still apply. The second part is that the action taken by the regulator may actually change the causal structure. The regulator thinks it is doing something simple, like removing a crop-eating predator, but the assumed relationship between crop-eating and birds ignores the fact that the birds also eat other pests. This is much more worrisome, and harder to avoid.
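The bird example can be sketched as a toy model (all numbers invented for illustration): the regulator's causal model includes only the birds-eat-grain edge, while the real structure also includes birds-eat-pests.

```python
# Toy harvest model. The regulator's causal model has one edge
# (birds eat grain); the real world adds a second path (birds eat the
# insects that eat far more grain). All numbers are invented.
def harvest(birds, regulator_model=False):
    eaten_by_birds = 2 * birds
    if regulator_model:
        insects = 10                       # model treats the pest level as fixed
    else:
        insects = 10 + 5 * (10 - birds)    # reality: fewer birds, more pests
    eaten_by_insects = 3 * insects
    return 1000 - eaten_by_birds - eaten_by_insects

# Removing the birds looks like a clear win inside the regulator's model...
print(harvest(birds=0, regulator_model=True), harvest(birds=10, regulator_model=True))
# ...and backfires once the intervention changes the real causal structure.
print(harvest(birds=0), harvest(birds=10))
```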

This second case is causal Goodhart. The mistake can occur as soon as you allow a regulator—machine learning, AI, or otherwise—to interact in arbitrary ways with wider complex systems directly to achieve specified goals, without specifying and restricting the methods to be used.

These problems don’t show up in currently deployed systems because humans typically choose the set of available actions based on the causal understanding needed. The challenge is also not seen in toy worlds, since testing domains are usually very well specified, and inferring causality becomes difficult only when the system being manipulated contains complex and not-fully-understood causal dynamics. (A possible counterexample is a story I heard secondhand about OpenAI developing what became the World of Bits system. Giving an RL system access to a random web browser and the mouse led to weird problems, including, if I recall correctly, the entire system crashing.)

Conclusion 5 - This class of causal-mistake problem should be expected to show up as proto-AI systems are fully deployed, not beforehand when they are tested in limited cases.

This class of problem does not seem to be addressed by most of the AI-risk approaches currently being suggested or developed. (The only approach that avoids it is using Oracle AIs.) There seems to be no alternative to using a tentative causal understanding for decision making if we allow any form of agentic AI. The problems this causes are not usually obvious to either the observer or the agent until the decision has been implemented.

Note that attempting to minimize the impact of a choice is done using the same mistaken or imperfect causal model that led to the decision, so the problem cannot be avoided this way. Humans providing reinforcement learning based on the projected outcomes of the decision are similarly unaware of the perverse effect, and impact minimization assumes that the projected impact is correct.

Conclusion 6 - Impact minimization strategies do not seem likely to fix the problems of causal mistakes.


It seems that the class of issues identified in the realm of Goodhart-like phenomena illustrates some potential advantages and issues worth considering in AI safety. The problems identified in part simply restate problems that are already understood, but the framework seems worth further consideration. Most critically, a better understanding of causal mistakes and causal Goodhart effects would potentially be valuable. If the conclusions here are incorrect, understanding why also seems useful for understanding the ways in which AI risk can and cannot manifest.