(Some?) Possible Multi-Agent Goodhart Interactions

Epistemic Status: I need feedback on these ideas, and I’ve been delaying because I’m not sure I’m on the right track. This is the product of a lot of thinking, but I’m not sure the list is complete, or that I’m not missing something important. (Note: This is intended to form a large part of a paper to be submitted to the journal special issue here.)

Following up on Scott Garrabrant’s earlier post on Goodhart’s Law and the resulting paper, I wrote a further discussion of non-adversarial Goodhart, and explicitly deferred discussion of the adversarial case. I’ve been working on that.

Also note that these are often reformulations or categorizations of other terms (treacherous turn, faulty reward functions, distributional shift, reward hacking, etc.). It might be good to clarify exactly what went where, but I’m unsure.

To (finally) start, here is Scott’s “Quick Reference” for the initial 4 methods, which is useful for this post as well. I’ve partly replaced the last one with the equivalent cases from the arXiv paper.

Quick Reference

  • Regressional Goodhart—When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.

    • Model: When U is equal to V+X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U.

    • Example: height is correlated with basketball ability, and does actually directly help, but the best player is only 6′3″, and a random 7′ person in their 20s would probably not be as good.

  • Causal Goodhart—When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.

    • Model: If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V.

    • Example: someone who wishes to be taller might observe that height is correlated with basketball skill and decide to start practicing basketball.

  • Extremal Goodhart—Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.

    • Model: Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation.

    • Example: the tallest person on record, Robert Wadlow, was 8′11″ (2.72m). He grew to that height because of a pituitary disorder, and he would have struggled to play basketball because he “required leg braces to walk and had little feeling in his legs and feet.”

  • (See below from the arXiv paper.) Adversarial Goodhart—When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.

    • Model: Removed. See below.

    • Example: Removed. See below.
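To make the regressional case concrete, here is a minimal simulation of the U = V + X model (toy Gaussian numbers, nothing more): selecting the candidates with the highest proxy values predictably overstates the goal.

```python
import random

random.seed(0)

# Toy regressional Goodhart: proxy U = goal V + independent noise X.
# Candidates chosen for high U are predictably worse on V than U suggests.
candidates = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10_000)]
scored = sorted(((v + x, v) for v, x in candidates), reverse=True)

top = scored[:100]  # the 100 candidates with the highest proxy values
mean_u = sum(u for u, _ in top) / len(top)
mean_v = sum(v for _, v in top) / len(top)
print(f"mean proxy U among selected: {mean_u:.2f}")
print(f"mean true V among selected:  {mean_v:.2f}")  # roughly half of mean U
```

The gap is exactly the noise that was selected for: with independent standard normals, the expected V given U is U/2.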

From the arXiv Paper: (Note—I think this is still incomplete, and focuses far too much on the Agent-Regulator framing. See below.)

  • Adversarial Misalignment Goodhart—The agent applies selection pressure knowing the regulator will apply different selection pressure on the basis of the metric. The adversarial misalignment failure can occur due to the agent creating extremal Goodhart effects, or by exacerbating always-present regressional Goodhart, or due to causal intervention by the agent which changes the effect of the regulator optimization.

  • Campbell’s Law—Agents select a metric knowing the choice of regulator metric. Agents can correlate their metric with the regulator’s metric, and select on their metric. This further reduces the usefulness of selection using the metric for achieving the original goal.

  • Normal Cobra Effect—The regulator modifies the agent goal, usually via an incentive, to correlate it with the regulator metric. The agent then acts by changing the observed causal structure due to incompletely aligned goals in a way that creates a Goodhart effect.

  • Non-Causal Cobra Effect—The regulator modifies the agent goal to make agent actions aligned with the regulator’s metric. Under selection pressure from the agent, extremal Goodhart effects occur or regressional Goodhart effects are worsened.
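A minimal sketch of the mechanism these cases share (my own toy numbers, not from the paper): once an agent applies selection pressure keyed to the regulator’s metric, the metric’s correlation with the true goal collapses within the selected pool.

```python
import random

def pearson(xs, ys):
    # plain-Python Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(1)

# Over naturally occurring options, the regulator's metric M tracks the goal V.
pool = []
for _ in range(10_000):
    v = random.gauss(0, 1)
    m = v + random.gauss(0, 1)  # M = V + noise, so corr(M, V) is high
    pool.append((m, v))

corr_natural = pearson([m for m, _ in pool], [v for _, v in pool])

# An agent that knows M is the selection criterion pushes forward only its
# highest-M options, regardless of V; within the selected pool, M no longer
# tracks V nearly as well.
selected = sorted(pool, reverse=True)[:1000]
corr_selected = pearson([m for m, _ in selected], [v for _, v in selected])

print(f"corr(M, V), natural options:     {corr_natural:.2f}")
print(f"corr(M, V), agent-selected on M: {corr_selected:.2f}")
```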

New: 5 Ways Multiple Agents Ruin Everything

To fix that insufficient bullet point above, here is a list of 5 forms of optimization failures that can occur in multi-agent systems. I intend for the new sub-list to be both exhaustive and non-overlapping, but I’m not sure either is true. For obvious reasons, the list is mostly human examples, and I haven’t formalized these into actual system models. (Anyone who would like to help me do so would be welcome!)

Note that the list only discusses things that happen due to optimization failures and interactions. Also note that most examples are 2-party. There may be complex and specific 3-party or N-party failure modes that are not captured, but I can’t find any.

1) (Accidental) Steering is when one agent alters the system in ways not anticipated by another agent, creating one of the above-mentioned over-optimization failures for the victim.

This is particularly worrisome when multiple agents have closely related goals, even if those goals are aligned.

Example 1.1 A system may change due to a combination of actors’ otherwise benign influences, either putting the system in an extremal state or triggering a regime change.

Example 1.2 In the presence of multiple agents without coordination, manipulation of factors not already being manipulated by other agents is likely to be easier and more rewarding, potentially leading to inadvertent steering.
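A toy version of Example 1.1, with an invented system and numbers: each agent’s planned nudge is safe in isolation, but neither models the other’s influence, and the interaction pushes the system past a threshold.

```python
# Toy shared system: two agents, two inputs, one output. Each agent plans
# assuming the other input stays at its baseline of zero.

SAFE_LIMIT = 10.0

def system_output(a, b):
    return a + b + 0.5 * a * b  # interaction term amplifies joint pressure

plan_a = 4.0  # agent A's plan: system_output(4, 0) = 4.0, well within limits
plan_b = 4.0  # agent B's plan: system_output(0, 4) = 4.0, also fine alone

combined = system_output(plan_a, plan_b)
print(f"each agent's expected output: {system_output(plan_a, 0.0):.1f}")  # 4.0
print(f"actual combined output:       {combined:.1f}")  # 16.0, past the limit
```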

2) Coordination Failure occurs when multiple agents clash despite having potentially compatible goals.

Coordination is an inherently difficult task, and can in general be considered impossible\cite{Gibbard1973}. In practice, coordination is especially difficult when the goals of other agents are incompletely known or understood. Coordination failures such as Yudkowsky’s inadequate equilibria\cite{Yudkowsky2017} are stable, and coordination to escape from such an equilibrium can be problematic even when agents share goals.

Example 2.1 Conflicting instrumental goals that neither side anticipates may cause resources to be wasted on contention, for example when both agents try to accomplish the same thing in conflicting ways.

Example 2.2 Coordination limiting overuse of public goods is only possible when conflicts are anticipated or noticed and where a reliable mechanism can be devised\cite{Ostrom1990}.
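A stylized two-agent commons (toy payoffs of my own, not drawn from Ostrom) illustrates why the mechanism matters: unilateral over-harvesting pays, so restraint is unstable without coordination, and joint over-harvesting leaves both worse off.

```python
# Toy commons: a shared stock of 10.0; whatever the agents leave unharvested
# regrows and benefits both equally.

def payoffs(harvest_a, harvest_b):
    stock = 10.0 - (harvest_a + harvest_b)  # what remains after harvesting
    regrowth = max(stock, 0.0) * 0.75       # remaining stock pays both agents
    return harvest_a + regrowth, harvest_b + regrowth

print(payoffs(2.0, 2.0))  # coordinated restraint: (6.5, 6.5)
print(payoffs(5.0, 2.0))  # unilateral over-harvest pays: (7.25, 4.25)
print(payoffs(5.0, 5.0))  # joint over-harvest: (5.0, 5.0), worse for both
```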

3) Adversarial misalignment occurs when a victim agent has an incomplete model of how an opponent can influence the system, and the opponent selects for cases where the victim’s model performs poorly and/or promotes the opponent’s goal.

Example 3.1 Chess engines will choose openings for which the victim is weakest.

Example 3.2 Sophisticated financial actors can dupe victims into buying or selling an asset in order to exploit the resulting price changes.
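A minimal sketch of the selection involved (the option names and values are invented): the opponent knows both the true values and the victim’s flawed estimates, and steers play toward the option the victim most overvalues.

```python
# Hypothetical opening lines: the victim's model overvalues "open_b".
true_value = {"open_a": 0.5, "open_b": 0.1, "open_c": 0.4}       # actual worth
victim_estimate = {"open_a": 0.5, "open_b": 0.6, "open_c": 0.4}  # flawed model

# The opponent selects the line where the victim's model is most wrong in
# the victim's favor on paper, and worst in reality.
exploit = max(true_value, key=lambda k: victim_estimate[k] - true_value[k])
print(exploit)  # open_b
```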

4) Input spoofing and filtering—Filtered evidence can be provided, or false evidence can be manufactured and put into the training data stream of a victim agent.

Example 4.1 Financial actors can filter by performing transactions they don’t want seen as private transactions or dark pool transactions, or can spoof by creating offsetting transactions with only one half being reported to give a false impression of activity to other agents.

Example 4.2 Rating systems can be attacked by inputting false reviews into a system, or by discouraging reviews by those likely to be the least or most satisfied reviewers.

Example 4.3 Honeypots can be placed or Sybil attacks mounted by opponents in order to fool victims into learning from examples that systematically differ from the true distribution.
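A minimal sketch of the effect (invented numbers): fabricated observations injected into a victim’s training stream shift the statistic the victim learns away from the true distribution.

```python
import random

random.seed(2)

# The victim estimates a quantity from its data stream; the true value is 100.
honest = [random.gauss(100.0, 5.0) for _ in range(900)]
spoofed = [150.0] * 100  # manufactured observations injected by an attacker

clean_estimate = sum(honest) / len(honest)
poisoned = honest + spoofed
poisoned_estimate = sum(poisoned) / len(poisoned)

print(f"estimate from honest data: {clean_estimate:.1f}")     # near 100
print(f"estimate after spoofing:   {poisoned_estimate:.1f}")  # pulled toward 150
```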

5) Goal co-option is when an agent directly modifies the victim agent’s reward function, or manipulates variables absent from the victim’s system model.

The probability of exploitable reward functions increases with the complexity of both the agent and the system it manipulates\cite{Amodei2016}, and exploitation by other agents seems to follow the same pattern.

Example 5.1 Attackers can directly target the system on which an agent runs and modify its goals.

Example 5.2 An attacker can discover exploitable quirks in the goal function to make the second agent optimize for a new goal, as in Manheim and Garrabrant’s Campbell’s law example.
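A minimal sketch of the second case, manipulating a variable the victim treats as exogenous (the names and numbers are invented):

```python
# The victim wants to buy when the price is low, and treats `price` as a
# fact about the world rather than something an attacker can set.

def victim_reward(action, price):
    return {"buy": 10.0 - price, "wait": 0.0}[action]

def victim_policy(price):
    return max(("buy", "wait"), key=lambda a: victim_reward(a, price))

print(victim_policy(price=8.0))   # buy: the victim acts on its own goal
# By setting the posted price, the attacker chooses which action the victim's
# own optimization selects, co-opting its behavior without touching its code.
print(victim_policy(price=12.0))  # wait: behavior chosen by the attacker
```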


I’d love feedback. (I have plenty to say about applications and importance, but I’ll talk about that separately.)