Multi-Agent Overoptimization, and Embedded Agent World Models

I think this expands on the points being made in the recently completed Garrabrant / Demski Embedded Agency sequence. It also serves to connect a paper I wrote recently, which discusses mostly non-AI risks from multiple agents and expands on last year’s work on Goodhart’s Law, back to the deeper questions that MIRI is considering. Lastly, it tries to point out a bit of how all of this connects to some of the other streams of AI safety research.

Juggling Models

We don’t know how to make agents contain a complete world model that includes themselves. That’s a hard enough problem, but the problem could get much harder—and in some applications it already has. When multiple agents need to have world models, the discrepancy between the model and reality can have some nasty feedback effects that relate to Goodhart’s law, which I am now referring to more generally as overoptimization failures.

In my recent paper, I discuss the problems that arise when multiple agents interact, using poker as a motivating example. Each poker-playing agent needs to have a (simplified) model of the game in order to play (somewhat) optimally. Reasonable heuristics and machine learning already achieve super-human performance in “heads-up” (2-player) poker. But the general case of multi-player poker is a huge game, so the game gets simplified.
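
To give a crude sense of what “simplifying the game” means in practice, here is a toy sketch (my own illustration, not any real poker bot): collapse the enormous space of starting hands into a handful of strength buckets, and have the agent reason over buckets instead of exact cards.

```python
# A deliberately crude sketch (illustration only, not any real poker bot) of the
# kind of abstraction poker agents use: collapse the huge space of hands into
# a few "strength buckets" and reason over buckets instead of exact hands.
import random

RANKS = list(range(2, 15))  # 2..10, J=11, Q=12, K=13, A=14

def deal_hole_cards():
    deck = [(r, s) for r in RANKS for s in "shdc"]
    return random.sample(deck, 2)

def bucket(hole_cards, n_buckets=5):
    """Map a 2-card starting hand to one of a few strength buckets,
    using a toy heuristic (pairs and high cards are strong)."""
    (r1, _), (r2, _) = hole_cards
    score = r1 + r2 + (14 if r1 == r2 else 0)   # crude strength score, 5..42
    return min(n_buckets - 1, score * n_buckets // 43)

hand = deal_hole_cards()
print(hand, "-> bucket", bucket(hand))
# A real agent would then learn a strategy over (bucket, betting history)
# rather than over exact cards, trading fidelity for tractability.
```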

This is exactly the case where we can transition just a little bit from the world of easy decision theory, which Abram and Scott point out allows modeling “the agent and the environment as separate units which interact over time through clearly defined i/o channels,” to the world not of embedded agents, but of interacting agents. This moves just a little bit in the direction of “we don’t know how to do this.”

This partial transition happens because the agent must have some model of the decision process of the other players in order to play strategically. In that model, agents need to represent what those players will do not only in reaction to the cards, but in reaction to the bets the agent places. To do this optimally, they need a model of the other players’ (perhaps implicit) models of the agent. And building models of other players’ models seems very closely related to work like Andrew Critch’s paper on Löb’s Theorem and Cooperation.
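
As a toy illustration of this kind of recursive modeling (a sketch of my own, not anything from Critch’s paper), consider level-k reasoning in rock-paper-scissors: each agent best-responds to a model of a slightly less sophisticated opponent, so every additional level is another layer of “my model of your model of me.”

```python
# A minimal sketch of bounded recursive opponent modeling: a level-k agent
# best-responds to its model of a level-(k-1) opponent, who in turn models a
# level-(k-2) agent, and so on. Rock-paper-scissors keeps the payoffs trivial
# so that the recursion itself is the point.
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def best_response(opponent_action):
    """The action that beats what we predict the opponent will play."""
    return next(a for a in ACTIONS if BEATS[a] == opponent_action)

def level_k_action(k, level0_action="rock"):
    """A level-0 agent plays a fixed action; a level-k agent best-responds
    to its model of a level-(k-1) agent."""
    action = level0_action
    for _ in range(k):
        action = best_response(action)
    return action

for k in range(4):
    print(f"level-{k} agent plays {level_k_action(k)}")
# level-0: rock, level-1: paper, level-2: scissors, level-3: rock.
# The choice is driven entirely by the model of the opponent
# (and of the opponent's model of you), not by the cards or payoffs.
```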

That explains why I claim that building models of complex agents that have models of you, which in turn need models of them, and so on, is going to run into some of the same issues that embedded agents face, even without the need to deal with some of the harder parts of self-knowledge for agents that self-modify.

Game theory “answers” this, but it cheated.

The obvious way to model interaction is with game theory, which makes a couple of seemingly innocuous simplifying assumptions. The problem is that these assumptions are impossible to satisfy in practice.

The first is that the agents are rational and Bayesian. But as Chris Sims pointed out, there are no real Bayesians. (“Not that there’s something better out there.”)

• There are fewer than 2 truly Bayesian chess players (probably none).
• We know the optimal form of the decision rule when two such players play each other: either white resigns, black resigns, or they agree on a draw, all before the first move.
• But picking which of these three is the right rule requires computations that are not yet complete.
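
To make the “decided before the first move” point concrete, here is a toy analogue (my illustration): tic-tac-toe is small enough that a genuinely ideal player can finish the computation, and the game’s value, a draw, is known before anyone moves. Chess has the same structure in principle; we just cannot complete the computation.

```python
# Toy analogue of the "Bayesian chess" point above: for a game small enough to
# solve exhaustively, perfect players know the outcome before the first move.
# Tic-tac-toe stands in for chess purely because its game tree is tiny.
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Game value for 'X' (+1 win, 0 draw, -1 loss) under perfect play."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    moves = [value(board[:i] + player + board[i + 1:],
                   "O" if player == "X" else "X")
             for i, sq in enumerate(board) if sq == " "]
    return max(moves) if player == "X" else min(moves)

# The outcome of the whole game is settled before anyone moves: a draw.
print(value(" " * 9, "X"))  # -> 0
```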

This is (kind of) a point that Abram and Scott made in the sequence, in disguise—that agents’ world models are always smaller than the world they need to model.

The second assumption is that agents have common knowledge of both agents’ objective functions. (Ben Pace points out how hard that assumption is to realize in practice. And yes, you can avoid this assumption by specifying that they have uncertainty of a defined form, but that just kicks the can down the road—how do you know what distributions to use? What happens if the agent’s true utility is outside the hypothesis space?) If the agents’ models must be small, however, it is possible that they cannot contain a complete model of the other agent’s preferences.
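
Here is a minimal sketch (my own, with made-up numbers) of the “true utility outside the hypothesis space” worry: an observer infers another agent’s utility from its choices, but neither candidate utility in the observer’s hypothesis space is correct, so the posterior confidently settles on a wrong answer.

```python
# A minimal sketch of the "true utility outside the hypothesis space" problem.
# We infer another agent's utility over three options from its noisy choices,
# but our hypothesis space only contains two candidate utilities, neither correct.
import numpy as np

rng = np.random.default_rng(0)
options = ["A", "B", "C"]

true_utility = np.array([0.0, 1.0, 2.0])        # what the agent actually wants
hypotheses = {                                   # our (misspecified) candidates
    "likes_A": np.array([2.0, 1.0, 0.0]),
    "likes_B": np.array([0.0, 2.0, 1.0]),
}

def choice_probs(u, beta=2.0):
    """Softmax ("noisily rational") choice model."""
    p = np.exp(beta * u)
    return p / p.sum()

# Observe the agent's choices, generated from its true utility.
observed = rng.choice(len(options), size=200, p=choice_probs(true_utility))

# Bayesian update over our two hypotheses (uniform prior).
log_post = np.zeros(len(hypotheses))
for k, u in enumerate(hypotheses.values()):
    log_post[k] = np.log(choice_probs(u))[observed].sum()
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()

for name, p in zip(hypotheses, posterior):
    print(f"P({name} | data) = {p:.3f}")
# The posterior concentrates (here on "likes_B"), but no amount of data can
# recover the true preference ordering, because it was never in the space.
```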

It’s a bit of a side-point for the embedded agents discussion, but breaking this second assumption is what allows for a series of overoptimization exploitations explored in the new paper. Some of these, like accidental steering and coordination failures, are worrying for AI alignment because they pose challenges even for cooperating agents. Others, like adversarial misalignment, input spoofing and filtering, and goal co-option, only arise in the adversarial case, but can still matter if we are concerned about subsystem alignment. And the last category, direct hacking, gets into many of the even harder problems of embedded agents.

Embedded agents, exploitation, and ending.

As I just noted, one class of issues that embedded agents have that traditional dichotomous agents do not is direct interference. If an agent hacks the software another agent is running on, there are many obvious exploits to worry about. This can’t easily happen with a defined channel. (But to digress, such exploits still do happen across defined channels. This is because people without security mindset keep building Turing-complete languages into the communication interfaces, instead of doing #LangSec properly.)

But for embedded agents the types of exploitation we need to worry about are even more general. Decision theory with embedded world models is obviously critical for Embedded Agency work, but I think it’s also critical for value alignment, since “goal inference” in practice requires inferring some baseline shared human value system from incoherent groups. (Whether or not the individual agents are incoherent.) This is in many ways a multi-agent cooperation problem—and even if we want to cooperate and share goals, and we already agreed that we should do so, cooperation can fall prey to accidental steering and coordination failures.
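
One standard way to see how a group can be incoherent even when every individual member is coherent (my illustration, not something from the paper) is a preference cycle under majority vote:

```python
# Three agents, each with perfectly transitive preferences over A, B, C,
# whose majority vote nonetheless produces a preference cycle.
from itertools import combinations

# Each agent ranks the options transitively (best first).
agents = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a majority of agents rank x above y."""
    votes = sum(1 for ranking in agents if ranking.index(x) < ranking.index(y))
    return votes > len(agents) / 2

for x, y in combinations(["A", "B", "C"], 2):
    a, b = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority prefers {a} over {b}")
# The printed majorities form a cycle (A over B, B over C, C over A), so there
# is no single utility function that represents "what the group wants," even
# though every individual agent is coherent.
```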

Lastly, Paul Christiano’s Iterated Amplification approach, which in part relies on small agents cooperating, seems to need to deal with this even more explicitly. But I’m still thinking about the connections between these problems and the ones his approach faces, and I’ll wait for his sequence to be finished, and for time to think about it, before commenting on this and getting more clarity.