Conceptual Problems with UDT and Policy Selection

Abstract

UDT doesn’t give us conceptual tools for dealing with multiagent coordination problems. There may have initially been some hope, because a UDT player can select a policy which incentivises others to cooperate, or because UDT can reason (EDT-style) that other UDTs are more likely to cooperate if it cooperates itself, or other lines of thought. However, it now appears that this doesn’t pan out, at least not without other conceptual breakthroughs (which I suggest won’t look that much like UDT). I suggest this is connected with UDT’s difficulties handling logical uncertainty.

Introduction

I tend to mostly think of UDT as the ideal, with other decision theories being of interest primarily because we don’t yet know how to generalize UDT beyond the simplistic domain where it definitely makes sense. This perspective has been increasingly problematic for me, however, and I now feel I can say some relatively concrete things about UDT being wrong in spirit rather than only in detail.

Relatedly, in late 2017 I made a post titled Policy Selection Solves Most Problems. Policy selection is the compromise solution which does basically what UDT is supposed to do, without the same degree of conceptual elegance, and without providing the hoped-for clarity which was a major motivation for studying these foundational problems. The current post can be seen as a follow-up to that, giving an idea of the sort of thing which policy selection doesn’t seem to solve.

The argument can also be thought of as an argument against veil-of-ignorance morality of a certain kind.

I don’t think any of this will be really surprising to people who have been thinking about this for a while, but my thoughts seem to have recently gained clarity and definiteness.

Terminology Notes/References

UDT 1.0, on seeing observation o, takes the action a which maximizes the expected utility of “my code outputs action a on seeing observation o”, with expected value evaluated according to the prior.

UDT 1.1, on seeing observation o, takes the action which the globally optimal policy (according to the prior) maps o to. This produces the same result as UDT 1.0 in many cases, but ensures that the agent can hunt stag with itself.

UDT 2 is like UDT 1.1, except it (1) represents policies as programs rather than input-output mappings, and (2) dynamically decides how much time to spend thinking about the optimal policy.

What I’m calling “policy selection” is similar to UDT 2. It has a fixed (small) amount of time to choose a policy before thinking more. However, it could always choose the policy of waiting until it has thought longer before it really chooses a strategy, so that’s not so different from dynamically deciding when to choose a policy.
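To make the distinction concrete, here is a minimal sketch of UDT 1.1-style policy selection in a toy setting. The two observations, the two toy worlds, and the prior weights are all invented for illustration, not taken from the original UDT write-ups. The coordination world pays off only when the agent acts the same way under both observations, so a per-observation UDT 1.0 evaluation could settle on mismatched actions, while enumerating whole policies cannot.

```python
from itertools import product

# Toy setup (illustrative only): two observations, two actions, and a prior over
# "worlds", where each world assigns a utility to a whole observation->action map.
OBS = ["left", "right"]
ACTS = ["A", "B"]

def world_match(policy):
    # Pays off only if the agent picks the same action under both observations:
    # a "hunt stag with yourself" coordination problem.
    return 10.0 if policy["left"] == policy["right"] else 0.0

def world_contrarian(policy):
    return 3.0 if policy["left"] != policy["right"] else 0.0

PRIOR = [(0.8, world_match), (0.2, world_contrarian)]

def udt_1_1_policy():
    """Policy selection: score every whole policy against the prior, pick the best."""
    best_policy, best_value = None, float("-inf")
    for actions in product(ACTS, repeat=len(OBS)):
        policy = dict(zip(OBS, actions))
        value = sum(p * world(policy) for p, world in PRIOR)
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value

if __name__ == "__main__":
    policy, value = udt_1_1_policy()
    print("UDT 1.1 selects:", policy, "with expected utility", value)
```

UDT 2 and policy selection differ from this sketch mainly in representing policies as programs and in deciding how long to think before committing; the brute-force enumeration above assumes thinking time is a non-issue.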

Two Ways UDT Hasn’t Generalized

Logical Uncertainty

UDT 2 tries to tackle the issue of “thinking longer”, which is the issue of logical uncertainty. This is a conceptual problem for UDT, because thinking longer is a kind of updating. UDT is supposed to avoid updating. UDT 2 doesn’t really solve the problem in a nice way.

The problem with thinking for only a short amount of time is that you get bad results. Logical induction, the best theory of logical uncertainty we have at the moment, gives essentially no guarantees about the quality of beliefs at short times. For UDT 2 to work well, it would need early beliefs to at least be good enough to avoid selecting a policy quickly; early beliefs should at least correctly understand how poor-quality they are.

The ideal for UDT is that early beliefs reflect all the possibilities inherent in later updates, so that a policy optimized according to early beliefs reacts appropriately to later computations. Thin priors are one way of thinking of this. So far, nothing like this has been found.

Game Theory

The second way UDT has failed to generalize, and the main topic of this post, is to game theory (i.e., multi-agent scenarios). Cousin_it noted that single-player extensive-form games provided a toy model of UDT. The cases where he says that the toy model breaks down are the cases where I am now saying the concept of UDT itself breaks down. Extensive-form games represent the situations where UDT makes real sense: those with no logical uncertainty (or at least, no non-Bayesian phenomena in the logical uncertainty), and only one agent.

What’s the conceptual problem with extending UDT to multiple agents?

When dealing with updateful agents, UDT has the upper hand. For example, in Chicken-like games, a UDT agent can be a bully, or commit not to respond to bullies. Under the usual game-theoretic assumption that players can determine what strategies each other have selected, the updateful agents are forced to respond optimally to the updateless ones, i.e., give in to UDT bullies / not bully the un-bullyable UDT.
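As a hedged illustration of the bullying dynamic, here is a toy Chicken game in Python. The payoff numbers are my own; the point is just that once the updateless agent’s policy is fixed “first”, a best-responding updateful opponent is forced to swerve.

```python
# Standard Chicken payoffs (illustrative numbers).
PAYOFF = {  # (bully_action, responder_action) -> (bully_payoff, responder_payoff)
    ("dare", "dare"): (-10, -10),
    ("dare", "swerve"): (5, -1),
    ("swerve", "dare"): (-1, 5),
    ("swerve", "swerve"): (0, 0),
}

def best_response(committed_action):
    """The updateful player sees the bully's committed action and maximizes its own payoff."""
    return max(["dare", "swerve"], key=lambda a: PAYOFF[(committed_action, a)][1])

bully_commitment = "dare"  # the updateless agent's policy, fixed "before" (in logical time)
reply = best_response(bully_commitment)
print("Best response to the committed bully:", reply)                    # swerve
print("Payoffs (bully, updateful):", PAYOFF[(bully_commitment, reply)])  # (5, -1)
```

Symmetrically, committing to “dare” no matter what also makes the agent un-bullyable: any opponent who can see that commitment does better by swerving.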

Put simply, UDT makes its decisions “before” other agents. (The “before” is in logical time, though, not necessarily really before.)

When dealing with other UDT agents, however, the UDT agents have to make a decision “at the same time”.

Naively, the coordination mechanism “write your decisions on slips of paper simultaneously, no peeking!” is a bad one. But this is the whole idea of UDT: writing down its strategy under a “no peeking!” condition.

Other decision theories also have to make decisions “at the same time” in game-theoretic situations, but they don’t operate under the “no peeking” condition. Guessing the behavior of the other players could be difficult, but the agent can draw on past experience to help solve this problem. UDT doesn’t have this advantage.

Furthermore, we’re asking more of UDT agents. When faced with a situation involving other UDT agents, UDT is supposed to “handshake” (Löbian handshakes being at least a toy model) and find a cooperative solution (to the extent that there is one).

So far, models of how handshakes could occur have been limited to special cases or unrealistic assumptions. (I’d like to write a full review; I think there’s some non-obvious stuff going on, but for this post I think I’d better focus on what I see as the fundamental problem.) I’d like to see better models, but I suspect that significant departures from UDT will be required.

Even if you don’t try to get UDT agents to cooperate with each other, though, the conceptual problem remains: UDT is going in blind. It has a much lower ability to determine what equilibrium it is in.

I think there is a deep relationship between the issue with logical uncertainty and the issue with game theory. A simple motivating example is Agent Simulates Predictor, which appears to be strongly connected to both issues.

How does Equilibrium Selection Work?

The problem I’m pointing to is the problem of equilibrium selection. How are two UDT agents supposed to predict each other? How can they trust each other?

There are many different ways to think about agents ending up in game-theoretic equilibria. Most of them, as I understand it, rely on iterating the game so that the agents can learn about it. This iteration can be thought of as really occurring, or as occurring in the imagination of the players (an approach called “fictitious play”). Often, these stories result in agents playing correlated equilibria, rather than Nash equilibria. However, that’s not a very big difference for our purposes here: correlated equilibria only allow the DD outcome in Prisoner’s Dilemma, just like Nash.
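For concreteness, here is a minimal sketch of fictitious play on the Prisoner’s Dilemma (payoff numbers are illustrative but have the standard PD ordering). Each player best-responds to the empirical frequency of the other’s past moves, and the process quickly settles into mutual defection, the only equilibrium outcome available.

```python
from collections import Counter

PD = {  # (my_action, their_action) -> my_payoff; standard PD ordering
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def best_response(opponent_history):
    # Best-respond to the empirical distribution of the opponent's past play.
    counts = Counter(opponent_history) or Counter({"C": 1})  # arbitrary belief on round 0
    total = sum(counts.values())

    def expected(a):
        return sum(PD[(a, b)] * counts[b] / total for b in ("C", "D"))

    return max(("C", "D"), key=expected)

hist1, hist2 = [], []
for _ in range(50):
    a1, a2 = best_response(hist2), best_response(hist1)
    hist1.append(a1)
    hist2.append(a2)

print("Final joint plays:", list(zip(hist1, hist2))[-3:])  # [('D', 'D'), ('D', 'D'), ('D', 'D')]
```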

There’s something absurd about using iterated play to learn single-shot strategies, a problem Yoav Shoham et al. discuss in If multi-agent learning is the answer, what is the question? If the game is iterated, what stops agents from taking advantage of its iterated nature?

That’s the essence of my argument in In Logical Time, All Games are Iterated Games: in order to learn to reason about each other, agents use fictitious play, or something similar. But this turns the game into an iterated game.

Turning a game into an iterated game can create a lot of opportunity for coordination, but the Folk Theorem says that it also creates a very large equilibrium selection problem. The Folk Theorem indicates that rational players can end up in very bad outcomes. Furthermore, we’ve found this difficult to avoid in decision algorithms we know how to write down. How can we eliminate the “bad” equilibria and keep only the “good” possibilities?

What we’ve accomplished is the reduction of the “handshake” problem to the problem of avoiding bad equilibria. (We could say this turns prisoner’s dilemma into stag hunt.)
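Here is a hedged sketch of that reduction. The handshake is modeled crudely as “cooperate iff the other player also chose the handshake policy”, and attempting the handshake carries a small reasoning cost EPS; both the cost and the payoff numbers are my own illustrative assumptions, not from the post. The induced meta-game over policies then has two equilibria, mutual handshaking (good) and mutual defection (bad), which is the stag hunt structure.

```python
# PD payoffs for the row player (illustrative), plus a small assumed cost for the handshake.
PD = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
EPS = 0.1  # assumed cost of the proof search / policy inspection behind the handshake

def meta_payoffs(policy_row, policy_col):
    # The handshake cooperates exactly when both players chose the handshake policy.
    a_row = "C" if policy_row == "handshake" == policy_col else "D"
    a_col = "C" if policy_col == "handshake" == policy_row else "D"
    cost_row = EPS if policy_row == "handshake" else 0.0
    cost_col = EPS if policy_col == "handshake" else 0.0
    return PD[(a_row, a_col)] - cost_row, PD[(a_col, a_row)] - cost_col

for p_row in ("handshake", "defect"):
    for p_col in ("handshake", "defect"):
        print((p_row, p_col), "->", meta_payoffs(p_row, p_col))
# (handshake, handshake) -> (2.9, 2.9)   both handshaking is an equilibrium
# (defect, defect)       -> (1.0, 1.0)   but so is mutual defection: a stag hunt
```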

Handshake or no handshake, however, the “fictitious play” view suggests that equilibrium selection requires learning. You can get agents into equilibria without learning, but the setups seem artificial (so far as I’ve seen). This requires updateful reasoning in some sense. (Although it can be a logical update only; being empirically updateless still seems wiser).

Logical Uncertainty & Games

Taking this idea a little further, we can relate logical uncertainty and games via the following idea:

Our uncertain expectations are a statistical summary of how things have gone in similar situations in the (logical) past. The way we react to what we see can be thought of as an iterated strategy which depends on the overall statistics of that history (rather than a single previous round).

I’m not confident this analogy is a good one; in particular, the way policies have to depend on statistical summaries of the history rather than on specific previous rounds is a bit frustrating. However, the analogy goes deeper than I’m going to spell out here. (Perhaps in a different post.)

One interesting point in favor of this analogy: it also works for modal agents. The proof operator, □, is like a “prediction”: proofs are how modal agents think about the world in order to figure out what to do. So □X is like “the agent thinks X”. If you look at how modal agents are actually computed in the MIRI guide to Löb’s theorem, it looks like an iterated game, and □ looks like a simple kind of summary of the game so far. On any round, □X is true if and only if X has been true in every previous round. So, you can think of □X as “X has held up so far”; as soon as X turns out to be false once, □X is never true again.

In this interpretation, FairBot (the strategy of cooperating if and only if the other player provably cooperates) becomes the “Grim Trigger” strategy: cooperate on the first round, and cooperate on every subsequent round so long as the other player has cooperated so far. If the other player ever defects, switch to defecting, and never cooperate again.
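A minimal sketch of that reading, with the names and the round-based modeling of □ invented for illustration: if “□X at round n” is taken to mean “X held on every earlier round”, then FairBot computed this way literally plays Grim Trigger.

```python
def fair_bot_as_grim_trigger(my_history, their_history):
    # Cooperate iff the opponent has cooperated on every previous round.
    # (Vacuously true on round 0, mirroring box(X) holding at the base stage.)
    return "C" if all(a == "C" for a in their_history) else "D"

def defect_bot(my_history, their_history):
    return "D"

def run(agent1, agent2, rounds=5):
    h1, h2 = [], []
    for _ in range(rounds):
        a1, a2 = agent1(h1, h2), agent2(h2, h1)
        h1.append(a1)
        h2.append(a2)
    return h1, h2

print(run(fair_bot_as_grim_trigger, fair_bot_as_grim_trigger))  # mutual cooperation every round
print(run(fair_bot_as_grim_trigger, defect_bot))                # one cooperation, then defection forever
```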

A take-away for the broader purpose of this post could be: one of the best models we have of the UDT “handshake” is the Grim Trigger strategy in disguise. This sets the tone nicely for what follows.

My point in offering this analogy, however, is to drive home the idea that game-theoretic reasoning requires learning. Even logic-based agents can be understood as running simple learning algorithms, “updating” on “experience” from (counter)logical possibilities. UDT can’t dance with its eyes closed.

This is far from a proof of anything; I’m just conveying intuitions here.

What UDT Wants

One way to look at what UDT is trying to do is to think of it as always trying to win a “most meta” competition. UDT doesn’t want to look at any information until it has determined the best way to use that information. UDT doesn’t want to make any decisions directly; it wants to find optimal policies. UDT doesn’t want to participate in the usual game-theoretic setup where it (somehow) knows all other agents’ policies and has to react; instead, it wants to understand how those policies come about, and act in a way which maximally shapes that process to its benefit.

It wants to move first in every game.

Actually, that’s not right: it wants the option of moving first. Deciding earlier is always better, if one of the options is to decide later.

It wants to announce its binding commitments before anyone else has a chance to, so that everyone has to react to the rules it sets. It wants to set the equilibrium as it chooses. Yet, at the same time, it wants to understand how everyone else will react. It would like to understand all other agents in detail, their behavior a function of itself.

So, what happens if you put two such agents in a room together?

Both agents race to decide how to decide first. Each strives to understand the other agent’s behavior as a function of its own, to select the best policy for dealing with the other. Yet, such examination of the other needs to itself be done in an updateless way. It’s a race to make the most uninformed decision.

I claim this isn’t a very good coordination strategy.

One issue is that jumping up a meta-level increases the complexity of a decision. Deciding a single action is much easier than deciding on a whole policy. Some kind of race to increasing meta-levels makes decisions increasingly challenging.

At the same time, the desire for your policy to be logically earlier than everyone else’s, so that they account for your commitments in making their decisions, means you have to make your decisions faster and in simpler, more predictable ways.

The expanding meta-space and the contracting time do not seem like a good match. You have to make a more complicated decision via less-capable means.

Two people trying to decide policies early are just like two people trying to decide actions late, but with more options and less time to think. It doesn’t seem to solve the fundamental coordination problem.

The race for most-meta is only one possible intuition about what UDT is trying to be. Perhaps there is a more useful one, which could lead to better generalizations.

Veils of Ignorance

UDT tries to coordinate with itself by stepping behind a veil. In doing so, it fails to coordinate with others.

Veil-of-ignorance moral theories describe multiagent coordination resulting from stepping behind a veil. But there is a serious problem. How can everyone step behind the same veil? You can’t tell what veil everyone else stepped behind if you stay behind your own veil.

UDT can successfully self-coordinate in this way because it is very reasonable to use the common prior assumption with a single agent. There is no good reason to suppose this in the multiagent case. In practice, the common prior assumption is a good approximation of reality because everyone has dealt with essentially the same reality for a long time and has learned a lot about it. But if we have everyone step behind a veil of ignorance, there is no reason to suppose they know how to construct the same veil as each other: they’re ignorant!

Is UDT Almost Right, Nonetheless?

I find myself in an awkward position. I still think UDT gets a lot of things right. Certainly, it still seems worth being updateless about empirical uncertainty. It doesn’t seem to make sense for logical uncertainty… but treating logical and empirical uncertainty in such different ways is quite uncomfortable. My intuition is that there should not be a clean division between the two.

One possible reaction to all this is to try to learn to be updateless. I.e., don’t actually try to be updateless, but do try to get right the problems which UDT got right. Don’t expect everything to go well with a fixed Bayesian prior, but try to specify the learning-theoretic properties which approximate that ideal.

Would such an approach do anything to help multiagent coordination? Unclear. Thermodynamic self-modification hierarchies might work with this kind of approach.

In terms of veil-of-ignorance morality, it seems potentially helpful. Take away everything we’ve learned, and we don’t know how to cooperate from behind our individual veils of ignorance. But if we each have a veil of ignorance which is carefully constructed, a learned-updateless view which accurately reflects the possibilities in some sense, they seem more likely to match up and enable coordination.

Or perhaps a more radical departure from the UDT ontology is needed.