Asymptotic Decision Theory (Improved Writeup)

ADT (asymptotic decision theory, initially detailed in this paper) was a proposed decision theory with logical inductors, developed after EDT with logical inductors and exploration steps. It is a possible candidate for conceptual progress in decision theory, but some basic questions about its performance are still unsettled, and there are several issues with the current implementation of it.

Definitions:

A market is a computable function of type $S \to [0,1]$ (where $S$ is the set of sentences of math) that will be denoted as $\mathbb{P}$. We will be considering logical inductors in this setting, and in particular, each finite stage $\mathbb{P}_t$ of a logical inductor is a market.

Let $A$ be a finite set of actions. An agent is a computable sequence of computable functions of type $\mathbb{N} \times (S \to [0,1]) \to \Delta A$. It takes a timestep and a probability distribution, and selects a distribution over actions. Because a logical inductor associates each timestep with a market, specifying the logical inductor and the agent specifies a sequence of actions. Because we will be assuming some fixed logical inductor in the background, we will suppress it in the notation and refer to the action produced by an agent $\pi$ at time $t$ as $\pi_t$ or $\pi(t)$. Randomization can be handled by using the diagonalization sentences used to define exploration (this acts in such a way that no trader in the inductor is able to reliably predict what the agent does), or it can be handled by having one of the actions be to call a random number generator in the environment.
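To fix intuitions, here is one way the types above could be written down; the names are purely illustrative, not the post's notation.

```python
# Schematic types for the objects defined so far (illustrative names only).
from typing import Callable, Mapping

Sentence = str                                # a sentence of the underlying language
Market = Callable[[Sentence], float]          # prices in [0, 1]; one market per timestep
Action = str                                  # an element of the finite action set A
ActionDist = Mapping[Action, float]           # a distribution over actions
Agent = Callable[[int, Market], ActionDist]   # (timestep, market) -> distribution over A
```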

An embedder is a sequence of randomized functions of type $\mathrm{Agents} \to [0,1]$ (denoted by $E$ or $E'$, with the function at time $t$ being denoted by $E_t$ or $E'_t$). Let $E_t(\pi)$ denote the random variable that corresponds to the environment using a uniform random distribution of bitstrings for its randomness tape. An embedder must have the probability distribution of $E_t(\pi)$ be computably approximable on computable inputs.

An argmax agent is an agent that takes a finite set of other agents $\Pi$, and a single embedder $E$, and on turn $t$ outputs the action of the agent $\operatorname{argmax}_{\pi \in \Pi} \mathbb{E}_t[E_t(\pi)]$. $\mathbb{E}_t$ is defined as it usually is for logical inductors. Notice that, although $E_t(\pi)$ may be very difficult to compute, putting it inside an expectation allows a logical inductor to output a decent guess as to what it is anyways. The agents all had to be computable in order for the argmax agent to duplicate their behavior at arbitrary turns. There is a dependence on the set $\Pi$, the embedder $E$, and the logical inductor, but because the logical inductor and $\Pi$ will be treated as fixed, we will write the time-$t$ action produced by the argmax agent as $\operatorname{argmax}_E(t)$.
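As a rough sketch of the argmax agent just described (with invented helper names: `expect_embedder(t, E, agent)` stands in for the inductor's estimate $\mathbb{E}_t[E_t(\pi)]$):

```python
# A minimal, illustrative sketch of the argmax agent; not the paper's construction.

def make_argmax_agent(agents, embedder, expect_embedder):
    def argmax_agent(t, market):
        # Score each candidate agent by the inductor's current expectation of the
        # utility the embedder assigns to it; this is a cheap guess even when the
        # embedder itself is very expensive to evaluate.
        best = max(agents, key=lambda a: expect_embedder(t, embedder, a))
        # Duplicate the chosen (computable) agent's behavior on this turn.
        return best(t, market)
    return argmax_agent
```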

$U$ will denote the “true environment”, which is a sequence of values $U_t$ in $[0,1]$.

Finally, the logical inductor will be required to have “fast feedback”. This means that after each turn, the following information will show up in the deductive process at time $t+1$.

1: (an interval containing) the true value of $U_t$.

The deductive process is the sequence of theorems that the logical inductor sees. The reason that the true utility has to be in an interval, instead of reported directly, is because the true utility may have arbitrarily many digits, which hampers the ability of traders to use that information to judge bets on how $U_t$ will turn out. This condition is present to ensure that, if $U_t$ is generated by taking $E_t(\pi)$ and feeding it a uniform random bitstring, then $\mathbb{E}_t[U_t] \eqsim_t \mathbb{E}_t[E_t(\pi)]$ (i.e., the difference between the two expectations limits to $0$).

Theorem 1: If $U_t$ is generated on each turn by running $E_t(\pi)$ with a bitstring sampled uniformly at random, then $\mathbb{E}_t[E_t(\pi)] \eqsim_t \mathbb{E}_t[U_t]$.

We will apply the basic trading strategy from Theorem 4.5.9 in the logical induction paper. In this case, if the left-hand side is overpriced by $\epsilon$ infinitely often (by a failure of convergence; the same argument can be applied symmetrically to the right-hand side being overpriced infinitely often), the traders buy a fraction of a share of $U_t$ and sell a fraction of a share of $E_t(\pi)$, and keep doing it until 1 share has been moved in total. This takes care of having $\mathbb{P}$-generable trade magnitudes. According to the law of large numbers, with probability 1, there is a finite time after which all of the sub-traders have the value of their pile of shares within $\epsilon/2$ of its expected value, so the initial traders in the efficiently emulatable sequence of traders can be clipped off, and all the other sub-traders get $\epsilon/2$ or more value from selling the high-priced expectation and buying the low-priced one, so the necessary conditions for the $\epsilon$-ROI Lemma are fulfilled and the resulting trader exploits the market.

Finally, we will define dominance. An agent $\pi$ dominates an agent $\pi'$ on an embedder $E$ if

$$\liminf_{t\to\infty}\left(\mathbb{E}_t[E_t(\pi)]-\mathbb{E}_t[E_t(\pi')]\right)\ge 0.$$

This could be thought of as $\pi$ having sublinear regret relative to $\pi'$.

What’s ADT?:

The asymptotic decision theory algorithm works as follows.

Inputs: A sequence of numbers asymptotically decreasing to $0$, denoted $\epsilon_t$; a finite list of embedders $\mathcal{E}$; a finite list of agents $\Pi$; and a logical inductor with fast feedback on the true environment $U$. On each turn $t$, the algorithm keeps only those embedders $E \in \mathcal{E}$ whose prediction about the algorithm itself is within $\epsilon_t$ of the inductor's estimate of the true utility, picks the surviving embedder $E$ with the highest $\mathbb{E}_t[E_t(\operatorname{argmax}_E)]$, and takes the action $\operatorname{argmax}_E(t)$.

Ok, so what does this do intuitively? Well, it takes all the embedders, and runs them through a “reality filter” which checks whether the embedder, run on the agent itself, replicates what the true reality actually does, in expectation. For all the embedders that pass the reality filter, it uses the inductor to assess what score argmax gets in that embedder, takes the one that gets the highest score, and then implements argmax for that embedder. In short, it optimistically chooses the embedder where the highest score is attainable (that isn't wrong about reality), and optimizes for that. If it got a good score, that's fine. If it got a lower score than it was expecting, that embedder is less likely to pass the reality filter next time, because the utility it actually received undershot what the embedder predicted. There isn't a problem (as there is in standard logical inductor decision theory) of systematically underscoring an embedder forever, because since an embedder is a randomized function, it's possible to actually approximate a distribution over what it outputs, so argmax will eventually catch on and start taking (predicted-to-be) optimal actions, and the value of these can also be empirically assessed.
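A minimal sketch of that loop, with invented helper names (`expect_embedder(t, E, agent)` playing the role of $\mathbb{E}_t[E_t(\pi)]$ and `expect_true_utility(t)` playing the role of $\mathbb{E}_t[U_t]$); this is just the intuition above written out, not the official pseudocode:

```python
def adt_action(t, embedders, agents, eps, expect_embedder, expect_true_utility,
               market, self_agent):
    # Reality filter: keep the embedders whose prediction about ADT itself
    # (self_agent stands in for the self-referential fixed point) is within
    # eps[t] of the inductor's estimate of the true utility.
    passing = [E for E in embedders
               if abs(expect_embedder(t, E, self_agent) - expect_true_utility(t)) <= eps[t]]

    # Optimistic choice: among the survivors (assumed nonempty here), pick the
    # embedder in which the best available agent is expected to score highest...
    def best_score(E):
        return max(expect_embedder(t, E, a) for a in agents)
    chosen = max(passing, key=best_score)

    # ...and implement argmax for that embedder on this turn.
    best_agent = max(agents, key=lambda a: expect_embedder(t, chosen, a))
    return best_agent(t, market)
```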

Problems with ADT:

The most prominent undesirable feature of this is that it's restricted to a finite set of embedders. Optimistic choice fails very badly on an infinite set of embedders, because we can consider an infinite sequence of embedders that are like “pressing the button dispenses 1 utility forever”, “pressing the button delivers an electric shock the first time, and then dispenses 1 utility forever”… “pressing the button delivers an electric shock for the first n times, and then dispenses 1 utility forever”… “pressing the button just always delivers an electric shock”. Optimistic choice among embedders keeps pressing the button, because, although it keeps getting shocked, there's always an embedder that promises that the agent's fortunes will turn around on the next turn.
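Here is a toy model of that failure mode (all details invented for illustration): at every step, some not-yet-falsified embedder still promises that the very next press pays off, so the optimistic chooser presses forever and never collects any utility.

```python
# Toy model: embedder E_n says "the first n presses shock you (utility 0),
# every later press pays 1 forever". Optimistic choice keeps pressing.

def optimist_presses(steps=50):
    shocks_so_far = 0
    total_utility = 0.0
    for _ in range(steps):
        # Every E_n with n < shocks_so_far has been falsified by now, but
        # E_{shocks_so_far} has predicted the history perfectly and promises
        # that pressing right now yields 1, versus 0 for not pressing.
        promised_if_press = 1.0
        value_of_not_pressing = 0.0
        if promised_if_press > value_of_not_pressing:
            shocks_so_far += 1       # in the true environment, every press is a shock
            total_utility += 0.0
    return shocks_so_far, total_utility

print(optimist_presses())  # (50, 0.0): it never stops pressing
```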

Also, this optimizes for each choice individually, and does not naturally deal with sequences of choices, which is necessary to handle general environments.

Good Properties of ADT:

A nice feature of this is that exploration steps are not required for good behavior, which is important because counterfactuals are not conditionals.

Another nice feature of this is that it gets ASP (agent simulates predictor) right, which is a surprisingly hard decision theory problem to do well on. When you are rewarded with a thousand dollars for picking action 2, and paid a million dollars when a (less powerful) predictor predicts that you pick action 1, the best move is to just take action 1 to win the million dollars, but standard logical inductor decision theory converges to taking action 2, because the predictor isn't powerful enough to predict the rare exploration steps, so the agent will learn that action 2 always gets it more money than action 1, and dial up the probability of taking action 2 until it ends up getting a thousand dollars on each round and missing the larger payoff.
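Concretely, assuming the standard ASP payoffs that the “thousand”/“million” figures refer to (the predictor only puts the million in play when it predicts action 1), the per-round payoffs look like:

```latex
% Assumed payoff table for the ASP game described above (standard amounts).
\begin{array}{l|cc}
                     & \text{predicted action 1} & \text{predicted action 2} \\ \hline
\text{take action 1} & \$1{,}000{,}000           & \$0                       \\
\text{take action 2} & \$1{,}001{,}000           & \$1{,}000
\end{array}
```

Exploration into action 2 lands in the left column (the predictor didn't see the exploration coming) and nets the extra thousand, which is how the exploring agent talks itself into the bottom-right cell.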

However, because the embedder that represents the true environment is plugging things like “always pick 1” and “always pick 2” into the environment, argmax on that environment will converge to copying the “always pick 1” strategy, so the logical inductor learns that argmax will always pick 1, and then the true embedder claims that a million dollars are attainable. If ADT learns to use the true embedder, then it will converge to always one-boxing, which is the desired behavior.

This same feature also allows it to win on Parfit’s Hitchhiker and XOR Blackmail (I think, 90% probability).

This does not pay up on counterfactual mugging.

The original paper on Asymptotic Decision Theory had much more restrictive conditions for good behavior, such as the embedders having to be continuous, all the agents in $\Pi$ having to converge to a single distribution in the limit, all the embedders in $\mathcal{E}$ having to converge to a single payoff in the limit when fed a convergent agent, and having to use a continuous analogue of argmax. In combination, this meant that most of the games you could define (even “predict this sequence of bits”) were outside of the class of problems where optimality is guaranteed.

Why Haven’t I Heard of this Before?

Well, for quite a while we thought it was bad because we thought it crashed into itself in games of chicken, so it got tabled. I’ll now go over why the argument in that post is false.

To begin, note that there’s a crucial difference between the opponent in the “Spoofer” embedder and in the “Delusion” embedder. In the first case, the only embedder that the opponent is optimizing over is the one with “go-straight bot” (or “swerve bot”, depending on what you substitute in). In the second case, the opponent is optimizing over the same three embedders that you are using. ADT with only the “vs. go-straight bot” embedder converges to swerving, and ADT with only the “vs. swerve bot” embedder converges to going straight. So, in the “Spoofer” embedder, the argmax agent will converge to thinking that “go straight” is the best thing to plug in, because it gets a straight-swerve outcome.

Assume for the sake of contradiction that the agent (with the 3 embedders listed) converges to going straight. Then on the true environment, it will keep getting straight-straight outcomes, while the “Spoofer” embedder keeps predicting that you’ll get straight-swerve outcomes. So, the “Spoofer” embedder eventually fails the reality filter, so the agent will learn to not use it. Both of the remaining embedders advise swerving as getting the best outcome, so the agent converges to swerving, and we have a contradiction.

What went wrong in the original reasoning? As far as I can tell, it originated from accidentally treating “ADT with only one embedder” and “ADT with the three listed embedders” as the same, because the same symbol was used to denote both of them.

Evil Problems and Theorem Failure:

To make things more confusing, the second theorem in the ADT paper, about argmax dominating all agents in $\Pi$, is wrong, and not in a fixable way. There’s an explicit counterexample.

It will be instructive to take a detour to talk about drnickbone’s evil problem. Defining a “fair” problem in decision theory is necessary, because you can’t just say that a decision theory is the best on all problems. Consider the problem where you are up against an opponent who just really hates some particular decision theory, and penalizes anyone that uses it to select actions. Of course, the decision theory of your choice will fail on this problem. So, instead, we would hope that there’s some notion of a “fair” problem, such that we can say that a decision theory attains equal or greater utility than all competitors on all fair problems.

An initial attempt at defining a fair problem was one where all agents that select the same action get the same result. The problem is fully action-determined, and your payoff doesn’t depend at all on your ritual of cognition. This is the notion of “fair” used in the original paper.

The Evil problem is a decision theory problem that is completely action-determined, and fair by the old definition of “fair”, where YDT (the decision theory of your choice, hereafter abbreviated as YDT for “Your Decision Theory”) gets systematically lower utility than most other competing decision theories. Consider a variant of Newcomb’s problem where there are two boxes, and you may select one of them. Omega’s decision process is to predict what box YDT would take, and then put 1000 dollars in the other box, and nothing in the box that YDT takes. Also, one of the boxes is lustrous and sparkly and would make a nice addition to your box collection, and you value that box at 10 dollars.

Now, you are like “well, I’m using YDT, and Omega is really accurate, and I left my random number generator at home, so no matter which box I pick, I’ll get 0 dollars. Might as well go for the shiny box”. And all the other decision theories can run through that reasoning and take the blank box, which contains 1000 dollars, and walk away substantially richer than before. Note that any agent that takes the same box gets the same payoff. So it is intuitively unfair, despite being completely action-determined.

This same sort of problem, when put into embedder form, leads to argmax systematically underperforming “take the blank box”. The argmax agent will converge to taking the shiny box about half of the time. This is because, since it is possible to go back and compute what “blank-box” and “shiny-box” would have yielded on turns where you didn’t take that box, you keep thinking that they’d get decent payoffs. In expectation, “blank-box” gets 502.5 dollars, and “shiny-box” gets 502.5 dollars, while “argmax” gets 5.025 dollars. This is an issue for all decision theories that operate by treating the environment as a function, and plugging actions into it. Such a decision theory will always consider the action it didn’t take to have a higher utility, and this is actually true, because there are “objective counterfactuals” that can be checked. In particular, condition 1 in my probabilistic tiling post was required specifically to rule out these sorts of shenanigans. Check the discussion there of why this condition can fail; it’s talking about how these sorts of evil problems cause the expected utility of “I take an action” to systematically deviate from the expected utility of the action you actually take.
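A sketch of where that near-even mixing comes from, writing $M$ for the money Omega hides and $v$ for the value of the shiny box, and assuming Omega's prediction tracks which box argmax actually takes:

```latex
% If argmax takes the shiny box with probability p, then plugging the two
% constant strategies into the embedder gives the estimates
%   E[blank-box] = p M,    E[shiny-box] = (1-p)(M+v) + p v = M + v - p M,
% and argmax mixes at the point where these tie:
\begin{align*}
  pM &= M + v - pM \\
  p  &= \frac{M+v}{2M} \approx \tfrac{1}{2}.
\end{align*}
% Each constant strategy is therefore estimated at (M+v)/2, roughly half the
% money, while argmax itself always ends up with the box Omega emptied and
% collects at most v per round.
```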

To return to ADT: since it’s treating the environment as a function that it can plug stuff into, it’s vulnerable to this exploit.

So, what went wrong in the original proof that argmax dominates everything in $\Pi$? It was the (not explicitly listed) step that went from $\mathbb{E}_t[E_t(\pi)] \eqsim_t \mathbb{E}_t[E_t(\pi')]$ (which is true because $\pi$ and $\pi'$ are in a maximal equivalence class) to $\mathbb{E}_t[E_t(\operatorname{argmax})] \eqsim_t \mathbb{E}_t[E_t(\pi)]$ (which is a necessary step to get argmax to have the same property). However, just because the utilities are the same doesn’t mean the probability distributions over actions are the same!! In fact, given the continuous version of argmax in that paper, it’s possible to come up with a much more mundane case where it fails that definitely isn’t an evil problem. Consider the problem where you must make the same move as your opponent, and your probability distribution over moves is the same as your opponent’s probability distribution over moves. Let $\pi$ and $\pi'$ be the agents that just implement the two constant strategies. $\mathbb{E}_t[E_t(\pi)] \eqsim_t \mathbb{E}_t[E_t(\pi')] \eqsim_t 1$, but, by the definition of continuous argmax that was given in the paper, argmax would converge to a 50/50 mix of the two strategies, since they are of equal value. When this is substituted into both your and your opponent’s move slot, it produces an expected utility of $\frac{1}{2}$.
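For concreteness, here is the computation behind that last number, with a match paying $1$ and a mismatch paying $0$, and the two randomizations treated as independent:

```latex
% Continuous argmax plays the first move with probability p, and the opponent's
% distribution is constrained to be the same, so
\begin{align*}
  \mathbb{E}[U] &= p^2 + (1-p)^2,
\end{align*}
% which is 1 for either pure strategy (p = 0 or p = 1) but only 1/2 at the
% 50/50 mix that continuous argmax converges to.
```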

What is Fairness?:

These evil problems are a problem for showing that argmax dominates everything in $\Pi$. While attempting a proof, I came up with a possible alternate definition of “fair”, but it’s more accurate to call it “regret-free”. An embedder $E$ is “regret-free” iff, for every embedder $E'$,

$$\liminf_{t\to\infty}\left(\mathbb{E}_t[E_t(\operatorname{argmax}_E)]-\mathbb{E}_t[E_t(\operatorname{argmax}_{E'})]\right)\ge 0.$$

Intuitively, argmax doesn’t regret its decision, in the sense of not wishing it was optimizing for some specific other world to get a more fortunate sequence of actions. The agent doesn’t wish it was deluded about the world in order to be decorrelated with whatever is punishing its action sequence. One effect of this definition is that it shows that (according to the expectation of the agent), argmax will do better than any particular strategy in the strategy set, because you can consider an environment that rewards that exact strategy being played on every round. Another notable effect is that it rules out the original Evil problem as “unfair”, but the modified variant, where the action of the agent is substituted into both your action and Omega’s prediction, is fair. So there’s still a regret-free/“fair” environment that can model the Evil problem faithfully, but it says that the proper thing to judge against isn’t environments where Omega penalizes YDT, but rather environments where Omega penalizes the decision theory of whoever is picking the boxes. And of course, YDT, argmax, and everything else also gets either 0 or 10 dollars worth of value in this modified problem, and balance is restored.

Sadly, this particular definition of “fair” is inadequate for general use, because it is possible to construct environments where some arbitrary decision theory other than argmax systematically gets lower utility than argmax. This can be fixed by making this definition relative to the agent under consideration, but then you run into the problem of simple agents calling (intuitively fair) games “unfair relative to me”. There will be another post about this.

Lack-of-Proof-of-Optimality for ADT:

We might try to go for a theorem like

Conjecture 1: If all environments in $\mathcal{E}$ are regret-free, then, on all embedders $E \in \mathcal{E}$ s.t. $\mathbb{E}_t[E_t(\mathrm{ADT})] \eqsim_t \mathbb{E}_t[U_t]$, $\mathrm{ADT}$ dominates $\operatorname{argmax}_E$.

In short, ADT seems like it will learn to do as well as argmax on all environments that don’t get ruled out. This conjecture is still open! I thought I had a proof, and it failed, and then I thought I had a disproof, and it failed, and then I thought I had another disproof, and it turned out that I couldn’t show that one of the environments was regret-free. Maybe some extra conditions are required for this conjecture to be a theorem, but I suspect that the environment is actually regret-free and the conjecture is false, pointing to a genuine hole in ADT.

I’ll describe the first and last of these failed attempts here, in the hopes that they will provide help on how to conclusively prove or disprove the conjecture.

Failed Attempt 1:

To begin with, if the true environment’s embedder $E^*$ is in $\mathcal{E}$, and one other embedder $E$ is “clearly in the lead” (i.e., $\mathbb{E}_t[E_t(\operatorname{argmax}_E)] > \mathbb{E}_t[E^*_t(\operatorname{argmax}_{E^*})] + \epsilon$ for some fixed $\epsilon$), and this opportunity recurs infinitely often, it is possible to money-pump this. In particular, since the reality filter for the true environment converges to $0$, $E^*$ always passes it in the limit. The money pump works by buying one share in $\mathbb{E}_t[E^*_t(\operatorname{argmax}_{E^*})]$, selling one share in $\mathbb{E}_t[E_t(\operatorname{argmax}_E)]$, selling one share in $\mathbb{E}_t[E^*_t(\mathrm{ADT})]$, and buying one share in $\mathbb{E}_t[E_t(\mathrm{ADT})]$. Because both of the latter have prices very similar to each other (because both $E$ and $E^*$ pass the reality filter), their price cancels out, yielding $\epsilon$ dollars upfront from the first buy-sell pair, along with the fact that $E^*$ is a regret-free environment, so $\mathbb{E}_t[E^*_t(\mathrm{ADT})]$ is priced at or below $\mathbb{E}_t[E^*_t(\operatorname{argmax}_{E^*})]$ in the limit. Now, because the embedder $E$ will be actually picked, all the purchased and sold shares have value that cancels out to $0$. (Well, actually, it’s a bit more complicated than that; you need to add an additional condition that the information about what ADT copies shows up in the deductive process in a timely manner, because logical inductors judge the value of a trader by their worst-case value.) Anyways, this pumps free money from each time where one particular embedder is “in the lead” above the true environment.

However, this doesn’t rule out cases where multiple embedders that aren’t the true environment are “in the lead”. Intuitively, it’s possible to have a pileup where multiple embedders have approximately the same estimate of their argmax value, and of what ADT actually does in them. There’s another case where the true environment is “in the lead” along with several others. The obvious fix to this is to have the trader buy and sell conditional contracts, so the money-pump goes through for whichever embedder is actually selected, and all the others cancel out to $0$ value. The point where this fails is that it wouldn’t necessarily be buying the shares as originally stated. In order to get the conditional share, it would be purchasing the share at the price of the corresponding conditional expectation, which may not equal the unconditional one. Attempts to construct more sophisticated money-pumps all met with failure, due to this specific issue where regret-free environments may say that argmax does worse when ADT copies it.

Failed Attempt 2:

We will give a set of environments, and see that two of the embedders will converge to passing the reality filter, but ADT dominates neither of the argmax agents (however, one of the environments is not known to be regret-free).

In particular, let the true environment be a game of chicken against the same agent. $A$ will consist of the three actions “swerve”, “straight”, and “consult a random number generator in the environment to swerve some fixed fraction of the time” (augmenting $A$ to allow consulting the random number generator with different randomization probabilities doesn’t change the result). $\mathcal{E}$ will consist of the two environments $E_1$ (which substitutes the action into both your action slot and the opponent’s action slot) and $E_2$ (which substitutes the action in your action slot, and the opponent is you). The utility function is $1$ for going straight while your opponent swerves, $0$ for crashing, and $0.5$ for swerving.
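Written as a payoff matrix for the row player, with the utilities above:

```latex
% Row player's payoffs in the chicken game
% (1 for straight vs. swerve, 0 for a crash, 0.5 for swerving):
\begin{array}{l|cc}
                & \text{opponent swerves} & \text{opponent goes straight} \\ \hline
\text{swerve}   & 0.5 & 0.5 \\
\text{straight} & 1   & 0
\end{array}
```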

To begin with, $\mathbb{E}_t[E_{1,t}(\mathrm{ADT})] \eqsim_t \mathbb{E}_t[E_{2,t}(\mathrm{ADT})]$, because both environments are identical when you plug in the agent itself, so both of them pass the reality filter. Argmax on the embedder $E_1$ converges to taking the randomized “swerve part of the time” action.

Assume that ADT learns to use the embedder $E_1$, predictably, infinitely often (that is, infinitely often, the market assigns a probability very near 1 to ADT using the embedder $E_1$). Then, on this sequence, it will converge to swerving with the randomizer’s fixed probability, and in that case, ADT will learn to always use the embedder $E_2$ in response, and go straight, because this passes the reality filter and offers a payoff equal to the probability that the opponent (ADT itself) swerves, which is greater than the payoff that $E_1$ offers. So, we have a contradiction.

Assume that ADT learns to use the embedder $E_2$, predictably, infinitely often. Then, if the probability ADT assigns to itself going straight is below $\frac{1}{2}$, it goes straight, and if the probability ADT assigns to itself going straight is above $\frac{1}{2}$, it swerves. So the probability ADT assigns to itself going straight on that subsequence will converge to $\frac{1}{2}$, which means that $E_2$’s claimed payoff converges to $\frac{1}{2}$, and because $E_1$ promises a greater utility, it will learn to use the embedder $E_1$ in those cases. So again, we have a contradiction.

So, according to the probabilities of the logical inductor itself, it is unpredictable what ADT will do, and by calibration, this applies to what ADT does in reality. Since differences in $\mathbb{E}_t[E_{1,t}(\operatorname{argmax}_{E_1})]$ and $\mathbb{E}_t[E_{2,t}(\operatorname{argmax}_{E_2})]$ would allow predicting which embedder ADT will pick, the two must converge to each other, and whatever the latter converges to, the former must converge to as well.

Now (I’m moderately sure it’s the only solution, but I don’t have a proof), there is a stable solution where the agent doesn’t know which embedder it picks, but with some probability it picks $E_1$ (and plays the randomized action), and with the remaining probability it picks $E_2$ (and argmax for $E_2$ always picks “go straight”, so it runs into itself), for an expected utility strictly below the common value that both embedders claim argmax for them specifically gets. Because it’s unpredictable which embedder it will go with, the expected value of going straight (in embedder $E_2$) is the probability that ADT ends up swerving, the expected value of swerving is $0.5$, and the expected value of randomizing lies in between. So argmax converges to always going straight, for an expected utility (in embedder $E_2$) equal to that swerving probability.