Failures of UDT-AIXI, Part 1: Improper Randomizing

While attempting to construct a UDT-like variant of AIXI, the algorithm I wrote down turned out to have a bunch of problems with it that were blocking attempts to prove that it had various nice properties. This will be the first post in a sequence about the various nonobvious things that go wrong when you try to make UDT-AIXI. (Check back here for more links)

Notation and Setup:

To begin with, the basic setting was one where the Turing machines in the environment had oracle access to the policy. In the first attempt, where $O$ was the set of observations, $O^*$ was the set of all finite strings of observations, and $A$ was the set of actions, policies had the type signature $O^* \to \Delta A$, and environments were Turing machines that functioned by having the option of making oracle calls to the policy with various observation strings to see what action was returned. If the policy randomizes on an input $h$, then the action returned by that oracle call is simply drawn from the distribution $\pi(h)$.
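As a minimal type-level sketch of this setup (the Python names below are mine, not the post's, and a distribution over $A$ is modeled as a finite dictionary of probabilities):

```python
from typing import Callable, Dict, Tuple

Observation = str
Action = str
ObsString = Tuple[Observation, ...]           # an element of O*
Distribution = Dict[Action, float]            # an element of Delta(A)

# A policy maps a finite observation string to a distribution over actions.
Policy = Callable[[ObsString], Distribution]  # pi : O* -> Delta(A)

# The environment is handed oracle access to the policy: each query on an
# observation string returns a single action sampled from pi's distribution.
Oracle = Callable[[ObsString], Action]
Environment = Callable[[Oracle], float]       # produces a reward
```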

Note that we're just using $O^*$ instead of the AIXI input type of $(A \times O)^*$; this is because the actions that the agent sees on the input tape can just be seen as a sort of observation, like any other, and this generalization permits looking at cases where the agent may not have an internal sense of which action it took, or cases where the agent gets to peek at what it did in hypothetical situations that didn't happen.
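For concreteness, the two input types being contrasted are:

$$\pi_{\text{AIXI}} : (A \times O)^{*} \to \Delta A \qquad \text{versus} \qquad \pi : O^{*} \to \Delta A$$

where, in the second signature, any record of past actions that the agent does get to see has been folded into the observations themselves.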

Policy Search and Nash Equilibria:

Glossing over issues of non-halting environments, we can reframe the UDT search for the best policy as the result of an infinite game, where each player is indexed by an observation string $h \in O^*$ and responds with a (probability distribution over) move(s) in $A$. Each cell corresponds to a deterministic policy of type $O^* \to A$, and the utility in that cell is the expected reward over the standard mixture of environments, when the environments are making oracle calls to that selected policy. All players conveniently have the same utility function, which makes things easier. The optimal deterministic policy is a Nash equilibrium, because a deviation by any of the players would result in an equal or lower expected utility.
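To make the game concrete, here is one way to write down the payoff and the Nash condition (notation mine: $\xi$ for the mixture over environments, $R(e^{\pi})$ for the reward environment $e$ produces when its oracle calls are answered by $\pi$, $U$ for the common utility):

$$U(\pi) \;=\; \mathbb{E}_{e \sim \xi}\big[\, R(e^{\pi}) \,\big]$$

and the claim that the optimal deterministic policy $\pi^{*}$ is a Nash equilibrium is that no single player $h$ can gain by unilaterally changing its answer:

$$\forall h \in O^{*},\ \forall a \in A: \quad U(\pi^{*}) \;\ge\; U\big(\pi^{*}[h \mapsto a]\big)$$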

This rephrases the problem of selecting a policy as an equilibrium selection problem, because even when all players have the same utility function, there can be inadequate equilibria. Consider a multiplayer game of stag hunt where rabbits are shared equally among the group, but it's still better for everyone to team up and capture the stag than for everyone to catch a rabbit. The all-rabbit equilibrium is a Nash equilibrium that isn't the best Nash equilibrium. This case seems simple to resolve, but there are much more difficult examples, such as a MAX-3-SAT game where each hypothetical version of the agent in Omega's head has incompatible observation strings, some short, some very long, and each is responsible for fixing the value of one variable.
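As a toy numeric check of the stag hunt point (the payoff numbers below are made up for illustration, not from the post), a minimal sketch in Python:

```python
# 3-player stag hunt: rabbits are shared equally by the group, and the stag
# is only captured (and shared) if everyone hunts it.
N = 3
STAG_VALUE = 12.0    # shared equally if *every* player hunts the stag
RABBIT_VALUE = 1.0   # value of each caught rabbit

def payoffs(profile):
    """profile: a tuple of N choices, each 'stag' or 'rabbit'."""
    rabbits = profile.count('rabbit')
    if rabbits == 0:
        return [STAG_VALUE / N] * N              # stag captured: 4.0 each
    return [rabbits * RABBIT_VALUE / N] * N      # only rabbits, shared equally

def is_nash(profile):
    """Check that no single player gains by unilaterally deviating."""
    base = payoffs(profile)
    for i in range(N):
        for deviation in ('stag', 'rabbit'):
            alt = list(profile)
            alt[i] = deviation
            if payoffs(tuple(alt))[i] > base[i] + 1e-9:
                return False
    return True

print(is_nash(('rabbit',) * N), payoffs(('rabbit',) * N)[0])  # True, 1.0
print(is_nash(('stag',) * N), payoffs(('stag',) * N)[0])      # True, 4.0
```

Both all-rabbit and all-stag pass the Nash check, but all-stag pays everyone more, which is exactly the equilibrium selection problem.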

Problem 1: Improper Randomization

Setting the Vingean and game theory issues to one side, there's another thing that's going wrong here. The stated setup of how the policy interacts with the environment is incompatible with this Nash equilibrium view of UDT.

As a simple example of what's going wrong, consider a single-turn game against an environment that makes two queries to what the policy does, and returns a reward of 1 if they are the same, and 0 if they are different. In standard game theory, randomizing between two actions with identical payoff should have the same expected payoff. But here there's a penalty for randomizing! And because each call to the policy returns an action sampled from the distribution $\pi(h)$, identical oracle calls to the policy may return different outputs. The proper way to view this, to make it compatible with the Nash equilibrium view, is to sample the action first, and then plug it into the open slots in the environment. You only randomize once.
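Here is a minimal simulation of that single-turn game (names are illustrative): a policy that randomizes 50/50 gets an expected reward of about 0.5 when each oracle call re-samples independently, versus 1 when the action is sampled once and plugged into both slots.

```python
import random

def pi(obs_string):
    # The policy randomizes 50/50 on every input.
    return {"a": 0.5, "b": 0.5}

def sample(dist):
    actions, weights = zip(*dist.items())
    return random.choices(actions, weights=weights)[0]

def reward_per_call_randomization():
    # Each oracle call independently re-samples from pi: identical queries
    # can come back with different answers.
    return 1.0 if sample(pi(())) == sample(pi(())) else 0.0

def reward_randomize_once():
    # Sample the action once, then fill both slots in the environment with it.
    a = sample(pi(()))
    return 1.0 if a == a else 0.0   # always 1.0

trials = 100_000
print(sum(reward_per_call_randomization() for _ in range(trials)) / trials)  # ~0.5
print(sum(reward_randomize_once() for _ in range(trials)) / trials)          # 1.0
```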

This isn’t quite as triv­ial as it looks, be­cause Reflec­tive Or­a­cles have this “in­trin­sic ran­dom­ness” thing go­ing on, two iden­ti­cal or­a­cle calls to the same agent may re­turn differ­ent out­puts. In Abram’s words, re­flec­tive or­a­cles are an­noy­ingly re­al­ist about prob­a­bil­ity, they act as if a agent can ran­dom­ize in such a way as to take differ­ent ac­tions in the same in­for­ma­tion state. The proper way to do this is to sam­ple a de­ter­minis­tic policy first, and then plug it into the slots in the en­vi­ron­ment. When think­ing about a situ­a­tion where an agent (pos­si­bly your­self) is ran­dom­iz­ing, you may not know what ac­tion it will take, but you should think that the agent will take the same ac­tion in an iden­ti­cal state.

The obvious hack is to just cache the result every time the environment makes an oracle call, because if you compute it up-front, the policy sample will probably take an infinite amount of information to specify. But if you apply this hack and cache what was said about the policy so far, you run into another issue with non-halting environments. Because Reflective Oracle-AIXI calls an oracle on an environment to get the next bit, instead of just running the environment (maybe the environment doesn't halt, and you still need to get the next bit somehow), you don't get the policy sample. If the partial policy sample used by the environment on turn $n$ takes less than $f(n)$ bits to encode (where $f$ is computable), you can use the oracle to recover it and use that information on the next turn, but again this fails in full generality. The problem arises when you've got an environment that, on some turn, with some positive probability, doesn't halt and makes infinitely many oracle calls. Then there's no way (that I know of, yet) to guarantee that the behavior of the policy on future turns is consistent with what it did on that turn.
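A sketch of that caching hack in Python (helper names are mine): memoize each oracle answer so repeated queries within one episode agree, which amounts to lazily sampling a deterministic policy.

```python
import random

def sample(dist):
    """Draw one action from a {action: probability} dictionary."""
    actions, weights = zip(*dist.items())
    return random.choices(actions, weights=weights)[0]

def make_cached_oracle(pi):
    """Wrap a policy so identical oracle calls return identical answers."""
    cache = {}
    def oracle(obs_string):
        if obs_string not in cache:
            cache[obs_string] = sample(pi(obs_string))  # randomize once per input
        return cache[obs_string]
    # `cache` is the partial deterministic policy sample built up so far.
    return oracle, cache
```

If the environment halts after finitely many queries, the cache is a finite object that could in principle be recovered and carried forward; the failure mode described above is exactly the case where, with positive probability, it never becomes one.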

Problem 2: Can't Build Desired Oracle

What if we then attempt to get a new notion of oracle where we have a probability distribution over samples? That might have promise, because only a single sample would be used. For instance, the type of a sample would be $\mathbb{M} \to \{0,1\}$ (where $\mathbb{M}$ is the set of oracle machines), and we'd be looking for a fixed point in $\Delta(\mathbb{M} \to \{0,1\})$, i.e., a distribution over samples such that the probability of drawing a sample that maps $M$ to 0 is the same as the probability of $M$ outputting 0 when it selects a sample from the given distribution.
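Spelled out (notation mine: $s$ for a sample, $\mu$ for the distribution over samples), the fixed-point condition would be:

$$\forall M \in \mathbb{M}: \quad \Pr_{s \sim \mu}\big[\, s(M) = 0 \,\big] \;=\; \Pr\big[\, M \text{ outputs } 0 \text{ when its sample is drawn from } \mu \,\big]$$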

Then it doesn’t straight­for­wardly fol­low from the Kaku­tani fixed-point-the­o­rem, be­cause one of the re­stric­tions is that the set of in­ter­est is a nonempty, com­pact, con­vex sub­set of a lo­cally con­vex Haus­dorff space. Fix­ing some bi­jec­tion be­tween and so we’re look­ing for a fixed-point in , we run into the is­sue that this space isn’t com­pact un­der any of the nat­u­ral topolo­gies, and even re­strict­ing to com­putable func­tions from (and bi­ject­ing that with ), isn’t com­pact un­der any of the nat­u­ral topolo­gies ei­ther.

Due to this issue, the earlier post about oracles that result in correlated equilibria is incorrect, because the space of interest suffers from this lack of compactness, and the topological preconditions for Kakutani applying weren't checked. The result still holds up for finitely many players with finitely many moves, because we don't need a distribution over $\mathbb{N}$ for that, only a distribution over finitely many integers, so the fixed-point theorem goes through in that case.
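In the finite case the relevant space is just the simplex over finitely many outcomes,

$$\Delta^{n} = \Big\{ x \in \mathbb{R}^{n} : x_i \ge 0,\ \textstyle\sum_{i} x_i = 1 \Big\},$$

which is nonempty, compact, and convex, so the theorem's hypotheses are satisfied there.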