Musings on Exploration

[epistemic sta­tus: Polem­i­cal, rep­re­sen­ta­tive of my opinions and not those of any oth­ers, plau­si­bly flawed in some places, gen­er­ally en­dorsed.]

Look­ing at the two main set­tings of de­ci­sion the­ory so far, Proof-based de­ci­sion the­ory and Log­i­cal In­duc­tor de­ci­sion the­ory, they both have ex­plo­ra­tion steps.

Proof-based de­ci­sion the­ory has an im­plicit ex­plo­ra­tion step, where in mod­els of where there’s a proof that the agent doesn’t take a par­tic­u­lar ac­tion, it is pos­si­ble to prove ar­bi­trar­ily good out­comes, in­duc­ing the agent to take the ac­tion.

Log­i­cal In­duc­tor de­ci­sion the­ory has an ex­plicit ex­plo­ra­tion step, where if the agent is suffi­ciently cer­tain that it won’t take a par­tic­u­lar ac­tion, it takes that ac­tion.

Both of these are mo­ti­vated by the is­sue where, if it is cer­tain that an agent won’t take a par­tic­u­lar ac­tion, ar­bi­trar­ily bad guesses of what hap­pens if the agent takes an ac­tion can per­sist, and won’t be re­moved, be­cause the agent doesn’t take that ac­tion. Both forms of ex­plo­ra­tion step can be thought of as main­tain­ing a model/​tiny prob­a­bil­ity where the ac­tion is taken, so EDT-style con­di­tion­ing works. How­ever, con­di­tion­als are not coun­ter­fac­tu­als (This par­tic­u­lar link is ex­tra-im­por­tant to read)

Fur­ther, this isn’t an is­sue speci­fi­cally with 0-prob­a­bil­ity events, in gen­eral, the es­ti­mates of what hap­pens con­di­tional on a low-prob­a­bil­ity event are worse than es­ti­mates of what hap­pens con­di­tional on a high-prob­a­bil­ity event, as shown in the­o­rem 3.3 of the Op­ti­mal Poly-Time Es­ti­ma­tor pa­per. In short, if you’ve got an op­ti­mal es­ti­ma­tor for the in­di­ca­tor func­tion that dic­tates whether prop­erty holds, and you’ve got an op­ti­mal es­ti­ma­tor for the func­tion that’s 0 when is false, and when is true, then is an op­ti­mal es­ti­ma­tor for con­di­tional on L be­ing true. How­ever, the er­ror in the origi­nal op­ti­mal es­ti­ma­tors is blown up by a fac­tor of , where is the por­tion of the prob­a­bil­ity dis­tri­bu­tion where prop­erty holds. As the prob­a­bil­ity of some­thing shrinks, the er­ror in us­ing stan­dard con­di­tional prob­a­bil­ities in­creases be­cause a tiny er­ror in gets am­plified when you di­vide by , which is very small.

#Troll Bridge

Troll Bridge is the de­ci­sion the­ory prob­lem where there’s a troll that blows up a bridge that you want to cross pre­cisely when your ex­plo­ra­tion step is ac­tive. The util­ity of stay­ing on your side of the bridge is 0.5, the util­ity of cross­ing the bridge is 1, and the util­ity of cross­ing the bridge when it blows up is 0. It is the log­i­cal in­duc­tor ver­sion of a similar prob­lem for proof-based de­ci­sion the­ory. In this ver­sion, the troll blows up the bridge when PA is in­con­sis­tent, which is a nec­es­sary con­di­tion for proof-based de­ci­sion the­ory to “ex­plore” into sub­op­ti­mal ac­tions.

Troll bridge is ac­tu­ally not a good ar­gu­ment against EDT-style con­di­tion­ing with ex­plo­ra­tion steps. This is be­cause an analogue of it plau­si­bly ap­plies to pretty much any agent you could come up with. Ex­plo­ra­tion is in­ti­mately re­lated to the ex­plo­ra­tion-ex­ploita­tion trade­off, so, con­sult­ing Wikipe­dia’s list of multi-armed ban­dit al­gorithms...

For all the semi-uniform strate­gies based on ran­dom­iza­tion, there’s a clear cor­re­spon­dence to the ex­plo­ra­tion step ap­proach, and the troll blows up the bridge when­ever the agent ran­dom­izes.

The solu­tion from the pa­per, “Op­ti­mal Adap­tive Poli­cies for Markov De­ci­sion Pro­cesses” fea­tures two ex­plo­ra­tion mechanisms. The first is sys­tem­at­i­cally over­es­ti­mat­ing the util­ity gained by tak­ing un­taken ac­tions, and the sec­ond is tak­ing his­tor­i­cally low-prob­a­bil­ity ac­tions, at ap­prox­i­mately a log­a­r­ith­mic rate over time, as in den­sity zero ex­plo­ra­tion. Yet again, it is pos­si­ble to de­sign a var­i­ant of troll bridge where the agent starts off with a sig­nifi­cant un­der­es­ti­mate of the util­ity of cross­ing the bridge, and the crite­rion for bridge ex­plo­sion is the ex­plo­ra­tion clause be­ing ac­tive. The agent will con­verge to not cross­ing the bridge.

For pric­ing strate­gies, there is also a var­i­ant of troll bridge that defeats them. Each op­tion is as­so­ci­ated with a score that is the sum of ex­pected re­ward, and an es­ti­mate of ex­tra fu­ture re­wards gained through the ad­di­tional knowl­edge. So, if the troll’s bridge-ex­plo­sion crite­rion is the value of in­for­ma­tion for bridge-cross­ing be­ing above a cer­tain thresh­old, the agent can con­verge to not cross­ing the bridge.

As far as I can tell, the only strat­egy that doesn’t have some sort of tar­getable ex­plo­ra­tion be­hav­ior is Thomp­son sam­pling.

Even for hu­mans, if you were fac­ing troll bridge, and didn’t know how it worked, and the troll’s crite­rion for bridge ex­plo­sion was “the hu­man is think­ing “I feel like do­ing some­thing un­usual and see­ing what hap­pens”″, I wouldn’t be sur­prised if you also got pos­si­ble con­ver­gence to failure.

Causal in­fer­ence is greatly as­sisted by the abil­ity to perform ran­dom­ized con­trol­led tri­als, and be­cause troll bridge tar­gets ex­actly that key ca­pa­bil­ity, it prob­a­bly isn’t an en­vi­ron­ment where we should ex­pect op­ti­mal­ity. In gen­eral, if there’s some key as­sump­tion that is nec­es­sary to get good perfor­mance on an en­vi­ron­ment, we shouldn’t nec­es­sar­ily de­mand op­ti­mal­ity on prob­lems that vi­o­late that key as­sump­tion.

So, since you can tune a var­i­ant of troll bridge to work on most al­gorithms for han­dling the ex­plo­ra­tion/​ex­ploita­tion trade­off, and it vi­o­lates the abil­ity to perform ran­dom­ized con­trol­led tri­als of var­i­ous ac­tions (which is key for in­fer­ring the causal struc­ture of an en­vi­ron­ment)… It’s prob­a­bly just an un­fair en­vi­ron­ment and we shouldn’t ex­pect op­ti­mal­ity on it. As far as I can tell, it’s just de­signed to screw over al­gorithms with ex­plicit ex­plo­ra­tion steps.

#Against Ex­plo­ra­tion Steps

How­ever, even though Troll Bridge is a bad ar­gu­ment against ex­plo­ra­tion steps, ex­plo­ra­tion steps are still bad for com­pletely differ­ent rea­sons. In­finite ex­plo­ra­tion, when the en­vi­ron­ment con­tains ir­re­versible traps, guaran­tees catas­tro­phe. Ex­plo­ra­tion into the fol­low­ing ac­tions is not ad­vis­able.

Fur­ther, ex­plo­ra­tion steps seem to in­tro­duce an awk­ward dis­con­ti­nu­ity into the de­ci­sion crite­rion, where ex­pected util­ity is max­i­mized, ex­cept some­times an ex­plo­ra­tion step oc­curs. Due to the po­ten­tial for catas­tro­phe, I’d an­ti­ci­pate that any agent with an ex­plo­ra­tion step would self-mod­ify it away as it grew up. In the spe­cific case of AIXI, ac­cord­ing to the prob­a­bil­ity dis­tri­bu­tion over hy­pothe­ses at some finite time, the ex­pected util­ity of fol­low­ing the AIXI policy is greater than the ex­pected util­ity of fol­low­ing the AIXI+ex­plo­ra­tion policy.

The true rea­son to do ex­plo­ra­tion seems to be be­cause the agent be­lieves the ac­tion it is tak­ing will not lead to an ir­re­versible trap, and be­cause it be­lieves that the ac­tion will re­veal in­for­ma­tion about the true en­vi­ron­ment that en­ables a bet­ter policy later on, which in ex­pec­ta­tion up to the time hori­zon, out­weighs the tem­po­rary loss in­curred due to ex­plor­ing. AIXI works this way, as an ex­am­ple. The set­ting of Log­i­cal In­duc­tor de­ci­sion the­ory ren­ders anal­y­sis of this im­pos­si­ble, be­cause there is a sep­a­rate util­ity score for each round, in­stead of a cu­mu­la­tive util­ity score over the rounds, which is nec­es­sary to model the pos­si­bil­ity of fal­ling into ir­re­versible traps. The real world has traps, so it is un­wise to as­sume them away in the prob­lem set­ting. It seems use­ful to at­tempt to im­port this fea­ture of AIXI to Log­i­cal In­duc­tor de­ci­sion the­ory.

Ad­mit­tedly, there is a spe­cific class of failure in AIXI that is solved by ex­plo­ra­tion steps, and is rem­i­nis­cent of get­ting locked into a bad ac­tion for­ever by an en­forc­ing trader, that is de­tailed in this pa­per by Leike and Hut­ter Pretty much, if you pick a uni­ver­sal tur­ing ma­chine that as­signs high prob­a­bil­ity to a uni­verse where you go to hell if you do any­thing other than [string of ac­tions], the agent will just out­put [string of ac­tions].

As pre­vi­ously men­tioned, ex­plicit ex­plo­ra­tion steps would prob­a­bly be self-mod­ified away. Then how is this is­sue to be solved? For early prun­ing of hy­pothe­ses, the posts on Del­ega­tive In­verse Re­in­force­ment Learn­ing provide an an­swer. The agent defers to the hu­man for a choice of ac­tion when it sus­pects some­thing might be an ir­re­versible trap (as in the heaven-hell ex­am­ple from Leike and Hut­ter), and this per­mits hu­man-as­sisted iden­ti­fi­ca­tion of ob­vi­ous traps/​not traps. The agent up­dates on this and can asymp­tot­i­cally learn to take bet­ter ac­tions than the hu­man, and over rounds, defers to the hu­man on rounds.

Once the agent has figured out some of how the world works, most en­vi­ron­ments/​hy­pothe­ses where there is a trap have ev­i­den­tial clues el­se­where to rule them out. The world in which firing a full nu­clear ar­se­nal is dan­ger­ous, can be dis­t­in­guished by smaller-scale nu­clear test­ing and mod­el­ing in un­in­hab­ited ar­eas. The GUT-scale par­ti­cle ac­cel­er­a­tor that may in­duce vac­uum col­lapse pre­sum­ably has the differ­ence in phys­i­cal laws be­ing testable at a lower en­ergy scale. For “bayesian para­noia” of the sort “if I take [some ac­tion], it will lead to an ir­re­versible loss of util­ity”, it doesn’t seem much of an is­sue. There would be a K-com­plex­ity penalty on the or­der of “num­ber of bits to spec­ify an ad­di­tional phys­i­cal law that kicks in upon tak­ing a par­tic­u­lar ac­tion, and at no other time”. To cause prob­lems, it isn’t enough to just have an en­vi­ron­ment that im­plies that an ac­tion is an ir­re­versible trap, this en­vi­ron­ment also must be in­dis­t­in­guish­able from the en­vi­ron­ment where an ac­tion isn’t an ir­re­versible trap, by any means ex­cept try­ing it and see­ing what hap­pens. So I don’t ex­pect ir­ra­tional aver­sion of ob­vi­ously-not-traps to be an is­sue in prac­tice, and this ap­plies with even more force when del­ega­tive ap­proaches are thrown into the mix.


Ex­plo­ra­tion steps are in­el­e­gant, will prob­a­bly be self-mod­ified away, and im­ply un­ac­cept­ably bad be­hav­ior in re­al­ity.

At the same time, troll bridge is an un­fair ar­gu­ment against ex­plo­ra­tion steps be­cause it ex­plic­itly tar­gets one of the fea­tures which en­ables learn­ing causal struc­ture.

The best fea­si­ble re­sult seems to be some­thing like vanilla AIXI. Speci­fi­cally, ex­plo­ra­tion should be mo­ti­vated the re­sult­ing re­duc­tion in the hy­poth­e­sis space to en­able bet­ter ac­tions in the fu­ture. Also, the set­ting of Log­i­cal In­duc­tor de­ci­sion the­ory is not right for study­ing this be­hav­ior.

Ad­mit­tedly, AIXI does have the is­sue of pos­si­bly get­ting per­ma­nently locked into bad hy­pothe­ses where it para­noid of pos­si­ble traps. How­ever, a bet­ter solu­tion would prob­a­bly be DIRL’s ap­proach of defer­ring to the hu­man early on. Later on, I an­ti­ci­pate that the ma­jor­ity of wor­lds with spu­ri­ous traps can be dis­t­in­guished from the real world by some ev­i­dence, which will be sought out.