Policy Selection Solves Most Problems

It seems like logically updateless reasoning is what we would want in order to solve many decision-theory problems. I show that several of the problems which seem to require updateless reasoning can instead be solved by selecting a policy with a logical inductor that’s run a small amount of time. The policy specifies how to make use of knowledge from a logical inductor which is run longer. This addresses the difficulties which seem to block logically updateless decision theory in a fairly direct manner. On the other hand, it doesn’t seem to hold much promise for the kind of insights which we would want from a real solution.


LI Policy Selection

Rather than running a logical inductor all the way to $\mathbb{P}_n$ and making a decision via the expected utilities implied by that distribution, we want to first run it to $\mathbb{P}_{f(n)}$ for $f(n) \ll n$, and use that distribution to make a decision about how to utilize $\mathbb{P}_n$ to determine which action to take. (For example, we could choose $f$ logarithmic in $n$.)

For simplicity, I will assume logical induction over exponential-time traders rather than the usual poly-time assumption. I also assume that numbers can be represented efficiently in the language of the logical inductor, IE, the written length is logarithmic in the number.

We can select from a set of policies $\Pi_n$, each of which is a program taking the market state $\mathbb{P}_n$ and outputting an action. With the selected policy $\pi^*_n$ and the action $A$ actually taken as in the following:

$$\pi^*_n := \operatorname{argmax}_{\pi \in \Pi_n} \mathbb{E}_{f(n)}\big[\, U_n \mid \pi_n = \pi \,\big], \qquad A := \pi_n(\mathbb{P}_n),$$

where $\pi_n$ is the policy actually used: $\pi^*_n$, except on exploration rounds.

The utility $U_n$ is a function of $n$ because $n$ points to the universe where the agent lives. The argmax is taken using the expectations $\mathbb{E}_{f(n)}$, in which $n$ is uncertain, because $n$ may be too large for $\mathbb{P}_{f(n)}$ to have good beliefs about; the sentences involving $n$ can be so large that they aren’t being traded on the market yet. However, since we are using a market which can’t be exploited by exp-time traders, $\mathbb{P}_{f(n)}$ must have good beliefs about sentences whose length is asymptotically a multiple of the length of $n$. (The uncertainty should not harm us much, since the policy can look at sentences involving the true $n$ and act accordingly.)
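
To make the moving parts concrete, here is a minimal Python sketch of the selection step, under stated assumptions: `run_inductor`, `policy_set`, `conditional_expectation`, and `utility` are hypothetical stand-ins rather than anything from the logical induction formalism, and `random` stands in for the pseudorandom diagonalization that actually drives exploration.

```python
import random  # stand-in for the pseudorandom diagonalization used for exploration

def policy_selection_agent(n, f, policy_set, run_inductor, utility, explore_prob=0.05):
    """Sketch of LI policy selection: pick a policy using the early market state
    P_{f(n)}, then let that policy act on the late market state P_n."""
    early_market = run_inductor(f(n))  # P_{f(n)}: weak beliefs, used only to pick a policy
    late_market = run_inductor(n)      # P_n: strong beliefs, consulted by the chosen policy

    candidates = policy_set(n)         # the (slowly) growing policy set

    # Expected utility of each candidate, as estimated by the early market,
    # conditional on that candidate being the policy actually used.
    def early_eu(pi):
        return early_market.conditional_expectation(utility, policy=pi)

    chosen = max(candidates, key=early_eu)

    # Exploration: occasionally use a pseudorandomly chosen policy instead, so the
    # early market eventually learns accurate estimates for every candidate.
    if random.random() < explore_prob:
        chosen = random.choice(candidates)

    return chosen(late_market)         # the policy maps P_n to an action
```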

The exploration events (whether to explore, and which policy to use when exploring) are made pseudorandom relative to the market state at $f(n)$ by diagonalizing against it, like the exploration defined here. The subscript is on $\Pi_n$ because we need to grow the policy set, to ensure that the optimal policy is eventually included. This must be done somewhat carefully. If $\Pi_n$ grows too quickly, then it could be that at any $n$ the majority of policies have very inaccurate expected utility estimates, even though it converges to an accurate estimate for each individual policy. This is easily prevented, however. If $\Pi_n$ did indeed grow too quickly, then it could be predicted that the expected value of the strategy chosen by the argmax would be systematically lower than its market estimate. Suppose we knew it was systematically overestimated by $\delta$. Although the traders at $f(n)$ might not be able to predict precisely which policy will be chosen by the argmax (due to imprecision in the continuous-function trading strategies), a trader can find the set of policies whose expected values are within some $\epsilon$ of the maximum. If $\epsilon < \delta$, that trader can make money by betting that all of the top policies are overestimated.

So, we just need to grow $\Pi_n$ slowly enough that a trader can implement such a strategy in poly time (poly in $n$, that is, since we’re doing this at the early market time), with $\epsilon$ shrinking as $n$ gets larger. The details of what speed this implies seem a bit tedious to work out, since I’d have to dig into the formalism for expressing trading strategies and the runtimes involved. But, it seems quite likely that there’s some rate of growth at which this works.
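
As a rough illustration of the exploiting trader described above (ignoring the continuous-function trading formalism entirely), here is a sketch assuming a hypothetical market interface with `price` and `sell` and a hypothetical `eu_sentence` constructor for the conditional-expected-utility claims:

```python
def bet_against_top_policies(market, candidates, eu_sentence, epsilon):
    """Sketch of the trader that exploits a too-fast-growing policy set: if the
    argmax policy is systematically overestimated by delta > epsilon, then shorting
    every policy whose estimate is within epsilon of the maximum is guaranteed to
    short the policy that actually gets chosen."""
    # Current market estimates of E[U | policy = pi] for each candidate policy.
    estimates = {pi: market.price(eu_sentence(pi)) for pi in candidates}
    top_value = max(estimates.values())

    # Every policy close enough to the top that one of them must be the argmax.
    near_top = [pi for pi, value in estimates.items() if value >= top_value - epsilon]

    # Bet that each near-top policy's conditional expected utility is overestimated.
    for pi in near_top:
        market.sell(eu_sentence(pi), amount=1.0 / len(near_top))
```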

This is not quite just a reformulation of son-of-X. It’s true that if we’re letting the early market state choose any program to run, it doesn’t matter much that we feed $\mathbb{P}_n$ to that program—it could recreate it itself, or any number of alternate belief structures which might be preferred. However, most of the interesting policies in cases we’ve thought about seem to in fact be fairly restricted in runtime themselves, letting $\mathbb{P}_n$ do most of the thinking. For example, implementing the LIDT policy of taking the max-expected-utility action just requires comparing the expected value of each action. Implementing fairbot in prisoner’s dilemma just requires looking at the probability that the other player cooperates, and the probability of a self-diagonalization sentence for pseudorandomness. And so on. (Although it makes sense to put runtime restrictions on policies, I won’t specify any here, since it wouldn’t really help achieve any interesting properties. Ideally we would want some sort of structure to the good policies to render them legible, rather than just blindly choosing programs.)
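
For illustration, here are sketches of two such lightweight policies as functions of the late market state; the `prob` and `expected_utility` methods and the sentence arguments are assumed interfaces for the sketch, not part of any actual implementation.

```python
def lidt_policy(market, actions, utility):
    """LIDT-style policy: take the action with the highest expected utility under P_n."""
    return max(actions, key=lambda a: market.expected_utility(utility, action=a))

def fairbot_policy(market, opponent_cooperates, diag_sentence):
    """FairBot-style policy for the prisoner's dilemma: cooperate with probability
    equal to P_n's estimate that the opponent cooperates, using the probability of
    a self-diagonalizing sentence as the source of pseudorandomness."""
    p_coop = market.prob(opponent_cooperates)
    pseudorandom = market.prob(diag_sentence)  # treated as a roughly uniform draw in [0, 1]
    return "cooperate" if pseudorandom < p_coop else "defect"
```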

On the positive side, this sort of policy selection is similar to the idea of observation-counterfactuals mentioned in the happy dance problem; the market state is the observation. It’s also an implementation of predictable exploration, so there’s some intuitive grounds to think that the counterfactuals will make more sense than those of regular epsilon-exploration, and be more useful for game theory (at least, I think so).

I also think there’s a sense in which this approach recovers the hopes of ADT, providing a way to get Agent Simulates Predictor right without getting game theory horribly wrong. In effect, the approach here uses the early-stage logical inductor as the embedding function; its state of ignorance allows it to believe in the correlations needed to see the implications of a policy choice, unlike LIDT. Certainly it is a better approach to solving the problems asymptotic decision theory was trying to solve, since it produces “updateless-like” behavior without an arbitrary placement of the decision problem in a sequence of decision problems.

On the other hand, policy selection is just a dirty trick which doesn’t provide any insight and which in practice would depend on careful choice of the function $f$, so that the policy selection is done after beliefs have stabilized to a “sensible” degree but before the critical information which we’d like to behave updatelessly about arrives. No principled solution to this problem is offered here; only asymptotic results. I will be speaking of what strategy policy selection “will converge to”; but note, of course, that we’re slowing down convergence by the function $f$.

Nonetheless, a dirty hack that cleanly solves more problems at once than any other dirty hack we know of is worth recognizing as such. So, let’s get on with the show.

5 and 10

This one barely merits mentioning, but, it solves the 5 and 10 problem: if there is an option where the agent gets utility 5, and an option where it gets utility 10, the policy-selection agent converges to taking the utility 10 option. This is because its exploration ensures that it eventually has correct beliefs about the result of using different policies; it can’t think taking the 10 gives −10 utility, never take the 10, and then never learn better.

Counterfactual Mugging (with a logical coin)

This one is also pretty straightforward. In counterfactual mugging with a logical coin, policy selection converges to giving Omega the money so long as the difficulty of computing the coin exceeds the power of the market at $f(n)$ time. This is because, before we know which way the coin lands, the expected value of giving Omega the money is higher than that of not doing so.
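
For concreteness, with the usual illustrative stakes (hand over \$100 on one outcome of the coin; receive \$10,000 on the other outcome if you would have handed it over; any similar numbers work), the early market still treats the coin as roughly 50-50, so it computes something like

$$\mathbb{E}_{f(n)}[U \mid \text{pay}] \approx \tfrac{1}{2}(10{,}000) + \tfrac{1}{2}(-100) = 4950 \;>\; 0 \approx \mathbb{E}_{f(n)}[U \mid \text{refuse}],$$

and the paying policy wins the argmax.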

One might object that the cases where the coin is computable at $f(n)$ time aren’t solved by this approach. However, you can always choose a slower-growing $f$ to include more cases. The trade-off is that a slower-growing $f$ will take longer to converge to reasonable policies. As far as I’m concerned, this is a real trade-off, reflecting the fact that to accept counterfactual mugging for easier-to-compute logical coins, you have to accept crazier possibilities. The problem with policy selection is more that it provides no guidance with respect to the choice of $f$.

Agent Simulates Predictor

In proof-based DT, ASP is set up by giving the agent more processing power than the predictor, but giving the predictor more proof-theoretic strength than the agent. This allows the predictor to prove what the agent will do, while allowing the agent to find the output of the predictor by simulating it accurately.

In the realm of logical induction, the two dimensions collapse into one, since a logical inductor with enough processing power will learn to trust the outputs of stronger proof systems. However, we can still set up a scenario similar to ASP. The agent has time to compute $\mathbb{P}_n$, while the predictor only has time to run $\mathbb{P}_{g(n)}$, for $f(n) \ll g(n) \ll n$. The predictor then fills the box based on its expectation that the agent will one-box. It could have some cutoff, filling the box if the agent one-boxes with 95% probability or greater, say. Or, it could fill the box with money proportional to its belief that the agent will one-box.

Although the predictor may not be exactly accurate, it will be a good predictor in an asymptotic sense: if the agent converges to either one-boxing or two-boxing, the predictor eventually predicts this correctly all the time. On the other hand, as long as $g(n)$ is small enough, the agent can see exactly what the predictor will do.

The idea, of course, is that policy selection does well here as long as $f(n)$ is sufficiently small compared to $g(n)$. If the predictor can simulate the computation which selects the policy, then it will respond to that policy. So, the policy selection can get the best score by predictably one-boxing. If the predictor uses a cutoff like the 95% mentioned earlier, the policy selection can one-box with a probability just slightly higher than that. Or, if the predictor fills the box with money proportional to its trust in the agent, the agent will select the policy which one-boxes all the way. In other words, policy selection seems to just do the right thing.
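
A sketch of the two predictor variants and of the policies the selection step plausibly converges to against them; the `prob` interface, the sentence names, and the specific margin are assumptions made for the sake of illustration.

```python
def threshold_predictor(mid_market, agent_one_boxes, cutoff=0.95):
    """Predictor running P_{g(n)}: fill the opaque box iff it assigns the agent
    at least `cutoff` probability of one-boxing."""
    return mid_market.prob(agent_one_boxes) >= cutoff

def proportional_predictor(mid_market, agent_one_boxes, prize=1_000_000):
    """Predictor variant: fill the box with money proportional to its trust."""
    return prize * mid_market.prob(agent_one_boxes)

def policy_vs_threshold(late_market, diag_sentence, cutoff=0.95, margin=0.01):
    """Against the threshold predictor: one-box with probability just above the cutoff,
    using a self-diagonalizing sentence for pseudorandomness."""
    return "one-box" if late_market.prob(diag_sentence) < cutoff + margin else "two-box"

def policy_vs_proportional(late_market):
    """Against the proportional predictor: simply one-box all the way."""
    return "one-box"
```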

Planting Flowers

Suppose that the agent is created for the purpose of planting flowers, but is given a utility function in some arbitrary way. However, the creator examined the code very carefully, and only ran the agent if it would predictably plant flowers. Planting flowers is costly, but not more costly than never having existed. Should the agent plant flowers?

This is very similar to the previous problem, except the agent is reasoning about whether it exists, rather than whether a box has some money in it. $f(n)$ must be chosen to be small enough that the agent doesn’t know yet whether it exists, at policy-selection time. Whether this is feasible depends on the problem statement. If the entire problem is packed into $U_n$, like we’ve been doing so far, then the agent’s existence isn’t implied; $U_n$ must also evaluate the utility of worlds where the agent isn’t created. Then, the problem seems just like Agent Simulates Predictor: it can be solved by choosing $f$ slow enough that the creator’s decision is not known at policy-selection time. On the other hand, it might seem very reasonable to put information in the deductive state from which the agent can infer its own existence, such as axioms about itself, or sense-data. This would prevent policy selection from solving the problem so easily, since the policy must be selected before self-existence can be inferred.

XOR Blackmail

Setting up XOR blackmail is a bit tricky, especially the case with a perfect predictor (or any predictor who knows the agent better than the agent knows itself, really).

In order for the agent to know whether it got a letter, the letter has to be written directly to the deductive state of $\mathbb{P}_n$. But, also, the predictor needs to know what the agent’s response to that is in order to send the letter or not. So, we have a simulated agent whose deductive state includes the letter, whether the letter is really sent or not. The response of the simulated agent is observed by the predictor, as is the information concerning whether the rare disaster has occurred; then, the predictor sends the letter in the case that the disaster will occur xor the agent will pay up in response to the letter.

If the policy selection chooses not to respond to the letter, the predictor would expect this, and not send the letter except in the case of disaster. On the other hand, if the policy does respond to the letter, then the letter is sent except in case of disaster. Either way, the probability of disaster does not change. So, policy selection will converge to choosing a policy which doesn’t respond to the letter.
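
Using the usual illustrative numbers for this problem (a disaster costing \$1,000,000 that occurs with probability $p$, and a demanded payment of \$1,000; the exact figures are not important), the comparison the policy selection faces is

$$\mathbb{E}[\text{loss} \mid \text{policy: pay}] = p \cdot 1{,}000{,}000 + (1-p)\cdot 1{,}000, \qquad \mathbb{E}[\text{loss} \mid \text{policy: refuse}] = p \cdot 1{,}000{,}000,$$

since a paying policy receives the letter exactly when there is no disaster, while a refusing policy receives it exactly when there is. The disaster probability $p$ is untouched either way, so refusing is better by $(1-p)\cdot 1{,}000$.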

Prisoner’s Dilemma

I don’t know what space of behaviors is really possible in this case. But, a plausible outcome is the NicerBot strategy: cooperate with probability slightly higher than your belief that the other player will cooperate. Cooperating with the probability that they do incentivises cooperation, and adding a little probability of cooperation on top of this ensures convergence to cooperation rather than other fixed-points when both players play NicerBot. (This isn’t actually an equilibrium; if one player plays NicerBot, the other is incentivised to cooperate with slightly less probability. But, it’s plausible that it bounces around near NicerBot.) The randomness for this can come from self-diagonalizing sentences, like the randomness for exploration.
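
A sketch of NicerBot as a policy over the market state; the niceness increment, the `prob` interface, and the sentence arguments are assumptions for illustration only.

```python
def nicerbot_policy(market, opponent_cooperates, diag_sentence, niceness=0.01):
    """NicerBot: cooperate with probability slightly higher than the market's
    estimate that the other player cooperates."""
    p_coop = min(1.0, market.prob(opponent_cooperates) + niceness)
    # Pseudorandomness from a self-diagonalizing sentence, as with exploration.
    pseudorandom = market.prob(diag_sentence)
    return "cooperate" if pseudorandom < p_coop else "defect"
```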

In general, there could be strategies which access a lot of other sentences in the market state, rather than just the ones relevant to the game. NicerBot is a policy which is a function of the other player’s probability of cooperation. I could also have a policy which is a function of some features of the other player’s policy. There’s some resemblance to the meta-threat hierarchy, though it likely differs greatly in detail and would need modifications (such as some kind of multi-stage meta-policy selection) in order to mimic meta-threat.

There are a lot of other logical propositions which the agent’s strategy can utilize, too, beyond anything about the other player’s strategy or even anything going on in the game. This makes it seem like the situation will resemble correlated equilibria more than Nash equilibria.

The situation will also likely shift dramatically if the agents have different $f$. Overall, I’m quite curious what the game theory between policy-selection agents will look like. (Granted, we don’t even know very well what the game theory between LIDT agents looks like.)

Troll Bridge

This deserves its own more detailed write-up. I have not checked the proof sketch in detail. However, I’m fairly sure policy selection does not solve Troll Bridge. (Ideas in this section are mainly due to Sam. Oversimplifications in this section are mainly due to me.)

Suppose that the agent wants to cross a bridge, but there is a troll who blows up the bridge exactly when the exploration clause in the agent is active. In our case, that is when the pseudorandom diagonalization event driving exploration fires. Obviously, this poses a problem for learning the right counterfactuals. The agent could easily start (due to a poor choice of prior) thinking that it is a bad idea to cross the bridge. Early policies would refuse to cross the bridge except on exploration steps. Experience would seem to reinforce the bad prior, ensuring that the agent always thinks crossing the bridge is a bad idea.

However, the situation is actually much worse than this. Even with a “good” prior which initially thinks it can safely cross the bridge on non-exploration rounds, it seems we will converge to refusing to cross the bridge.

Let $\mathrm{cross}$ indicate crossing the bridge, and $\neg\mathrm{cross}$ indicate not crossing. Let $\square\,\phi$ mean that the logical inductor’s deductive state has a proof of $\phi$ by stage $n$. $U_n$ is defined as follows:

$$U_n = \begin{cases} -10 & \text{if } A = \mathrm{cross} \text{ and the exploration clause is active (the troll blows up the bridge)} \\ +10 & \text{if } A = \mathrm{cross} \text{ and the exploration clause is not active} \\ 0 & \text{if } A = \neg\mathrm{cross} \end{cases}$$

Here’s the proof sketch, using indentation to notate reasoning under a supposition, and de-indentation to come back out:

  • Suppose $\square(A = \mathrm{cross} \to U_n = -10)$, for some $n$.

    • Further suppose that $A = \mathrm{cross}$.

      • Either the exploration clause is active, or it is not.

      • But, if the exploration clause is not active, then we should have $A = \neg\mathrm{cross}$, for $n$ large enough that the logical inductor at $n$ is approximately consistent with respect to this kind of thing.

      • So, for $n$ sufficiently large, we must have the exploration clause active. But then, $U_n = -10$.

    • Therefore, $A = \mathrm{cross} \to U_n = -10$ for $n$ sufficiently large.

  • Therefore, $\square(A = \mathrm{cross} \to U_n = -10) \to (A = \mathrm{cross} \to U_n = -10)$ for $n$ sufficiently large.

  • The agent can reason through all this. So, by an application of Löb’s theorem, the agent concludes $A = \mathrm{cross} \to U_n = -10$ for $n$ sufficiently large (the schema is spelled out just after this list).

  • The logical inductor at $f(n)$ can know whether $n$ is sufficiently large by checking whether its beliefs are close enough to consistent with respect to “that kind of thing”.

  • Therefore, crossing looks bad at policy-selection time, so $A = \neg\mathrm{cross}$ except on exploration rounds.
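
For reference, the Löb step assumed in the sketch, with $P$ abbreviating $A = \mathrm{cross} \to U_n = -10$: if the agent’s deductive system proves the implication derived in the bullets above, Löb’s theorem upgrades it to an outright proof of $P$:

$$\vdash\; \square P \to P \qquad\Longrightarrow\qquad \vdash\; P.$$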

So, even if the agent starts with a “good” distribution on traders which predicts that crossing the bridge on non-exploration rounds is the best policy, at some point the agent finds the above proof. From then on, it avoids the bridge.

It might be tempting to say that policy selection improves the situation somewhat; a small enough $f(n)$ can prevent the proof from going through, so isn’t it just a tradeoff of slow-growing $f$ solving more decision problems asymptotically but having worse beliefs at small runtimes, as with the decision problems considered earlier? No. No matter what $f$ you choose, so long as it does keep growing, it’ll eventually start seeing the above proof.

Reflective Consistency

In order to talk about whether a policy-selection agent would try to replace itself with a different kind of agent, we have to decide how to represent sequential decisions. The sequence of decision problems $U_n$ is not supposed to represent a sequential decision problem: the agent is only supposed to care about optimizing the single instance $U_n$, not any kind of aggregation over the sequence.

We could select a single policy for a whole sequential decision problem. Rather than the agent determining a single action, we introduce a sequence of actions $a_1, a_2, \ldots$, and also a sequence of observations $o_1, o_2, \ldots$, so that $a_i = \pi(o_i)$. An observation $o_i$ could include a logical inductor state which has been run for $i$ steps, plus sensory observations (although we could and probably should fold the sensory observations into the inductive state). Even this case is difficult to analyse. We need to set aside the question of self-modification for simple code-optimization reasons, and situations where Omega offers a large reward for self-modification. Could there be cases where the policy initially selected would want to replace itself with a different policy? It seems likely that some kind of temporal consistency result could be proved, because any policy which would want to replace itself with a similarly fast-running policy could instead look at the logical inductor’s beliefs about what that policy would do, and do that. (Note that this result would likely need to make use of the resource bound on policies, which I mentioned in the beginning as being intuitively appealing but not important to anything I show in this post.)
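
A sketch of this sequential version, in which one up-front policy is applied to every step’s observation; `run_inductor`, `sense`, and the policy interface are assumptions for illustration only.

```python
def run_sequential_agent(policy, horizon, run_inductor, sense):
    """Apply a single up-front policy to a whole sequential decision problem.
    The i-th observation bundles the logical inductor state after i steps
    with the sensory data received so far."""
    actions = []
    for i in range(1, horizon + 1):
        observation = (run_inductor(i), sense(i))  # o_i
        actions.append(policy(observation))        # a_i = policy(o_i)
    return actions
```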

This might not be very satisfying, though, because it may not represent a solution to the problem of allowing a system to think longer about what to do while remaining reflectively consistent. The policy is locked-in. The policy may be clever enough to think about what other policies might do better and strategy-steal from them, but it might not—one would want an analysis of how it decides to do that, to see that it’s reasonable. It would be more compelling to have something that could keep thinking longer about what policy to use, and still have some kind of reflective consistency. This seems unlikely to fall out of the approach.

Counterfactuals

(Edited to add this section.)

The problem of counterfactual reasoning seems almost impossible to solve, because our intuitions about “what would happen if only the agent did X” largely come from what we can see standing outside of the decision problem. From our perspective, a decision problem actually does have a functional form: we can substitute in different agents to see what happens. We can design a solution. From an agent’s perspective, situated inside a decision problem, it does not have a functional form; things are just the way they are, so it makes significantly less sense to talk about how things could be if only the agent acted differently.

From that perspective, CDT-like proposals to solve decision theory problems with revised notions of counterfactual feel fake: using the word “counterfactual” gives you enough freedom to claim just about anything, so that you can pack what you think the agent should be thinking into the counterfactuals, and have some false hope that one day you’ll figure out how to formally specify counterfactuals which have all the properties you claim.

The policy-selection perspective is that EDT-like conditioning is all you’ll have, but we can in fact get the “functional form” of the problem mostly as we want it by backing up to an early state and reasoning about what to do from there. From this perspective, there is not one correct “functional form” to be discovered and embodied in the right theory of counterfactuals. Rather, there’s a series of functional forms which you get by viewing the problem from different levels of processing power. Unfortunately, none of these have a distinguishing character of “correctness”. Nonetheless, certain stages do seem to have most of the properties we would intuitively want of a notion of counterfactual.

Conclusion

Policy selection seems to solve the decision problems which call for updateless reasoning in a direct way, so long as $f$ is chosen well. However, it does fail trickier problems like Troll Bridge, and it doesn’t seem to provide as much insight into reflective consistency as we would want.

It’s still worth thinking more about how much reflective consistency can be gotten—the story is far from clear to me at the moment, and it could be that nice results actually can come from it.

I’m also interested in thinking about what the notion of “carefully chosen $f$” actually means. Can anything formal be said about where the sweet spot is? Is there a principled way to avoid the chaos of a too-early market state while also steering clear of knowledge we need to be updateless toward? It would be surprising if there were, but it would also be surprising if there were nothing formal to say about it.

I’m more excited about approaches which somehow break the concepts used here.