Counterfactuals for Perfect Predictors

Parfit’s Hitchhiker with a perfect predictor has the unusual property of having a Less Wrong consensus that you ought to pay, whilst also being surprisingly hard to define formally. For example, if we try to ask about whether an agent that never pays in town is rational, then we encounter a contradiction. A perfect predictor would never give such an agent a lift, so by the Principle of Explosion we can prove any statement to be true given this counterfactual.

On the other hand, even if the predictor mistakenly picks up defectors only 0.01% of the time, then this counterfactual seems to have meaning. Let’s suppose that a random number from 1 to 10,000 is chosen and the predictor always picks you up when the number is 1 and is perfect otherwise. Even if we draw the number 120, we can fairly easily imagine the situation where the number drawn was 1 instead. This is then a coherent situation where an Always Defect agent would end up in town, so we can talk about how the agent would have counterfactually chosen.

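As a minimal sketch, this error model can be written out directly. The function and parameter names here are my own, chosen for illustration; `draw` stands for the random number from 1 to 10,000 described above:

```python
def imperfect_driver(pays_in_town, draw):
    """Driver with a 1-in-10,000 error rate.

    `draw` is the random number from 1 to 10,000; the driver errs
    exactly when it comes up 1. `pays_in_town` says whether the
    hitchhiker would pay on arrival."""
    if draw == 1:
        return "Town"  # the rare mistake: even a defector gets a lift
    return "Town" if pays_in_town else "Desert"
```

Because the driver sometimes errs, `imperfect_driver(False, 1)` returns `"Town"`: there is a coherent world in which an Always Defect agent ends up in town, so the in-town counterfactual is well-defined.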
So one response to the difficulties of discussing counterfactual decisions with perfect predictors would be to simply compute the counterfactual as though the predictor has a (tiny) chance of being wrong. However, agents may quite understandably wish to act differently depending on whether they are facing a perfect or an imperfect predictor, even choosing differently against a predictor with a very low error rate.

Another response would be to say that the predictor predicts whether placing the agent in town is logically coherent. Since the driver only picks up those who it predicts (with 100% accuracy) will pay, the driver can assume that it will be paid if the situation is coherent. Unfortunately, it isn’t clear what it means in concrete terms for an agent to be such that it couldn’t coherently be placed in such a situation. How is, “I commit to not paying in <impossible situation>”, any kind of meaningful commitment at all? We could look at, “I commit to making <situation> impossible”, but that doesn’t mean anything either: if you’re in a situation, then it must be possible. Further, such situations are contradictory, and everything is true given a contradiction, so all contradictory situations seem to be the same.

As the formal description of my solution is rather long, I’ll provide a summary: we will assume that each possible world model corresponds to at least one possible sequence of observations. For world models that are consistent conditional on the agent making certain decisions, we’ll take the observations seen by the agents for whom the world is consistent and feed them to the agents for whom it isn’t. This will be interpreted as what those agents would have counterfactually chosen in such a situation.

A Formal Description of the Problem

(You may wish to skip directly to the discussion)

My solution will be to include observations in our model of the counterfactual. Most such problems can be modelled as follows:

Let x be a label that refers to one particular agent that will be called the centered agent for short. It should generally refer to the agent whose decisions we are optimising. In Parfit’s Hitchhiker, x refers to the Hitchhiker.

Let W be a set of possible “world models with holes”. That is, each is a collection of facts about the world, but not including facts about the decision processes of x, which should exist as an agent in this world. These will include the problem statement.

To demonstrate, we’ll construct W for this problem. We start off by defining the variables:

  • t: Time

    • 0 when you encounter the Driver

    • 1 after you’ve either been dropped off in Town or left in the Desert

  • l: Location. Either Desert or Town

  • Act: The actual action chosen by the hitchhiker if they are in Town at t=1. Either Pay or Don’t Pay or Not in Town

  • Pred: The driver’s prediction of x’s action if the driver were to drop them in town. Either Pay or Don’t Pay (as we’ve already noted, defining this counterfactual is problematic, but we’ll provide a correction later)

  • u: Utility of the hitchhiker

We can now provide the problem statement as a list of facts:

  • Time: t is a time variable

  • Location:

    • l=Desert at t=0

    • l=Town at t=1 if Pred=Pay

    • l=Desert at t=1 if Pred=Don’t Pay

  • Act:

    • Not in Town at t=0

    • Not in Town if l=Desert at t=1

    • Pay or Don’t Pay if l=Town at t=1

  • Prediction: The Predictor is perfect. A more formal definition will have to wait

  • Utility:

    • u=0 at t=0

    • At t=1: Subtract 1,000,000 from u if l=Desert

    • At t=1: Subtract 50 from u if Act=Pay

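The location and utility rules above can be rendered as a small sketch, assuming string-valued variables. The encoding is illustrative, not part of the formal model:

```python
def location(t, pred):
    """l as a function of time t and the driver's prediction Pred."""
    if t == 0:
        return "Desert"
    # At t=1, the driver only drops agents predicted to pay in town.
    return "Town" if pred == "Pay" else "Desert"

def utility(t, l, act):
    """u at time t, given the location and the hitchhiker's action."""
    u = 0
    if t == 1:
        if l == "Desert":
            u -= 1_000_000
        if act == "Pay":
            u -= 50
    return u
```

For example, `utility(1, "Town", "Pay")` gives -50 and `utility(1, "Desert", "Not in Town")` gives -1,000,000, matching the facts listed above.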
W then contains three distinct world models:

  • Starting World Model - w1:

    • t=0, l=Desert, Act=Not in Town, Pred: varies, u=0

  • Ending Town World Model - w2:

    • t=1, l=Town, Act: varies, Pred: Pay, u: varies

  • Ending Desert World Model - w3:

    • t=1, l=Desert, Act: Not in Town, Pred: Don’t Pay, u=-1,000,000

The properties listed as varies will only be known once we have information about x. Further, it is impossible for certain agents to exist in certain worlds given the rules above.

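A minimal sketch of these three world models as plain records, with `None` standing in for the properties listed as varies (the field names are mine):

```python
# Incomplete world models: None marks a "hole" that is only filled in
# once we know which agent x is.
w1 = {"t": 0, "l": "Desert", "act": "Not in Town", "pred": None,        "u": 0}
w2 = {"t": 1, "l": "Town",   "act": None,          "pred": "Pay",       "u": None}
w3 = {"t": 1, "l": "Desert", "act": "Not in Town", "pred": "Don't Pay", "u": -1_000_000}
```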
Let O be a set of possible sequences of observations. It should be chosen to contain all observations that could be made by the centered agent in the given problem, and there should be at least one sequence of observations representing each possible world model with holes. We will do something slightly unusual and include the problem statement as a set of observations. One intuition that might help illustrate this is to imagine that the agent has an oracle that allows it to directly learn these facts.

For this example, the possible individual observations grouped by type are:

  • Location Events: <l=Desert> OR <l=Town>

  • Time Events: <t=0> OR <t=1>

  • Problem Statement: There should be an entry for each point in the problem statement as described above. For example:

    • <l=Desert at t=0>

O then contains three distinct observation sequences:

  • Starting World Model - o1:

    • <Problem Statement> <t=0> <l=Desert>

  • Ending Town World Model - o2:

    • <Problem Statement> <t=0> <l=Desert> <t=1> <l=Town>

  • Ending Desert World Model - o3:

    • <Problem Statement> <t=0> <l=Desert> <t=1> <l=Desert>

Of course, <t=0><l=Desert> is observed initially in each world, so we could just remove it to provide simplified sequences of observations. I simply write <Problem Statement> instead of explicitly listing each item as an observation.

Regardless of its decision algorithm, we will associate x with a fixed Fact-Derivation Algorithm f. This algorithm will take a specific sequence of observations o and produce an id representing a world model with holes w. The reason why it produces an id is that some sequences of observations won’t lead to a coherent world model for some agents. For example, the Ending Town sequence of observations can never be observed by an agent that never pays. To handle this, we will assume that each incomplete world model w is associated with a unique integer [w]. In this case, we might logically choose [w1]=1, [w2]=2, [w3]=3 and then f(o1)=[w1], f(o2)=[w2], f(o3)=[w3]. We will define m to map from these ids to the corresponding incomplete world model.

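As a sketch, with ids chosen to match [w1]=1, [w2]=2, [w3]=3, and observation sequences abbreviated to drop the problem statement and the shared <t=0><l=Desert> prefix, f and m might look like:

```python
# Abbreviated observation sequences (shared prefix omitted).
o1 = ()
o2 = ("<t=1>", "<l=Town>")
o3 = ("<t=1>", "<l=Desert>")

def f(o):
    """Fact-derivation: observation sequence -> world-model id.

    Note that f assigns an id to every recognised sequence, including
    sequences that a given agent could never actually observe."""
    return {o1: 1, o2: 2, o3: 3}[o]

def m(world_id):
    """Map an id back to (a label for) the incomplete world model."""
    return {1: "w1", 2: "w2", 3: "w3"}[world_id]
```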
We will write D for the set of possible decision algorithms that x might possess. Instead of having these algorithms operate on either observations or world models, we will make them operate on the world ids that are produced by the Fact-Derivation Algorithm, so that they still produce actions in contradictory worlds. For example, define:

  • d2 - Always Pay

  • d3 - Never Pay

If d2 receives f(o3) or d3 receives f(o2), then it knows that this is impossible according to its model. However, it isn’t actually impossible, as its model could be wrong. Further, these “impossible” pre-commitments now mean something tangible: the agent has pre-committed to act a certain way if it experiences a particular sequence of observations.

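Sketching d2 and d3 as total functions on world ids (numbered as above: 1 for w1, 2 for w2, 3 for w3) makes the point concrete: each returns an action even on an input its own model says is impossible.

```python
def d2(world_id):
    """Always Pay: pays whenever the id says it is in town at t=1."""
    return "Pay" if world_id == 2 else "Not in Town"

def d3(world_id):
    """Never Pay. Id 2 (in town) is impossible by d3's own model with a
    perfect predictor, yet d3 still answers for it; that answer is the
    tangible content of its pre-commitment."""
    return "Don't Pay" if world_id == 2 else "Not in Town"
```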
We can now formalise the Driver’s Prediction as follows for situations that are only conditionally consistent (we noted before that this needed to be corrected). Let o be the sequence of observations, let d0 be a decision algorithm that is consistent with o, and let d1 be a decision algorithm that is inconsistent with it. Let w=m(f(o)), which is a consistent world given d0. Then the counterfactual of what d1 would do in w is defined as d1(f(o)). We’ve now defined what it means to be a “perfect predictor”. There is, however, one potential issue: perhaps multiple observation sequences lead to w? In this case, we need to define the world more precisely and include observational details in the model. Even if these details don’t seem to change the problem from a standard decision theory perspective, they may still affect the predictions of actions in impossible counterfactuals.

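Putting the pieces together, here is a sketch of the perfect predictor: it evaluates the agent on the input f(o) that the in-town observations would produce, which is well-defined even when the situation itself is impossible for that agent. The names and encodings are mine:

```python
O_TOWN = ("<t=1>", "<l=Town>")  # abbreviated in-town observation sequence

def f(o):
    """Fact-derivation: observation sequence -> world-model id."""
    return {(): 1, ("<t=1>", "<l=Town>"): 2, ("<t=1>", "<l=Desert>"): 3}[o]

def predict(d):
    """The driver's counterfactual: what d would do given the in-town
    input, i.e. d(f(o)), defined even for inconsistent agents."""
    return d(f(O_TOWN))

def drive_to(d):
    """The driver gives a lift exactly to agents predicted to pay."""
    return "Town" if predict(d) == "Pay" else "Desert"

always_pay = lambda wid: "Pay" if wid == 2 else "Not in Town"
never_pay = lambda wid: "Don't Pay" if wid == 2 else "Not in Town"
```

Here `drive_to(always_pay)` returns `"Town"` and `drive_to(never_pay)` returns `"Desert"`: the prediction about the Never Pay agent is meaningful even though that agent can never coherently find itself in town.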

In most decision theory problems, it is easier to avoid discussing observations any more than necessary. Generally, the agent makes some observations, but its knowledge of the rest of the setup is simply assumed. This abstraction generally works well, but it leads to confusion in cases like this, where we are dealing with predictors who want to know if they can coherently put another agent in a specific situation. As we’ve shown, even though it is meaningless to ask what an agent would do given an impossible situation, it is meaningful to ask what the agent would do given an impossible input.

When asking what any real agent would do in a real world problem, we can always restate it as asking what the agent would do given a particular input. However, the trick of separating out observations doesn’t limit us to real world problems; as we’ve seen, we can represent the problem statement as direct observations in order to handle more abstract problems. The next logical step is to try extending this to cases such as, “What if the 1000th digit of Pi were even?” This allows us to avoid the contradiction and deal with situations that are at least consistent, but it doesn’t provide much in the way of hints about how to solve these problems in general. Nonetheless, I figured that I may as well start with the one problem that was the most straightforward.

Update: After rereading the description of Updateless Decision Theory, I realise that it is already using something very similar to the technique described here. So the main contribution of this article seems to be exploring a part of UDT that is normally not examined in much detail.

One difference, though, is that UDT uses a Mathematical Intuition Function that maps from inputs to a probability distribution over execution histories, instead of a Fact-Derivation Algorithm that maps from inputs to models, and only for consistent situations. One advantage of breaking it down as I do is to clarify that UDT’s observation-action maps don’t only include entries for possible observations, but also for observations that it would be contradictory for an agent to make. Secondly, it clarifies that UDT predictors predict agents based on how they respond to inputs representing situations, rather than directly on situations themselves, which is important for impossible situations.