# Counterfactuals for Perfect Predictors

Parfit’s Hitchhiker with a perfect predictor has the unusual property of having a Less Wrong consensus that you ought to pay, whilst also being surprisingly hard to define formally. For example, if we try to ask about whether an agent that never pays in town is rational, then we encounter a contradiction. A perfect predictor would not ever give such an agent a lift, so by the Principle of Explosion we can prove any statement to be true given this counterfactual.

On the other hand, even if the predictor mistakenly picks up defectors only 0.01% of the time, then this counterfactual seems to have meaning. Let’s suppose that a random number from 1 to 10,000 is chosen and the predictor always picks you up when the number is 1 and is perfect otherwise. Even if we draw the number 120, we can fairly easily imagine the situation where the number drawn was 1 instead. This is then a coherent situation where an Always Defect agent would end up in town, so we can talk about how the agent would have counterfactually chosen.
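This imperfect predictor is easy to simulate (a minimal Python sketch; the function name and the Monte Carlo check are my own illustration, not part of the problem statement):

```python
import random

def predictor_gives_lift(agent_pays: bool) -> bool:
    """A number from 1 to 10,000 is drawn; on a 1 the driver picks
    you up regardless, and otherwise the prediction is perfect."""
    if random.randint(1, 10_000) == 1:
        return True
    return agent_pays

# Even an Always-Defect agent occasionally ends up in town, so the
# counterfactual "what would it do in town?" describes a coherent world.
random.seed(0)
trials = 1_000_000
in_town = sum(predictor_gives_lift(agent_pays=False) for _ in range(trials))
print(in_town / trials)  # roughly 1 in 10,000
```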

So one response to the difficulties of discussing counterfactual decisions with perfect predictors would be to simply compute the counterfactual as though the agent has a (tiny) chance of being wrong. However, agents may quite understandably wish to act differently depending on whether they are facing a perfect or imperfect predictor, even choosing differently when facing an agent with a very low error rate.

Another would be to say that the predictor predicts whether placing the agent in town is logically coherent. Since the driver only picks up those who it predicts (with 100% accuracy) will pay, it can assume that it will be paid whenever the situation is coherent. Unfortunately, it isn’t clear what it means in concrete terms for an agent to be such that it couldn’t coherently be placed in such a situation. How is, “I commit to not paying in <impossible situation>” any kind of meaningful commitment at all? We could look at, “I commit to making <situation> impossible”, but that doesn’t mean anything either. If you’re in a situation, then it must be possible? Further, such situations are contradictory and everything is true given a contradiction, so all contradictory situations seem to be the same.

As the formal description of my solution is rather long, I’ll provide a summary: we will assume that each possible world model corresponds to at least one possible sequence of observations. For world models that are consistent only conditional on the agent making certain decisions, we’ll take the sequence of observations that a consistent agent would make and feed it to the agents who are inconsistent with that world. Their response will be interpreted as what they would have counterfactually chosen in such a situation.

## A Formal Description of the Problem

(You may wish to skip directly to the discussion)

My solution will be to include observations in our model of the counterfactual. Most such problems can be modelled as follows:

Let x be a label that refers to one particular agent that will be called the centered agent for short. It should generally refer to the agent whose decisions we are optimising. In Parfit’s Hitchhiker, x refers to the Hitchhiker.

Let W be a set of possible “world models with holes”. That is, each is a collection of facts about the world, but not including facts about the decision processes of x which should exist as an agent in this world. These will include the problem statement.

To demonstrate, we’ll construct the problem statement, which we’ll call I, for this problem. We start off by defining the variables:

• t: Time

    • 0 when you encounter the Driver

    • 1 after you’ve either been dropped off in Town or left in the Desert

• l: Location. Either Desert or Town

• Act: The actual action chosen by the hitchhiker if they are in Town at t=1. Either Pay, Don’t Pay or Not in Town

• Pred: The driver’s prediction of x’s action if the driver were to drop them in town. Either Pay or Don’t Pay (as we’ve already noted, defining this counterfactual is problematic, but we’ll provide a correction later)

• u: Utility of the hitchhiker

We can now provide the problem statement as a list of facts:

• Time: t is a time variable

• Location:

    • l=Desert at t=0

    • l=Town at t=1 if Pred=Pay

    • l=Desert at t=1 if Pred=Don’t Pay

• Act:

    • Not in Town at t=0

    • Not in Town if l=Desert at t=1

    • Pay or Don’t Pay if l=Town at t=1

• Prediction: The Predictor is perfect. A more formal definition will have to wait

• Utility:

    • u=0 at t=0

    • At t=1: Subtract 1,000,000 from u if l=Desert

    • At t=1: Subtract 50 from u if Act=Pay
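Under these rules, the t=1 facts follow mechanically from Pred and Act. As a sanity check, the list above can be encoded as a small transition function (a sketch; the function name and dict representation are my own):

```python
def world_at_t1(pred: str, act: str) -> dict:
    """Apply the problem-statement rules to get the t=1 world.

    pred: the driver's prediction, "Pay" or "Don't Pay"
    act:  the hitchhiker's action if in town, "Pay" or "Don't Pay"
    """
    location = "Town" if pred == "Pay" else "Desert"
    if location == "Desert":
        act = "Not in Town"  # no decision to make in the desert
    u = 0
    if location == "Desert":
        u -= 1_000_000
    if act == "Pay":
        u -= 50
    return {"t": 1, "l": location, "Act": act, "Pred": pred, "u": u}

world_at_t1("Pay", "Pay")        # the Ending Town world with u = -50
world_at_t1("Don't Pay", "Pay")  # the Ending Desert world, u = -1,000,000
```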

W then contains three distinct world models:

• Starting World Model—w1:

    • t=0, l=Desert, Act=Not in Town, Pred: varies, u=0

• Ending Town World Model—w2:

    • t=1, l=Town, Act: varies, Pred: Pay, u: varies

• Ending Desert World Model—w3:

    • t=1, l=Desert, Act: Not in Town, Pred: Don’t Pay, u=-1,000,000

The properties listed as varies will only be known once we have information about x. Further, it is impossible for certain agents to exist in certain worlds given the rules above.

Let O be a set of possible sequences of observations. It should be chosen to contain all observations that could be made by the centered agent in the given problem and there should be at least one set of observations representing each possible world model with holes. We will do something slightly unusual and include the problem statement as a set of observations. One intuition that might help illustrate this is to imagine that the agent has an oracle that allows it to directly learn these facts.

For this example, the possible individual observations grouped by type are:

• Location Events: <l=Desert> OR <l=Town>

• Time Events: <t=0> OR <t=1>

• Problem Statement: There should be an entry for each point in the problem statement as described for I. For example:

    • <l=Desert at t=0>

O then contains three distinct observation sequences:

• Starting World Model—o1:

    • <Problem Statement> <t=0> <l=Desert>

• Ending Town World Model—o2:

    • <Problem Statement> <t=0> <l=Desert> <t=1> <l=Town>

• Ending Desert World Model—o3:

    • <Problem Statement> <t=0> <l=Desert> <t=1> <l=Desert>

Of course, <t=0><l=Desert> is observed initially in each world so we could just remove it to provide simplified sequences of observations. I simply write <Problem Statement> instead of explicitly listing each item as an observation.

Regardless of its decision algorithm, we will associate x with a fixed Fact-Derivation Algorithm f. This algorithm takes a specific sequence of observations o and produces an id representing a world model with holes w. The reason why it produces an id is that some sequences of observations won’t lead to a coherent world model for some agents. For example, the Ending in Town sequence of observations can never be observed by an agent that never pays. To handle this, we will assume that each incomplete world model w is associated with a unique integer [w]. In this case, we might logically choose [w1]=1, [w2]=2, [w3]=3 and then f(o1)=[w1], f(o2)=[w2], f(o3)=[w3]. We will define m to map from these ids to the corresponding incomplete world model.
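A minimal sketch of how f and m might look for this example, using the numbering [w1]=1, [w2]=2, [w3]=3 chosen above (the concrete encoding of observation sequences as tuples of strings is my own illustration):

```python
# Observation sequences as in the text, with <Problem Statement>
# standing in for the full list of problem-statement observations.
PROBLEM = ("<Problem Statement>",)

o1 = PROBLEM + ("<t=0>", "<l=Desert>")
o2 = o1 + ("<t=1>", "<l=Town>")
o3 = o1 + ("<t=1>", "<l=Desert>")

def f(o: tuple) -> int:
    """Fact-Derivation Algorithm: map an observation sequence to the
    id of a world model with holes."""
    if o == o1:
        return 1  # [w1]
    if o == o2:
        return 2  # [w2]
    return 3      # [w3]

def m(world_id: int) -> dict:
    """Map an id back to the corresponding incomplete world model."""
    return {
        1: {"t": 0, "l": "Desert", "Act": "Not in Town", "u": 0},
        2: {"t": 1, "l": "Town", "Pred": "Pay"},
        3: {"t": 1, "l": "Desert", "Act": "Not in Town",
            "Pred": "Don't Pay", "u": -1_000_000},
    }[world_id]
```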

We will write D for the set of possible decision algorithms that x might possess. Instead of having these algorithms operate on either observations or world models, we will make them operate on the world ids produced by the Fact-Derivation Algorithm, so that they still produce actions in contradictory worlds. For example, define:

• d2 - Always Pay

• d3 - Never Pay

If d2 sees f(o3) or d3 sees f(o2), then it knows that this is impossible according to its model. However, it isn’t actually impossible, as its model could be wrong. Further, these “impossible” pre-commitments now mean something tangible: the agent has pre-committed to act a certain way if it experiences a particular sequence of observations.

We can now formalise the Driver’s Prediction for situations that are only conditionally consistent (we noted before that this needed a correction). Let o be the sequence of observations, let d0 be a decision algorithm that is consistent with o, and let d1 be a decision algorithm that is inconsistent with it. Let w=m(f(o)), which is a consistent world given d0. Then the counterfactual of what d1 would do in w is defined as d1(f(o)). We’ve now defined what it means to be a “perfect predictor”. There is, however, one potential issue: perhaps multiple observation sequences led to w? In this case, we need to define the world more precisely and include observational details in the model. Even if these details don’t seem to change the problem from a standard decision theory perspective, they may still affect the predictions of actions in impossible counterfactuals.
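Putting the pieces together, the predictor’s counterfactual can be sketched as below. The point is that d3 returns an action for the Ending-in-Town id even though, against a perfect predictor, it could never actually make those observations (the encoding is my own illustration):

```python
O2 = ("<Problem Statement>", "<t=0>", "<l=Desert>", "<t=1>", "<l=Town>")

def f(o: tuple) -> int:
    """Fact-Derivation: for this sketch we only need to distinguish
    the Ending-in-Town sequence (id 2, i.e. [w2]) from anything else."""
    return 2 if o == O2 else 1

def d2(world_id: int) -> str:
    """Always Pay."""
    return "Pay" if world_id == 2 else "Not in Town"

def d3(world_id: int) -> str:
    """Never Pay."""
    return "Don't Pay" if world_id == 2 else "Not in Town"

def predict(decision_algorithm, f, o) -> str:
    """The driver's counterfactual d1(f(o)): what the algorithm does
    given the *input* representing the situation, not the situation."""
    return decision_algorithm(f(o))

print(predict(d3, f, O2))  # Don't Pay
```

Here predict(d3, f, O2) is well defined even though d3 can never observe O2, which is exactly the workaround the post proposes.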

## Discussion

In most decision theory problems, it is easier to avoid discussing observations any more than necessary. Generally, the agent makes some observations, but its knowledge of the rest of the setup is simply assumed. This abstraction generally works well, but it leads to confusion in cases like this, where we are dealing with predictors who want to know whether they can coherently put another agent in a specific situation. As we’ve shown, even though it is meaningless to ask what an agent would do given an impossible situation, it is meaningful to ask what the agent would do given an impossible input.

When asking what any real agent would do in a real world problem, we can always restate it as asking what the agent would do given a particular input. However, separating out observations doesn’t limit us to real world problems; as we’ve seen, we can use the trick of representing the problem statement as direct observations to handle more abstract problems. The next logical step is to try extending this to cases such as, “What if the 1000th digit of Pi were even?” The current approach allows us to avoid the contradiction and deal with situations that are at least consistent, but it doesn’t provide much in the way of hints for how to solve these problems in general. Nonetheless, I figured that I may as well start with the one problem that was the most straightforward.

Update: After rereading the description of Updateless Decision Theory, I realise that it already uses something very similar to the technique described here. So the main contribution of this article seems to be exploring a part of UDT that is normally not examined in much detail.

One difference, though, is that UDT uses a Mathematical Intuition Function that maps from inputs to a probability distribution over execution histories, instead of a Fact-Derivation Algorithm that maps from inputs to models, and only for consistent situations. One advantage of breaking it down as I do is to clarify that UDT’s observation-action maps don’t only include entries for possible observations, but also for observations that it would be contradictory for an agent to make. Secondly, it clarifies that UDT predictors predict agents based on how they respond to inputs representing situations, rather than directly on situations themselves, which is important for impossible situations.

• Turns out Cousin_it actually discussed this problem many years before me. He points out that when the situation is inconsistent, we can run into issues with spurious counterfactuals.

• If you’re in a situation, then it must be possible?

There is a sense in which you can’t conclude this. For a given notion of reasoning about potentially impossible situations, you can reason about such situations that contain agents, and you can see how these agents in the impossible situations think. If the situation doesn’t tell the agent whether it’s possible or impossible (say, through observations), the agent inside won’t be able to tell if it’s an impossible situation. Always concluding that the present situation is possible will result in error in the impossible situations (so it’s even worse than being unjustified). Errors in impossible situations may matter if something that matters depends on how you reason in impossible situations (for example, a “predictor” in a possible situation that asks what you would do in impossible situations).

We could look at, “I commit to making <situation> impossible”, but that doesn’t mean anything either.

A useful sense of an “impossible situation” won’t make it impossible to reason about. There’s probably something wrong with it, but not to the extent that it can’t be considered. Maybe it falls apart if you look too closely, or maybe it has no moral worth and so should be discarded from decision making. But even in these cases it might be instrumentally valuable, because something in morally relevant worlds depends on this kind of reasoning. You might not approve of this kind of reasoning and call it meaningless, but other things in the world can perform it regardless of your judgement, and it’s useful to understand how that happens to be able to control them.

Finally, some notions of “impossible situation” will say that a “situation” is possible/impossible depending on what happens inside it, and there may be agents inside it. In that case, their decisions may affect whether a given situation is considered “possible” or “impossible”, and if these agents are familiar with this notion they can aim to make a given situation they find themselves in possible or impossible.

• “There is a sense in which you can’t conclude this”—Well this paragraph is pretty much an informal description of how my technique works. Only I differentiate between world models and representations of world models. Agents can’t operate on incoherent world models, but they can operate on representations of world models that are incoherent for this agent. It’s also the reason why I separated out observations from models.

“In that case, their decisions may affect whether a given situation is considered “possible” or “impossible”, and if these agents are familiar with this notion they can aim to make a given situation they find themselves in possible or impossible”—My answer to this question is that it is meaningless to ask what an agent does given an impossible situation, but meaningful to ask what it does given an impossible input (which ultimately represents an impossible situation).

I get the impression that you didn’t quite grasp the general point of this post. I suspect the reason is that the formal description is less skippable than I originally thought.

• I was replying specifically to those remarks, on their use of terminology, not to the thesis of the post. I disagree with the framing of “impossible situations” and “meaningless” for the reasons I described. I think it’s useful to let these words (in the context of decision theory) take default meaning that makes the statements I quoted misleading.

My answer to this question is that it is meaningless to ask what an agent does given an impossible situation, but meaningful to ask what it does given an impossible input (which ultimately represents an impossible situation).

That’s the thing: if this “impossible input” represents an “impossible situation”, and it’s possible to ask what happens for this input, that gives a way of reasoning about the “impossible situation”, in which case it’s misleading to say that “it is meaningless to ask what an agent does given an impossible situation”. I of course agree that you can make a technical distinction, but even then it’s not clear what you mean by calling an idea “meaningless” when you immediately proceed to give a way of reasoning about (a technical reformulation of) that idea.

If an idea is confused in some way, even significantly, that shouldn’t be enough to declare it “meaningless”. Perhaps “hopelessly confused” and “useless”, but not yet “meaningless”. Unless you are talking about a more specific sense of “meaning”, which you didn’t stipulate. My guess is that by “meaningless” you meant that you don’t see how it could ever be made clear in its original form, or that in the context of this post it’s not at all clear compared to the idea of “impossible input” that’s actually clarified. But that’s an unusual sense for that word.

• I guess I saw those mainly as framing remarks, so I may have been less careful with my language than elsewhere. Maybe “meaningless” is a strong word, but I only meant it in a specific way that I hoped was clear enough from context.

I was using situations to refer to objects where the equivalence function is logical equivalence, whilst I was using representations to refer to objects where the equivalence function is the specific formulation. My point was that all impossible situations are logically equivalent, so asking what an agent does in this situation is of limited use. An agent that operates directly on such impossible situations can only have one such response to these situations, even across multiple problems. On the other hand, representations don’t have this limitation.

• My point was that all impossible situations are logically equivalent

Yes, the way you are formulating this, as a theory that includes claims about agent’s action or other counterfactual things together with things from the original setting that contradict them such as agent’s program. It’s also very natural to excise parts of a situation (just as you do in the post) and replace them with the alternatives you are considering. It’s what happens with causal surgery.

An agent that operates directly on such impossible situations can only have one such response to these situations, even across multiple problems.

If it respects equivalence of theories (which is in general impossible to decide) and doesn’t know where the theories came from, so that this essential data is somehow lost before that point. I think it’s useful to split this process into two phases, where first the agent looks for itself in the worlds it cares about, and only then considers the consequences of alternative actions. The first phase gives a world that has all discovered instances of the agent excised from it (a “dependence” of world on agent), so that on the second phase we can plug in alternative actions (or strategies, maps from observations to actions, as the type of the excised agent will be something like exponential if the agent expects input).

At that point the difficulty is mostly on the first phase, formulation of dependence. (By the way, in this view there is no problem with perfect predictors, since they are just equivalent to the agent and become one of the locations where the agent finds itself, no different from any other. It’s the imperfect predictors, such as too-weak predictors of Agent-Simulates-Predictor (ASP) or other such things that cause trouble.) The main difficulty here is spurious dependencies, since in principle the agent is equivalent to their actual action, and so conversely the value of their actual action found somewhere in the world is equivalent to the agent. So the agent finds itself behind all answers “No” in the world (uttered by anyone and anything) if it turns out that their actual action is “No” etc., and the consequences of answering “Yes” then involve changing all answers “No” to “Yes” everywhere in the world. (When running the search, the agent won’t actually encounter spurious dependencies under certain circumstances, but that’s a bit flimsy.)

This shows that even equivalence of programs is too strong when searching for yourself in the world, or at least the proof of equivalence shouldn’t be irrelevant in the resulting dependence. So this framing doesn’t actually help with logical counterfactuals, but at least the second phase where we consider alternative actions is spared the trouble, if we somehow manage to find useful dependencies.

• “By the way, in this view there is no problem with perfect predictors, since they are just equivalent to the agent and become one of the locations where the agent finds itself”—Well, this still runs into issues, as the simulated agent encounters an impossible situation, so aren’t we still required to use the workaround (or another workaround if you’ve got one)?

“This shows that even equivalence of programs is too strong when searching for yourself in the world, or at least the proof of equivalence shouldn’t be irrelevant in the resulting dependence”—Hmm, agents may take multiple actions in a decision problem. So aren’t agents only equivalent to programs that take the same action in each situation? Anyway, I was talking about equivalence of worlds, not of agents, but this is still an interesting point that I need to think through. (Further, are you saying that agents should only be considered to have their behaviour linked to agents they are provably equivalent to, instead of all agents they are equivalent to?)

“A useful sense of an “impossible situation” won’t make it impossible to reason about”—That’s true. My first thought was to consider how the program represents its model of the world and imagine running the program with impossible world model representations. However, the nice thing about modelling the inputs and treating model representations as integers rather than specific structures is that it allows us to abstract away from these kinds of internal details. Is there a specific reason why you might want to avoid this abstraction?

UPDATE: I just re-read your comment and found that I significantly misunderstood it, so I’ve made some large edits to this comment. I’m still not completely sure that I understand what you were driving at.

• Well, this still runs into issues as the simulated agent encounters an impossible situation

The simulated agent, together with the original agent, are removed from the world to form a dependence, which is a world with holes (free variables). If we substitute the agent term for the variables in the dependence, the result is equivalent (not necessarily syntactically equal) to the world term as originally given. To test a possible action, this possible action is substituted for the variables in the dependence. The resulting term no longer includes instances of the agent, instead it includes an action, so there is no contradiction.

Hmm, agents may take multiple actions in a decision problem. So aren’t agents only equivalent to programs that take the same action in each situation?

A protocol for interacting with environment can be expressed with the type of decision. So if an agent makes an action of type A depending on an observation of type O, we can instead consider (O->A) as the type of its decision, so that the only thing that it needs to do is produce a decision in this way, with interaction being something that happens to the decision and not the agent.

Requiring that only programs completely equivalent to the agent are to be considered its instances may seem too strong, and it probably is, but the problem is that it’s also not strong enough, because even with this requirement there are spurious dependencies that say that an agent is equivalent to a piece of paper that happens to contain a decision that coincides with agent’s own. So it’s a good simplification for focusing on logical counterfactuals (in the logical direction, which I believe is less hopeless than finding answers in probability).

Further, are you saying that agents should only be considered to have their behaviour linked to agents they are provably equivalent [to] instead of all agents they are equivalent to?

Not sure what the distinction you are making is. How would you define equivalence? By equivalence I meant equivalence of lambda terms, where one can be rewritten into the other with a sequence of alpha, reduction and expansion rules, or something like that. It’s judgemental/computational/reductional equality of type theory, as opposed to propositional equality, which can be weaker, but since judgemental equality is already too weak, it’s probably the wrong place to look for an improvement.

• The simulated agent, together with the original agent, are removed from the world to form a dependence, which is a world with holes (free variables)

I’m still having difficulty understanding the process that you’re following, but let’s see if I can correctly guess this. Firstly you make a list of all potential situations that an agent may experience or for which an agent may be simulated. Decisions are included in this list, even if they might be incoherent for particular agents. In this example, these are:

• Actual_Decision → Co-operate/Defect

• Simulated_Decision → Co-operate/Defect

We then group all necessarily linked decisions together:

• (Actual_Decision, Simulated_Decision) → (Co-operate, Co-operate)/(Defect, Defect)

You then consider the tuple (equivalent to an observation-action map) that leads to the best outcome.
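That guessed procedure can be sketched as follows, scoring each tuple of necessarily linked decisions with the hitchhiker’s payoffs (a sketch under my reading of the parent comment; Co-operate corresponds to paying):

```python
def utility(linked_decision: str) -> int:
    """Both the simulated and actual decision take the same value,
    so one value determines the whole outcome."""
    if linked_decision == "Co-operate":
        return -50          # driven to town, pays $50
    return -1_000_000       # left to die in the desert

# The necessarily linked tuples from the parent comment.
tuples = [("Co-operate", "Co-operate"), ("Defect", "Defect")]

# Pick the tuple (equivalently, observation-action map) with the
# best outcome.
best = max(tuples, key=lambda t: utility(t[0]))
print(best)  # ('Co-operate', 'Co-operate')
```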

I agree that this provides the correct outcome, but I’m not persuaded that the reasoning is particularly solid. At some point we’ll want to be able to tie these models back to the real world and explain exactly what kind of hitchhiker corresponds to a (Defect, Defect) tuple. A hitchhiker that doesn’t get a lift? Sure, but what property of the hitchhiker makes it not get a lift?

We can’t talk about any actions it chooses in the actual world history, as it is never given the chance to make this decision. Next we could try constructing a counterfactual as per CDT and consider what the hitchhiker does in the world model where we’ve performed model surgery to make the hitchhiker arrive in town. However, as this is an impossible situation, there’s no guarantee that this decision is connected to any decision the agent makes in a possible situation. TDT counterfactuals don’t help either as they are equivalent to these tuples.

Alternatively, we could take the approach that you seem to favour and say that the agent makes the decision to defect in a paraconsistent situation where it is in town. But this assumes that the agent has the ability to handle paraconsistent situations, when only some agents have this ability, and it’s not clear how to interpret this for other agents. Inputs, however, have neither of these problems: all real world agents must do something given an input, even if that something is doing nothing or crashing, and these behaviours are easy to interpret. So modelling inputs allows us to more rigorously justify the use of these maps. I’m beginning to think that there would be a whole post’s worth of material if I expanded upon this comment.

How would you define equivalence?

I think I was using the wrong term. I meant linked in the logical counterfactual sense, say two identical calculators. Is there a term for this? I was trying to understand whether you were saying that we only care about the provable linkages, rather than all such linkages.

Edit: Actually, after rereading over UDT, I can see that it is much more similar than I realised. For example, it also separates inputs from models. More detailed information is included at the bottom of the post.

• Firstly you make a list of all potential situations that an agent may experience or for which an agent may be simulated. Decisions are included in this list, even if they might be incoherent for particular agents.

No? Situations are not evaluated, they contain instances of the agent, but when they are considered, it’s not yet known what the decision will be, so decisions are unknown, even if in principle determined by the (agents in the) situation. There is no matiching or assignment of possible decisions when we identify instances of the agent. Next, the instances are removed from the situation. At this point, decisions are no longer determined in the situations-with-holes (dependencies), since there are no agents and no decisions remaining in them. So there won’t be a contradiction in putting in any decisions after that (without the agents!) and seeing what happens.

I meant linked in the logical counterfactual sense, say two identical calculators.

That doesn’t seem different from what I meant, if appropriately formulated.

• I think the limit you’re running up against is how to formally define “possible”, and what model of decision-making and free will is consistent with a “perfect predictor”.

For many of us, “perfect predictor” implies “deterministic future, with choice being an illusion”. Whether that’s truly possible in our universe or not is unknown.

• Whether or not the universe is or isn’t truly deterministic (not the focus of this thread), it is a common enough belief that it’s worth modelling.

• Notice how none of these difficulties arise if you adopt the approach I posted about recently: you do not change the world, you discover which possible subjective world you live in. The question is always about what world model the agent has, not about the world itself, and about discovering more about that world model.

In the Parfit’s hitchhiker problem with the driver who is a perfect predictor, there is no possible world where the hitchhiker gets a lift but does not pay. The non-delirious agent will end up adjusting their world model to “Damn, apparently I am the sort of person who attempts to trick the driver, fails and dies” or “Happy I am the type of person who precommits to paying”, for example. There are many more possible worlds in that problem if we include the agents whose world model is not properly adjusted based on the input. In severe cases this is known as psychosis.

Similarly, “What if the 1000th digit of Pi were even?” is a question about partitioning possible worlds in your mind. Notice that there are not just two classes of those:

These classes include the possible worlds where you learn that the 1000th digit of Pi is even, the worlds where you learn that it is odd, the worlds where you never bother figuring it out, and the worlds where you learned one answer but then had to reevaluate it, for example, because you found a mistake in the calculations. There are also low-probability possible worlds, like those where Pi only has 999 digits, or where the 1000th digit keeps changing, and so on. All of those are possible world models; some are just not very probable a priori for the reference class of agents we are interested in.

...But that would be radically changing your world model from “there is the single objective reality about which we ask questions” to “agents are constantly adjusting models, and some models are better than others at anticipating future inputs.”

• I’m not sure that it solves the problem. The issue is that in the case where you always choose “Don’t Pay”, it isn’t easy to define what the predictor predicts, as it is impossible for you to end up in town. The predictor could ask what you’d do if you thought the predictor was imperfect (as then ending up in town would actually be possible), but this mightn’t represent how you’d behave against a perfect predictor.

(But further, I am working within the assumption that everything is deterministic and that you can’t actually “change” the world as you say. How have I assumed the contrary?)

• The principle of explosion isn’t a problem for all logics.

I think, in a way, the problem with Parfit’s Hitchhiker is: how would you know that something is a perfect predictor? Having a probability p of making every one of n predictions right only requires that the predictor be right in each individual prediction with probability x, where x^n>=p. So they have a better than 50% chance of making 100 consecutive predictions right if they’re right 99.31% of the time. By this metric, to be sure the chance they’re wrong is less than 1 in 10,000 (i.e. they’re right 99.99% of the time or more), you’d have to see them make 6,932 correct predictions. (This assumes that all these predictions are independent, unrelated events, in addition to a few other counterfactual requirements that are probably satisfied if this is your first time in such a situation.)
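The arithmetic can be checked directly:

```python
from math import log

# Per-prediction accuracy x needed for n consecutive correct
# predictions with probability at least p: x = p ** (1 / n).
x = 0.5 ** (1 / 100)
print(round(x, 4))  # 0.9931, i.e. right 99.31% of the time

# Conversely, the number of consecutive correct predictions after
# which per-prediction accuracy 0.9999 still leaves a 50% chance of
# a clean record: n = log(p) / log(x).
n = log(0.5) / log(0.9999)
print(round(n))  # 6931, i.e. roughly the 6,932 predictions above
```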

• Sure, in the real world you can’t know a predictor is perfect. But the point is that perfection is often a useful abstraction, and the tools that I introduced allow you to work either with real world problems, as you seem to prefer, or with more abstract problems, which are often easier to reason about. Anyway, by representing the input of the problem explicitly, I’ve created an abstraction that is closer to the real world than most of these problems are.

• I was suggesting that what model you should use if your current one is incorrect is based on how you got your current model, which is why it sounds like ‘I prefer real world problems’ - model generation details do seem necessarily specific. (My angle was that in life, few things are impossible, many things are improbable—like getting out of the desert and not paying.) I probably should have stated that, and that only, instead of the math.

by representing the input of the problem explicitly I’ve created an abstraction that is closer to the real world than most of these problems are.

Indeed. I found your post well thought out and formal, though I do not yet fully understand the jargon.

Where/how did you learn decision theory?

• Thanks, I appreciate the compliment. Even though I have a maths degree, I never formally studied decision theory. I’ve only learned about it by reading posts on Less Wrong, so much of the jargon is my attempt to come up with words that succinctly describe each concept.