Decision theory and dynamic inconsistency

Here is my current take on decision theory:

  • When making a decision after observing X, we should condition (or causally intervene) on statements like “My decision algorithm outputs Y after observing X.”

  • Updating seems like a description of something you do when making good decisions in this way, not part of defining what a good decision is. (More.)

  • Causal reasoning likewise seems like a description of something you do when making good decisions. Or equivalently: we should use a notion of causality that captures the relationships relevant to decision-making rather than intuitions about physical causality. (More.)

  • “How much do I care about different copies of myself?” is an arbitrary question about my preferences. If my preferences change over time, it naturally gives rise to dynamic inconsistency unrelated to decision theory. (Of course an agent free to modify itself at time T would benefit by implementing some efficient compromise amongst all copies forked off after time T.)

In this post I’ll discuss the last bullet in more detail since I think it’s a bit unusual, it’s not something I’ve written about before, and it’s one of the main ways my view of decision theory has changed in the last few years.

(Note: I think this topic is interesting, and could end up being relevant to the world in some weird-yet-possible situations, but I view it as unrelated to my day job on aligning AI with human interests.)

The transparent Newcomb problem

In the transparent version of Newcomb’s problem, you are faced with two transparent boxes (one small and one big). The small box always contains $1,000. The big box contains either $10,000 or $0. You may choose to take the contents of one or both boxes. There is a very accurate predictor, who has placed $10,000 in the big box if and only if they predict that you wouldn’t take the small box regardless of what you see in the big one.

Intuitively, once you see the contents of the big box, you really have no reason not to take the small box. For example, if you see $0 in the big box, you know for a fact that you are either getting $0 or $1,000. So why not just take the small box and walk away with $1,000? EDT and CDT agree about this one.
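To make the tension concrete, here is a minimal sketch (assuming a perfect predictor and the dollar amounts above; the policy names and the code structure are mine, not from the original problem statement) comparing the payoffs of a few policies, where a "policy" maps what you see in the big box to which boxes you take:

```python
# Minimal sketch of transparent Newcomb with an assumed perfect predictor.
# A "policy" maps the observed contents of the big box to the set of boxes taken.

SMALL = 1_000
BIG = 10_000

def payoff(policy):
    """Payoff of following `policy`, assuming the predictor is exactly right."""
    # The predictor fills the big box iff you would never take the small box,
    # regardless of what you see.
    never_takes_small = all("small" not in policy(contents) for contents in (0, BIG))
    big_contents = BIG if never_takes_small else 0

    taken = policy(big_contents)
    return (SMALL if "small" in taken else 0) + (big_contents if "big" in taken else 0)

policies = {
    "always take both": lambda big: {"small", "big"},
    "never take small": lambda big: {"big"},
    "take small only if big box is empty": lambda big: {"small"} if big == 0 else {"big"},
}

for name, policy in policies.items():
    print(f"{name}: ${payoff(policy):,}")

# Output:
# always take both: $1,000
# never take small: $10,000
# take small only if big box is empty: $1,000
```

At the policy level, "never take small" comes out far ahead; the ex-post argument in the previous paragraph is that, conditional on actually seeing an empty big box, taking the small box is still $1,000 better than walking away with nothing.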

I think it’s genuinely non-obvious what you should do in this case (if the predictor is accurate enough). But I think this is because of ambiguity about what you want, not how you should make decisions. More generally, I think that the apparent differences between EDT and UDT are better explained as differences in preferences. In this post I’ll explain that view, using transparent Newcomb as an illustration.

A simple inconsistent creature

Consider a simple creature which rationally pursues its goals on any given day—but whose goals change completely each midnight. Perhaps on Monday the creature is trying to create as much art and beauty as possible; on Tuesday it is trying to create joy and happiness; on Wednesday it might want something different still.

On any given day we can think of the creature as an agent. The creature on Tuesday is not being irrational when it decides to pursue joy and happiness instead of art and beauty. It has no special reason to try to “wind back the clock” and pursue the same projects it would have pursued on Monday.

Of course on Monday the creature would prefer to arrest this predictable value drift—it knows that on Tuesday it will be replaced with a new agent, one that will stop contributing to the project of art and beauty. The creature on Monday ought to make plans accordingly, and if they had the ability to change this feature of themselves they would likely do so. It’s a matter of semantics whether we call this creature a single agent or a sequence of agents (one for each day).

This sequence of agents could benefit from cooperating with one another, and it can do so in different ways. Normal coordination is off the table, since causality runs only one way from each agent to the next. But there are still options:

  • The Tuesday-creature might believe that its decision is correlated with the Monday-creature’s decision. If the Tuesday-creature tries to stop the Wednesday-creature from existing, then the Monday-creature might have tried to stop the Tuesday-creature from existing. If the correlation is strong enough and stopping the value change is expensive, then the Tuesday-creature is best served by being kind to its Wednesday-self, and helping to put it in a good position to realize whatever its goals may be. (Though note that this can unwind just like an iterated prisoner’s dilemma with finite horizon!)

  • The Tuesday-creature might believe that its decision is correlated with the Monday-creature’s predictions about what the Tuesday-creature would do. If the Tuesday-creature keeps on carrying out the Monday-creature’s plans, then the Monday-creature would be more motivated to help the Tuesday-creature succeed (and less motivated to try to prevent the value change). If the Monday-creature is a good enough predictor of the Tuesday-creature, then the Tuesday-creature is best served by at least “paying back” the Monday-creature for all of the preparation the Monday-creature did. (A toy version of this calculation follows the list.)
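Here is a toy expected-value sketch of that second mechanism. The benefit B, cost C, and prediction accuracy q are all illustrative assumptions, not numbers from the scenario; the point is only the shape of the comparison.

```python
# Toy model of the prediction-based mechanism above (all parameters are
# illustrative assumptions).
#
# Monday predicts whether Tuesday will carry out Monday's plans. If the
# prediction is "yes", Monday does preparation worth B to Tuesday; if "no",
# Monday does not help. Carrying out the plans costs Tuesday C, and q is the
# accuracy of Monday's prediction.

def ev_carry_out(B, C, q):
    # Evidentially, choosing to carry out the plans makes it likely (prob q)
    # that Monday predicted this and prepared.
    return q * B - C

def ev_defect(B, C, q):
    # Defecting makes it likely Monday predicted defection; Tuesday only
    # inherits the preparation if Monday mispredicted (prob 1 - q).
    return (1 - q) * B

B, C = 100.0, 30.0
for q in (0.5, 0.7, 0.9):
    better = "carry out" if ev_carry_out(B, C, q) > ev_defect(B, C, q) else "defect"
    print(f"q={q}: EV(carry out)={ev_carry_out(B, C, q):.0f}, "
          f"EV(defect)={ev_defect(B, C, q):.0f} -> {better}")

# Carrying out the plans pays exactly when (2q - 1) * B > C, i.e. only if the
# prediction is accurate enough relative to the cost.
```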

However none of these relationships are specific to the fact that it is the same creature on Monday and Tuesday; the fact that the cells are the same has no significance for the decision-theoretic situation. The Tuesday-creature has no intrinsic interest in the fact that it is not “reflectively stable”—of course that instability definitionally implies a desire to change itself, but not a further reason to try to help out the Monday-creature or Wednesday-creature, beyond the relationships described above.

A human inconsistency

I care a lot about what is going to happen to me in the future. I care much more about my future than about different ways that the world could have gone (or than my past for that matter). In fact I would treat those other possible versions of myself quite similarly to how I’d treat another person who just happened to be a lot like me.

This leads to a clear temporal inconsistency, which is so natural to humans that we don’t even think of it as an inconsistency. I’ll try to illustrate with a sequence of thought experiments.

Suppose that at 7AM I think that there is a 50% chance that a bell will ring at 8AM. At 7AM I am indifferent between the happiness of Paul-in-world-with-bell and Paul-in-silent-world. If you asked me which Paul I would prefer to stub his toe, I would be indifferent.

But by 8:01AM my preferences are quite different. After I’ve heard the bell ring, I care overwhelmingly about Paul-in-world-with-bell. I would very strongly prefer that the other Paul stub his toe than that I do.

Some people might say “Well you just cared about what happens to Paul, and then at 8AM you learned what is real. Your beliefs have changed, but not your preferences.” But consider a different experiment where I am duplicated at 7AM and each copy is transported to a different city, one where the bell will ring and the other where it will not. Until I learn which city I’m in, I’m indifferent between the happiness of Paul-in-city-with-bell and Paul-in-silent-city. But at the moment when I hear the bell ring, my preferences shift.

Some people could still say “Well you cared about the same thing all along—what happens to you—and you were merely uncertain about which Paul was you.” But consider the Paul from before the instant of copying, informed that he is about to be copied. That Paul knows full well that he cares about both copies. Yet sometime between the copying and the bell Paul has become much more parochial, and only cares about one. It seems to me that there is little room to escape the inconsistency here.

One could still say “Nonsense, all along you just cared about what happened to you, you were just uncertain about which of the copies you were going to become.” I find this very unpersuasive (why think there is a fact of the matter about who “I” am?), but at this point I think it’s just a semantic dispute. Either my preferences change, or my preferences are fixed but defined in terms of a concept like “the real me” whose meaning changes. It all amounts to the same thing.

This is not some kind of universal principle of rationality—it’s just a fact about Paul. You can imagine different minds who care about all creatures equally, or who care only about their own future experiences, or who care about all the nearby possible copies of themselves. But I think many humans feel roughly the same way I do—they have some concern for others (including creatures very similar to themselves in other parts of the multiverse), but have a very distinctive kind of caring for what they themselves will actually experience in the future.

Altruism is more complicated

In the examples above I discussed stubbing my toe as the unit of caring. But what if we had instead talked of dollars? And what if I am a relatively altruistic person, who would use marginal dollars to try to make the world better?

Now in the case of two copies in separate cities it is clear enough that my preferences never change. I’m still willing to pay $1 to give my counterpart $2. After all, they can spend those dollars just as well as I can, and I don’t care who it was who did the good.

But in the case of a single city, where the bell either rings or it doesn’t, we run into another ambiguity in my preferences—another question about which we need not expect different minds to agree no matter how rational they are.

Namely: once I’ve heard the bell ringing, do I care about the happiness of the creatures in the world-with-bell (given that it’s the real world, the one we are actually in), or do I care about the happiness of creatures in both worlds even after I’ve learned that I happen to be in one of them?

I think people have different intuitions about this. And there are further subtle distinctions, e.g. many people have different intuitions depending on whether the ringing of the bell was a matter of objective chance (where you could imagine other copies of yourself on far away worlds, or other branches, facing the same situation with a different outcome), or a matter of logical necessity where we were simply ignorant.

While some of those disagreements may be settled by more discussion, I think we should be able to agree that in principle we can imagine a mind that works either way: one that cares about people in other worlds-that-could-have-been, or one that doesn’t.

Most humans have at least some draw towards caring only about the humans in this world. So the rest of my post will focus on their situation.

Back to transparent Newcomb (or: The analogy)

Consider again a human playing the transparent version of Newcomb’s problem. They see before them two boxes, a small one containing $1,000 and a big one containing $0. They are told that the big box would have contained $10,000 if a powerful predictor had guessed that they would never take the small box.

If the human cares only for their own future experiences, and would spend the money only on themselves, they have a pretty good case for taking the small box and walking away with $1,000. After all, their own future experiences are either going to involve walking away with $1,000 or with nothing; there is no possible world where they experience seeing an empty big box and then end up with the money after all.

Of course before seeing the contents of the big box, the human would have much preferred to commit to never taking the small box. If they are an evidential decision theorist, they could also have just closed their eyes (curse that negative value of information!). That way they would have ended up with $10,000 instead of $1,000.

Does this mean that they have reason to take nothing after all, even after seeing the box?

I think the human’s situation is structurally identical to the inconsistent creature whose preferences change at midnight. Their problem is that in the instant when they see the empty big box, their preferences change. Once upon a time they cared about all of the possible versions of themselves, weighted by their probability. But once they see the empty big box, they cease to care at all about the versions of themselves who saw a full box. They end up in conflict with other very similar copies of themselves, and from the perspective of the human at the beginning of the process the whole thing is a great tragedy.

Just like the inconsistent creature, the human would have strongly preferred to make a commitment to avoid these shifting preferences. Just like the inconsistent creature, they might still find other ways to coordinate even after the preferences change, but it’s more contingent and challenging. Unlike the inconsistent creature, they can avoid the catastrophe by simply closing their eyes—because the preference change was caused by new information rather than by the passage of time.

The situation is most stark if we imagine the predictor running detailed simulations in order to decide whether to fill the big box. In this case, there is not one human but three copies of the human: two inside the predictor’s mind (one who sees an empty box and one who sees a full box) and one outside the predictor in the real world (seeing an empty or full box based on the results of the simulation). The problem for the human is that these copies of themselves can’t get along.

Even if you explained the whole situation to the human inside the simulation, they’d have no reason to go along with it. By avoiding taking the small box, all they can achieve is to benefit a different human outside of the simulation, whom they no longer care about at all. From their perspective, better to just take the money (since there’s a 50% chance that they are outside of the simulation and will benefit by $1,000).
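A quick back-of-the-envelope version of that selfish reasoning, under the 50/50 chance of being the simulated copy and counting only money the human personally ends up with (the probability is from the scenario above; treating simulated money as worthless to a selfish agent is the assumption being made explicit):

```python
# Selfish expected value after seeing an empty big box, assuming a 50% chance
# of being the copy inside the predictor's simulation. Money grabbed inside
# the simulation is worth nothing to a purely selfish human.

p_outside = 0.5  # probability of being the real, outside human

ev_take_small = p_outside * 1_000   # $500: only the outside copy gets real money
ev_refuse = 0                       # walk away with nothing either way

print(ev_take_small, ev_refuse)     # 500.0 0
```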

(There are even more subtleties here if these different possible humans have preferences about their own existence, or about being in a simulation, or so on. But none of these change the fundamental bottom line.)

Altruism is still more complicated

If we consider a human who wants to make money to make the world better, the situation is similar but with an extra wrinkle.

Now if we explain the situation to the inside human, they may not be quite so callous. Instead they might reason “If I don’t take the small box, there is a good chance that a ‘real’ human on the outside will then get $10,000. That looks like a good deal, so I’m happy to walk away with nothing.”

Put differently, when we see an empty box we might not conclude that the predictor didn’t fill the box. Instead, we might consider the possibility that we are living inside the predictor’s imagination, being presented with a hypothetical that need not have any relationship to what’s going on out there in the real world.
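The same arithmetic from the altruist’s point of view, where dollars matter no matter which copy spends them (again assuming a 50% chance of being the simulated copy and an accurate predictor; the exact numbers are the scenario’s, the bookkeeping is mine):

```python
# Altruistic expected value after seeing an empty big box, counting dollars
# that improve the world regardless of which copy spends them. Assumes a 50%
# chance of being the simulated copy and an accurate predictor.

p_inside = 0.5

# Refuse the small box: if you're the simulated copy, the predictor fills the
# real box and the outside human walks away with $10,000. If you're the
# outside human staring at an empty box, refusing yields $0.
ev_refuse = p_inside * 10_000 + (1 - p_inside) * 0   # = $5,000

# Take the small box: whichever copy you are, the real outside human ends up
# taking the small box for $1,000 (the predictor leaves the big box empty).
ev_take_small = 1_000

print(ev_refuse, ev_take_small)   # 5000.0 1000
```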

The most extreme version of this principle would lead me to entertain very skeptical / open-minded beliefs about the world. In any decision problem where “what I’d do if I saw X” matters for what happens in cases where X is false, I could say that there is a “version” of me in the hypothetical who sees X. So I can never really update on my observations.

This leads to CDT=EDT=UDT. For people who endorse that perspective (and have no indexical preferences), this post probably isn’t very interesting. Myself, I think I somewhat split the difference: I think explicitly about my preferences about worlds that I “know don’t exist,” roughly using the framework of this post. But I justify that perspective in significant part from a position of radical uncertainty: I’m not sure if I’m thinking about worlds that don’t exist, or if it’s us who don’t exist and there is some real world somewhere thinking about us.

Conclusion

Overall the perspective in this post has made me feel much less confused about updatelessness. I expect I’m still wrong about big parts of decision theory, but for now I feel tentatively comfortable using UDT and don’t see the alternatives as very appealing. In particular, I no longer think that updating feels very plausible as a fundamental decision-theoretic principle, but at the same time don’t think there’s much of a reflective-stability-based argument for e.g. one-boxing in transparent Newcomb.

Most of the behaviors I associate with being “updateless” seem to really be about consistent preferences, and in particular continuing to care about worlds that are in some sense inconsistent with our observations. I believe my altruistic preferences are roughly stable in this sense (partially justified by a kind of radical epistemic humility about whether this is the “real” world), but my indexical preferences are not. The perspective in this post also more clearly frames the coordination problem faced by different copies of me (e.g. in different plausible futures) and I think has left me somewhat more optimistic about finding win-win deals.