Or: Formalizing Timeless Decision Theory

Previously:

0. Decision Theories: A Less Wrong Primer
1. The Problem with Naive Decision Theory
2. Causal Decision Theory and Substitution

WARNING: The main result of this post, as it’s written here, is flawed. I at first thought it was a fatal flaw, but later found a fix. I’m going to try and repair this post, either by including the tricky bits, or by handwaving and pointing you to the actual proofs if you’re curious. Carry on!

Summary of Post: Have you ever wanted to know how (and whether) Timeless Decision Theory works? Using the framework from the last two posts, this post shows you explicitly how TDT can be implemented in the context of our tournament, what it does, how it strictly beats CDT on fair problems, and a bit about why this is a Big Deal. But you’re seriously going to want to read the previous posts in the sequence before this one.

We’ve reached the frontier of decision theories, and we’re ready at last to write algorithms that achieve mutual cooperation in Prisoner’s Dilemma (without risk of being defected on, and without giving up the ability to defect against players who always cooperate)! After two substantial preparatory posts, it feels like it’s been a long time, hasn’t it?

But look at me, here, talking when there’s Science to do...

Once again, we’re programming an agent X to face another agent Y in a game G, where X and Y have access to each others’ source codes and the source code of G. (In this post, we’ll sweep the “player 1, player 2” bit back into the definition of G for simplicity in our code.) We may assume that at least one of the x_i is a Nash equilibrium, since X and Y can treat a mixed equilibrium like an extra option and define its expected payoffs as usual.

Not Quite Timeless Decision Theory

Here’s our first shot at TDT, based on the idea of substituting different things for the source code of X in our prediction of Y...

def X(code G, code Y):

For each x_i:
- Write a very simple agent A_i that always outputs x_i in game G.
- Try to deduce output Y(code G, code A_i).
- If this succeeds, calculate U(x_i, output Y(code G, code A_i)); call this u_i.
If u_i has been deduced for all x_i, return the x_i for which u_i is largest.
Otherwise, return the default Nash equilibrium.

To see what this algorithm does, let’s try it on Newcomb’s Problem as we formalized it last time; we’ll let A₁ denote the agent that always one-boxes and A₂ denote the agent that always two-boxes. If P has a valid and sufficiently powerful inference module, the output of P(code N, code A₁) is 1 and the output of P(code N, code A₂) is 2. Why? A₁ and A₂ are very simple bots, so P will be able to deduce their outputs, and we know what it does in that case.

And then, if X also has a valid and sufficiently powerful inference module, X will deduce those outputs for P. Thus X will deduce that u₁=100 and u₂=1, so X will indeed one-box; and if P has a sufficiently powerful inference module, it will therefore predict that X one-boxes, and fill the box, so that X gets payoff 100!

Again, what we’re doing here is not counterfactual reasoning, but instead reasoning about what your opponent would do against some other simpler agent; therefore we’re again evading the problem of spurious counterfactuals. (On a philosophical level, this decision theory finds the most successful constant strategy A_i and mimics its output.)

Alright, I said there were going to be two problems with this attempt, and indeed I called it Not Quite Timeless Decision Theory. So what goes wrong?

Problem: NQTDT won’t cooperate against itself!

If you face two NQTDT agents against each other in the Prisoner’s Dilemma, what happens? First, note that NQTDT will defect against either a CooperateBot (A_C) or a DefectBot (A_D) in the Prisoner’s Dilemma. If X and Y are both NQTDT, then X deduces that Y defects against both A_C and A_D, therefore X calculates u_C=0 and u_D=1, so X defects. (And of course Y does the same.) Drat!

And the second problem?

Problem: NQTDT might one-box and not be rewarded!

In Newcomb’s Problem, what if P has a powerful enough inference module to show that A₁ one-boxes and A₂ two-boxes, but not one powerful enough to figure out what X is doing? (After all, X is a bit more complicated than the A_i are.) By the reasoning above, X will one-box, but P won’t fill the box—the worst of all possible outcomes!

So NQTDT is broken. What can we do?

Idea 5: Sanity Check

Well, instead of assuming that our opponent Y will behave in the real game as it did against the optimal A_i, we could try to deduce Y’s actual behavior against X as a “sanity check” before we make our final decision. That is,

def X(code G, code Y):

For each x_i:
- Write a very simple agent A_i that always outputs x_i.
- Try to deduce output Y(code G, code A_i).
- If this succeeds, calculate U(x_i, output Y(code G, code A_i)); call this u_i.
If u_i has been deduced for all x_i, take the x_i for which u_i is largest, and try to deduce “U(x_i, output Y(code G, code X)) ≥ u_i”.
If that succeeds, then return x_i.
Else, return the default Nash equilibrium.

Note that the new line checks the behavior of Y against X itself, not the A_i. This “sanity check” ensures that, when X imitates the most successful of the A_i, it achieves at least the payoff that A_i does. Indeed, if our program does this then it’s again un-exploitable in the same sense as CDT, and it still one-boxes on Newcomb’s Problem (if and only if it deduces that P is sufficiently powerful to deduce this fact), so it’s no longer broken in the above sense.

But by the same argument as before, it still defects on the Prisoner’s Dilemma against itself, so this is still not TDT.

What else could we do better? Well, this agent “mimics” only the very simplest of strategies- the strategies that correspond to one-player games for its opponent. What if we instead let it mimic more complicated (but still basic) strategies, like versions of Newcomb’s Predictor?

Idea 6: Mimic something a bit more clever

This is going to be a bit more complicated than the last one; we first need to define the agents that we’ll examine and mimic. Consider each function f from the set {y₁, . . . , y_m} to the set {x₁, . . . , x_n} (there are n^m such functions), and define A_f as follows:

def A_f(code G, code Y):

Try and deduce the output of Y(code G, code A_f).
If this succeeds, return f(output Y). Else, return the Nash equilibrium.

Note that the constant strategies A_i are special cases of the A_f, corresponding to the constant functions; note also that Newcomb’s Predictor is an example of an A_f that’s not simply a constant strategy. Now we’ll write a decision theory that looks for the most successful of the A_f, applies a sanity check, and mimics it. This results in an instance of...

Timeless Decision Theory

def X(code G, code Y):

For each f:
- Write the agent A_f.
- Try to deduce output Y(code G, code A_f).
- If this succeeds, calculate U(output A_f, output Y(code G, code A_f)); call this u_f.
If u_f has been deduced for all f, take the f for which u_f is largest, and try to deduce “U(output A_f, output Y(code G, code X)) ≥ u_f”.
If that succeeds, then return (output A_f).
Otherwise, return the default Nash equilibrium.

Let’s try and parse what this algorithm does. It considers all possible Newcomblike agents (the A_f) and tests out how they fare against Y. Then it takes the best result and does a sanity check: will Y actually respond against X as it does against this particular A_f? And if that checks out, then X goes ahead and outputs what A_f does.⁰

We can sketch out some results that this decision theory gets (when equipped with a sufficiently powerful inference module). First, it correctly solves one-player games (since one of the constant strategies A_i is the optimum) and zero-sum two-player games (for the same reasons that CDT does). It one-boxes on Newcomb whenever P is a powerful enough predictor (since the best of the A_f are all one-boxers). And finally, it cooperates with itself on a simple Prisoner’s Dilemma!

Why? If X and Y are both running this decision theory, consider the function f with f(C)=C and f(D)=D, it’s clear that Y cooperates with this A_f (by the same logic that makes it one-box on Newcomb’s Problem) and that no A_f achieves a better result against Y than mutual cooperation. Thus, X will run the sanity check for this A_f, which is equivalent to trying to deduce that Y cooperates against X, while Y attempts to do the same. Here comes the delicate argument from mathematical logic; we’ll phrase things in terms of logical provability, but the analogous arguments hold for other (valid and sufficiently powerful) kinds of inference modules:

Löb’s Theorem states that, within a formal system, proving “if there’s a proof of S, then S actually holds” leads to a proof of S itself. In our current context, let S be the statement “U(output A_f, output Y(code G, code X)) ≥ u_f” that appears in X’s sanity test, and T be the corresponding statement “V(output A_f, output X(code G, code Y)) ≥ u_f” that appears in Y’s sanity-test. If there were a proof of S, then cooperation passes X’s sanity test¹, so that output X(code G, code Y))=C. But this directly implies the statement T, so cooperation passes Y’s sanity test, so that output Y(code G, code X))=C, which implies S itself. Thus, by Löb’s Theorem, there must be a proof of S, and therefore X and Y do in fact cooperate! (I do realize how confusing this process of logical bootstrapping is—it’s helpful to think of it as a useful kind of self-fulfilling prophecy.)

This algorithm is a general decision theory rather than something with ad-hoc patches, so it satisfies my criteria for a better decision theory. It also behaves like Timeless Decision Theory is supposed to on the classical problems. Is it therefore justified to call this TDT? Well, let’s again think about it on the philosophical level.

Each Newcomblike agent A_f represents a particular deal that it tries to strike with Y (if I deduce you’re doing this, then I’ll do that). X looks to see which deals Y will accept, and offers Y the most favorable (to X) of those deals, and if Y “accepts” (passes the sanity check) then X fulfills its end of the bargain. Since some deals are better than the best Nash equilibrium, this gets better results than CDT in many games, while still being un-exploitable. The novelty (compared to what traditional game theorists already know) is the fact that the communication of the deal and its acceptance all happen via prediction of each other rather than via exchange of messages, which allows for deals to happen even when the agents are unable to explicitly bargain or guarantee their fulfilment of their end by external precommitments

That’s a pretty good account of what TDT does, no? Also, I realized that I had the right description when I recalled Eliezer’s description of when TDT cooperates on the Prisoner’s Dilemma: if and only if the opponent would one-box with the TDT agent as Omega. This is exactly what this algorithm ends up doing.

It’s fair to say that Yudkowsky’s invention of Timeless Decision Theory was a triumph. In fact, I’ll make a note here...

TDT: HUGE SUCCESS

Before we move on, we should take a little while to remark on how awesome this is.

Since the Prisoner’s Dilemma was first formulated, it’s been held up as a dark and dismal portent for the prospects of cooperation among rational people. Douglas Hofstadter felt, on a philosophical level, that there must be a way to see cooperation as the rational choice, and called it superrationality; but his experiments with actual people were discouraging, and he could find no solid theoretical underpinning for that intuition, so game theorists continued to ignore it.

The insight of TDT (as well as the other decision theories we might term “reflexive”) is that better results can be achieved when you have more information than the original setup allows; in particular, if it’s common knowledge that agents have some ability to predict each others’ thought process, then mutual cooperation really can arise out of rational self-interest.

Although it’s difficult to make the right conditions obtain between individual humans (as in Hofstadter’s experiments), it may be possible for a TDT-friendly setup to exist between different organizations, or different countries, if they have means of keeping tabs on each others’ strategic planning. This may in fact have saved the world from nuclear destruction...

John von Neumann, the founder of game theory, advocated a pre-emptive nuclear strike on the Soviet Union throughout the 1950s; he was convinced that there was no other possibility of avoiding nuclear destruction ourselves. But here we have a good analogue to our setup; rather than exchanging source code, both governments had spies placed in the other, and so were likely to know what the other was thinking (albeit with some delay). This might have led to unconscious use of TDT thinking and thus spared us from nuclear war, despite the fact that the professional rationalists of the day hadn’t conceived of such means to mutual cooperation!²

But this sequence isn’t done yet: TDT may surpass CDT, but it’s still under-defined when it comes to bargaining, and there’s one notable problem in which TDT falls short of common sense...

TO BE CONTINUED (WITH CAKE) IN PART IV

Notes

0. As with CDT, there are multiple other variants to consider; this one insists on the best possible deal that Y would accept against an A_f, or else no deal at all. The downside is that two such agents will fail to reach a deal if they have different “favorite” deals, even if each would be willing to compromise against an A_f. We’ll say more about this in the context of bargaining games later.

Also, it’s very important that we have the sanity check here, since otherwise Y could defect against X and have X cooperate (as long as Y gets Newcomb’s Problem right). Without the sanity check, other agents take X’s lunch money, and rightly so!

1. Existence of a proof of S is not enough to conclude that X’s sanity test succeeds, since provability is not decidable and so the condition for failing the check doesn’t imply absence of the proof. One way out is to have a provability oracle perform the sanity check, which makes it impossible for players to be algorithms. Another is to prove some bounds on the length of the shortest proof of S, and write the algorithm in a way that guarantees the sanity check to succeed within these bounds. Yet another way is to never give up, work on the sanity check forever, which makes the agent completely unable to cope with many situations. (Thanks to Vladimir Nesov for this footnote.)

2. Thomas Schelling, among others, worked out a plausibly stable logic of mutually assured destruction, but it still invited all sorts of instability when limited conflict was involved. He also praised the idea of encouraging spies in certain departments rather than prosecuting them, in order to keep the Soviet Union from worrying that we were planning a first strike. If you’ve never read The Strategy of Conflict, I highly recommend it; it is simultaneously one of the most enlightening and most engaging books I’ve read in recent years.

Decision Theories: A Semi-Formal Analysis, Part III

Or: Formalizing Timeless Decision Theory

Not Quite Timeless Decision Theory

Problem: NQTDT won’t cooperate against itself!

Problem: NQTDT might one-box and not be rewarded!

Idea 5: Sanity Check

Idea 6: Mimic something a bit more clever

Timeless Decision Theory

TDT: HUGE SUCCESS

Notes