Sneaky Strategies for TDT

My previous article on problematic problems attracted quite a lot of feedback and comment. One of the questions it posed (question 2) was whether TDT should do something other than what the initial analysis suggests.

I’ve had a couple of ideas on that, partly in response to the comments. I’m posting some follow-up thoughts in the hope that they might help clarify the original problems. Basically, it seems there are some sneaky things that TDT could try to do, but mostly they are not robust to slightly different variants of the problems. Interestingly, some of these variants look even “fairer”, since they contain no mention of TDT anywhere in the problem statement.

An alternative approach could be for TDT to resolve never to let itself be out-performed by any other decision theory, because of evolutionary considerations as discussed here, even if that requires a large sacrifice of immediate utility (e.g. two-boxing and taking $1000 along with CDT, rather than one-boxing and taking $1 million but letting CDT walk away with $1,001,000). I don’t currently know what to think about that, except that it makes my head spin; it also sounds like a rather Unfriendly form of AI.

Problem 1: Omega (who experience has shown is always truthful) presents the usual two boxes A and B and announces the following. “Before you entered the room, I ran a simulation of this problem as presented to an agent running TDT. I won’t tell you what the agent decided, but I will tell you that if the agent two-boxed then I put nothing in Box B, whereas if the agent one-boxed then I put $1 million in Box B. Regardless of how the simulated agent decided, I put $1000 in Box A. Now please choose your box or boxes.”

Consider the following sneaky strategy S, which TDT agents might try to adopt.

“Use every piece of evidence at your disposal to try to decide whether you are the simulated agent. If you can prove you are not the simulated agent, then 2-box; otherwise 1-box”.

We could imagine Omega making this strategy very easy for agents, by revealing the full source-code of the simulated agent “C-sim” as well as the full source code of the actual choosing agent “C-act”. By “full source code” here, we have to include not just code describing how the agent executes TDT, but also its exact computational state on entry to the room (or simulated room), including a database of existing beliefs and proven theorems, preferences, values etc. To be fully precise, it should also describe any computational constraints that the agent faces, including time-out after t steps, or an upper bound on memory usage m, if they exist. The choosing agent then checks whether C-sim = C-act, and unless there is an exact match, it two-boxes. (If Omega is not kind enough to reveal C-act, a TDT agent may be able to generate it anyway via a quine-program construction.)
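For concreteness, here’s a minimal sketch of strategy S in Python, assuming Omega really does hand over both source-code strings; the names c_sim and c_act, and the idea of a plain string comparison, are purely illustrative.

```python
def strategy_s(c_sim: str, c_act: str) -> str:
    """Sneaky strategy S: two-box only if we can prove we are not the simulation.

    c_sim -- full source-code (including initial state) of the agent Omega simulated
    c_act -- full source-code (including initial state) of the agent actually choosing

    Both strings are assumed to be revealed by Omega; if c_act is not revealed,
    the agent might reconstruct it via a quine-style construction instead.
    """
    if c_sim != c_act:
        # An exact mismatch proves we are not the simulated agent, so the
        # contents of Box B are already fixed independently of our choice.
        return "two-box"
    # Otherwise we cannot rule out being (or being logically linked to) the
    # simulation, so we one-box to make Box B contain the $1 million.
    return "one-box"
```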

There seem to be a few difficulties with strategy S:

1. Each particular C-act will still under-perform on a matching problem, though arguably that is an “unfair” problem (specifically tailored by Omega to get that particular C-act to shoot itself in the foot).

2. It looks unattractive for different “flavours” of TDT agent to de-link themselves like this. Aren’t we going to run into problems with the different flavours refusing to co-operate with each other on Prisoner’s Dilemma, and only co-operating with exact clones? One answer is “We don’t know until we perform a formal analysis”. Another answer is “No, the different flavours of TDT are all linked at the strategy level, since they all follow strategy S; they are just de-linked at the execution level, since they do something different based on their different epistemic circumstances.” That’s slightly more convincing, but I still feel a need for analysis. One concern is whether there are problems where some TDT agents/utilities benefit from a fine-grained de-linking of execution and others don’t; what then is the best overall TDT strategy?

3. It doesn’t solve all versions of the problem, since Omega doesn’t have to reveal the exact C-sim code which it used. Omega could just say that it picked C-sim’s code at random from a probability distribution across all full source-codes for TDT agents. (To model this as a formally-defined problem, we’d need to specify the distribution of course.) In such a case, no TDT agent can prove it is distinct from the simulation, so by strategy S, all TDT agents will one-box. And CDT will win against all flavours of TDT.

It is still arguable that the problem is “unfair” to TDT as a whole, but now suppose that Omega samples its C-sim from a probability distribution across multiple types of agent source-code, with TDT agents just part of the population. There is thus some probability p_t > 0 that the simulated agent is a TDT agent. If the difference in box values is big enough, a TDT agent still maximizes expected winnings by one-boxing: two-boxing gains value_A, but (via the logical link to the simulation, in the p_t fraction of cases where the simulation is a TDT agent) it forfeits value_B, so one-boxing wins whenever value_A / value_B < p_t (e.g. 1000/1000000 = 1/1000 < p_t). This doesn’t seem particularly unfair to TDT, and yet CDT would still do better.
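As a rough sanity check on that inequality, here’s a small Python calculation. It assumes, purely for illustration, that a non-TDT simulated agent one-boxes with some fixed probability q that doesn’t depend on our choice; q, p_t and the box values below are made-up parameters, not part of the problem statement.

```python
def expected_winnings(p_t: float, q: float, value_a: float = 1_000,
                      value_b: float = 1_000_000) -> dict:
    """Expected winnings for a TDT agent, assuming all TDT agents act alike.

    p_t -- probability that Omega's simulated agent is a TDT agent
    q   -- probability that a non-TDT simulated agent happened to one-box
    """
    # If the sim is TDT, its choice matches ours; if not, Box B is filled
    # with probability q regardless of what we do.
    one_box = p_t * value_b + (1 - p_t) * q * value_b
    two_box = value_a + (1 - p_t) * q * value_b
    return {"one-box": one_box, "two-box": two_box}

# One-boxing wins exactly when p_t * value_b > value_a, i.e. value_a / value_b < p_t.
print(expected_winnings(p_t=0.01, q=0.5))    # one-boxing better (0.001 < 0.01)
print(expected_winnings(p_t=0.0005, q=0.5))  # two-boxing better (0.001 > 0.0005)
```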


An alternative strategy to S is what I’d informally call “TDT uber alles”. It relies on long-range consequentialism, and perhaps “utility trading” as well (for TDT agents that don’t inherently care about long-range consequences). A TDT agent might argue to itself “If TDT beats CDT—and other theories—at each and every available opportunity, and at least matches them where it can’t beat them, then TDT will come to dominate the agent space as quickly as possible, which will maximize my expected utility. So I’m not going to let CDT beat me here: I’ll two-box”.

Problem 2: Our ever-reliable Omega now presents ten boxes, numbered from 1 to 10, and announces the following. “Exactly one of these boxes contains $1 million; the others contain nothing. You must take exactly one box to win the money; if you try to take more than one, then you won’t be allowed to keep any winnings. Before you entered the room, I ran multiple simulations of this problem as presented to an agent running TDT, and determined the box which the agent was least likely to take. If there were several such boxes tied for equal-lowest probability, then I just selected one of them, the one labelled with the smallest number. I then placed $1 million in the selected box. Please choose your box.”
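To keep the rule straight, here’s a small sketch (my own illustration; nothing is specified beyond the quoted rule) of how Omega places the money, given its measured per-box probabilities from the simulations:

```python
def omega_places_money(box_probs: list[float]) -> int:
    """Return the 1-based index of the box Omega fills with $1 million.

    box_probs -- Omega's estimate, from its simulations, of how likely the
                 TDT agent is to take each of the ten boxes.
    Rule: pick the least likely box; break ties in favour of the smallest number.
    """
    lowest = min(box_probs)
    return min(i + 1 for i, p in enumerate(box_probs) if p == lowest)

# All boxes equally likely -> Omega fills Box 1.
print(omega_places_money([0.1] * 10))  # 1
```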

My original analysis gave TDT no more than 10% chance of winning the $1 million. However, here’s something a bit cleverer. Suppose each particular TDT agent has a favourite number in the set {1, 2} constructed as a balanced function over TDT full source-codes. (One way I imagine doing this is by taking a SHA-256 hash of the full source-code, extracting the first bit, and adding 1.) Each agent chooses the box of their favourite number with probability 1. Since the simulated agent does so as well, by the rules of the game, the $1 million must be in one of boxes 1 or 2. And unless the full source-code of the simulated agent is known, it will not be possible for the choosing agent to tell whether Box 1 or 2 was picked by the sim, so that if the choosing agent picks their own favourite box, they have a 50% chance of winning. And CDT won’t do any better.
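Here’s how that “favourite number” construction might look in Python, treating the agent’s full source-code as a string and using the first bit of its SHA-256 hash; this is just the illustration mentioned above, not a canonical definition.

```python
import hashlib

def favourite_number(full_source_code: str) -> int:
    """Map an agent's full source-code to a 'favourite' box number in {1, 2}.

    SHA-256 behaves like a balanced function, so roughly half of all TDT
    source-codes get favourite number 1 and half get 2.
    """
    digest = hashlib.sha256(full_source_code.encode("utf-8")).digest()
    first_bit = digest[0] >> 7   # most significant bit of the first byte
    return first_bit + 1         # 1 or 2

print(favourite_number("def agent(): ..."))  # deterministic; either 1 or 2
```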

Omega could prevent this approach through the information it reveals. If it reveals the full source-code for C-sim (and, in its simulation, presents this same source-code C-sim to C-sim itself), then TDT shouldn’t try to execute the “favourite number” strategy. A better strategy: if C-act = C-sim, pick each of the ten boxes with equal probability; if C-act differs from C-sim, pick Box 1 with certainty.
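A minimal sketch of that revised strategy, again with illustrative names only:

```python
import random

def choose_box(c_sim: str, c_act: str) -> int:
    """Strategy for Problem 2 when Omega reveals the simulated agent's code."""
    if c_act == c_sim:
        # We cannot distinguish ourselves from the simulation, so our choice is
        # logically linked to it: picking uniformly gives a 10% chance of the
        # money, which is the best a matching agent can do here.
        return random.randint(1, 10)
    # We are provably a different agent, so the simulation (which saw its own
    # code and chose uniformly) forced the money into Box 1 by the tie-break.
    return 1
```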

Or much as for Problem 1, Omega can vary the problem as follows:

“...Before you entered the room, I ran multiple simulations of this problem as presented to different randomly-selected TDT agents. I determined which box they were collectively least likely to take...” (Again this needs a distribution to be specified to become formally precise.)

There doesn’t seem to be much that TDT agents can do about that, except give a collective groan and arrange that TDT collectively selects each of the ten boxes with equal probability. The simplest way to ensure that is for each TDT agent individually to select the boxes with equal probability (so each individual agent at least gets an equal chance at the prize). And any other agent just takes Box 1, laughing all the way to the bank.

Consider a final variant as follows:

“...Before you entered the room, I ran multiple simulations of this problem as presented to different agents, sampled uniformly from different possible future universes according to their relative numbers, with the universes themselves sampled from my best projections of the future. I determined the box which the agents were least likely to take...”

If TDT uber alles is the future, then almost all the sampled agents will be TDT agents, so the problem is essentially as before. And now it doesn’t look like Omega is being unfair at all (nothing discriminatory in the problem description). But TDT is still stuck, and can get beaten by CDT in the present.

One thought is that the TDT collective should vary the box probabilities very slightly, so that Omega can tell which box has the lowest probability but regular CDT agents can’t; in that case CDT also has at most a 10% chance of winning. Possibly the computationally-advanced members of the collective toss a logical coin (which only they and Omega can compute) to decide which box to de-weight, while the less advanced members (the ones who actually have to compete against rival decision theories) just pick at random. If CDT tries to simulate TDT instances, it will detect apparently equal probabilities, pick Box 1, and most likely get it wrong...
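As a purely illustrative sketch of what “vary the probabilities very slightly” could mean, with the choice of de-weighted box standing in for a logical coin that only sufficiently powerful agents can actually compute:

```python
def near_uniform_probs(deweighted_box: int, epsilon: float = 1e-12) -> list[float]:
    """Ten box probabilities that are uniform except for one tiny de-weighting.

    deweighted_box -- the box chosen by the 'logical coin' (computable only by
                      Omega and the computationally-advanced TDT agents)
    epsilon        -- small enough that a CDT agent simulating TDT instances
                      cannot tell the distribution apart from a uniform one
    """
    probs = [0.1 + epsilon / 9] * 10   # nine boxes get a tiny boost
    probs[deweighted_box - 1] = 0.1 - epsilon   # one box is slightly de-weighted
    return probs

# Omega can see that box 7 (say) is least likely, and puts the money there;
# an agent that cannot compute the logical coin sees what look like equal odds.
print(near_uniform_probs(7))
```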

Edit 2: I’ve clarified the alternative to the “favourite number” strategy if Omega reveals C-sim in Problem 2. We can actually get a range of different problems and strategies by slight variations here. See the comment below from lackofcheese, and my replies.