Sneaky Strategies for TDT

My previous article on problematic problems attracted quite a lot of feedback and comment. One of the questions it posed (2) was whether TDT should do something other than the initial analysis suggests.

I’ve had a couple of ideas on that, partly in response to the comments. I’m posting some follow-up thoughts in the hope that they might help clarify the original problems. Basically, it seems there are some sneaky things that TDT could try to do, but mostly they are not robust to slightly different variants of the problems. Interestingly, some of these variants look even “fairer”, since they contain no mentions of TDT anywhere in the problem statement.

An alternative approach could be that TDT resolves never to let itself be out-performed by any other decision theory, because of evolutionary considerations as discussed here, even if that requires a large sacrifice of immediate utility (e.g. two-boxing and taking $1000 along with CDT, rather than one-boxing and taking $1 million while CDT gets $1,001,000). I don’t currently know what to think about that, except that it makes my head spin; it also sounds like a rather Unfriendly form of AI.

Problem 1: Omega (who experience has shown is always truthful) presents the usual two boxes A and B and announces the following. “Before you entered the room, I ran a simulation of this problem as presented to an agent running TDT. I won’t tell you what the agent decided, but I will tell you that if the agent two-boxed then I put nothing in Box B, whereas if the agent one-boxed then I put $1 million in Box B. Regardless of how the simulated agent decided, I put $1000 in Box A. Now please choose your box or boxes.”

Consider the following sneaky strategy S, which TDT agents might try to adopt.

“Use every piece of evidence at your disposal to try to decide whether you are the simulated agent. If you can prove you are not the simulated agent, then 2-box; otherwise 1-box”.

We could imagine Omega making this strategy very easy for agents, by revealing the full source-code of the simulated agent “C-sim” as well as the full source code of the actual choosing agent “C-act”. By “full source code” here, we have to include not just code describing how the agent executes TDT, but also its exact computational state on entry to the room (or simulated room), including a database of existing beliefs and proven theorems, preferences, values etc. To be fully precise, it should also describe any computational constraints that the agent faces, including time-out after t steps, or an upper bound on memory usage m, if they exist. The choosing agent then checks whether C-sim = C-act, and unless there is an exact match, it two-boxes. (If Omega is not kind enough to reveal C-act, a TDT agent may be able to generate it anyway via a quine-program construction.)
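As a rough illustration of strategy S, here is a minimal Python sketch. The `FullSource` record and its field names are my own hypothetical stand-ins for "full source code" (code, entry state, and any resource bounds); a real TDT agent would of course not be a four-field record. Passing `None` for C-sim models the case where Omega does not reveal it, so the agent cannot prove it is distinct from the simulation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FullSource:
    """Hypothetical stand-in for an agent's full source code."""
    code: str                         # how the agent executes TDT
    state: str                        # beliefs, theorems, preferences on entry
    max_steps: Optional[int] = None   # time-out t, if any
    max_memory: Optional[int] = None  # memory bound m, if any


def strategy_s(c_act: FullSource, c_sim: Optional[FullSource]) -> str:
    """Two-box only when the agent can prove it is not the simulated agent."""
    if c_sim is not None and c_sim != c_act:
        return "two-box"   # exact comparison shows a mismatch
    return "one-box"       # exact match, or C-sim unknown: no proof available


clone = FullSource(code="tdt", state="beliefs-v1", max_steps=10**6)
other = FullSource(code="tdt", state="beliefs-v2", max_steps=10**6)

print(strategy_s(clone, clone))  # one-box
print(strategy_s(clone, other))  # two-box
print(strategy_s(clone, None))   # one-box
```

Note that the comparison is over the whole record, so two agents running identical TDT code but entering the room with different beliefs count as different agents, which is exactly the "de-linking" discussed next.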

There seem to be a few difficulties with strategy S:

1. Each particular C-act will still under-perform on a matching problem, though arguably that is an “unfair” problem (specifically tailored by Omega to get that C-act to shoot itself in the foot).

2. It looks unattractive for different “flavours” of TDT agent to de-link themselves like this. Aren’t we going to run into problems with the different flavours refusing to co-operate with each other on Prisoner’s Dilemma, and only co-operating with exact clones? One answer is “We don’t know until we perform a formal analysis”. Another answer is “No, the different flavours of TDT are all linked at the strategy level, since they all follow strategy S; they are just de-linked at the execution level, since they do something different based on their different epistemic circumstances.” That’s slightly more convincing, but I still feel a need for analysis. One concern is if there are problems where some TDT agents/utilities benefit from a fine-grained de-linking of execution and others don’t; what then is the best overall TDT strategy?

3. It doesn’t solve all versions of the problem, since Omega doesn’t have to reveal the exact C-sim code which it used. Omega could just say that it picked C-sim’s code at random from a probability distribution across all full source-codes for TDT agents. (To model this as a formally-defined problem, we’d need to specify the distribution of course.) In such a case, no TDT agent can prove it is distinct from the simulation, so by strategy S, all TDT agents will one-box. And CDT will win against all flavours of TDT.

It is still arguable that the problem is “unfair” to TDT as a whole, but now suppose that Omega samples its C-sim from a probability distribution across multiple types of agent source-code, with TDT agents just part of the population. There is thus some probability p_t > 0 for the simulated agent being a TDT agent. If the difference in box values is big enough (roughly value_A / value_B < p_t; e.g. with the values above, 1000/1000000 = 1/1000 < p_t) then a TDT agent would still maximize expected winnings by 1-boxing. This doesn’t seem particularly unfair to TDT, and yet CDT would still do better.
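The threshold can be checked with a short expected-value calculation. This is only a sketch under assumptions that are mine rather than part of the problem statement: a TDT simulation is taken to make the same choice as the TDT chooser, and `q_other` is a hypothetical parameter for the chance that a non-TDT simulation one-boxes. Under those assumptions the gap between one-boxing and two-boxing is p_t * value_B - value_A, so `q_other` drops out of the comparison.

```python
def tdt_expected_winnings(one_box: bool, p_t: float, q_other: float,
                          value_a: float = 1_000,
                          value_b: float = 1_000_000) -> float:
    """Expected winnings for a TDT chooser whose choice is mirrored by TDT sims.

    p_t     -- probability that the simulated agent is a TDT agent
    q_other -- assumed probability that a non-TDT simulation one-boxes
    """
    # Box B is full if a TDT sim one-boxed (mirroring us) or a non-TDT sim did.
    p_full = p_t * (1.0 if one_box else 0.0) + (1.0 - p_t) * q_other
    return p_full * value_b + (0.0 if one_box else value_a)


def cdt_expected_winnings(tdt_one_boxes: bool, p_t: float, q_other: float,
                          value_a: float = 1_000,
                          value_b: float = 1_000_000) -> float:
    """A CDT chooser two-boxes; Box B's contents do not depend on its choice."""
    p_full = p_t * (1.0 if tdt_one_boxes else 0.0) + (1.0 - p_t) * q_other
    return p_full * value_b + value_a


for p_t in (0.01, 0.0005):  # above and below the 1/1000 threshold
    one = tdt_expected_winnings(True, p_t, q_other=0.5)
    two = tdt_expected_winnings(False, p_t, q_other=0.5)
    print(p_t, "one-box" if one > two else "two-box")
```

With p_t = 0.01 the TDT agent does better by one-boxing, and with p_t = 0.0005 by two-boxing. In the first case a CDT agent facing the same boxes still comes out ahead by exactly value_A, since it collects Box A on top of whatever Box B holds.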

An alternative strategy to S is what I’d informally call “TDT uber alles”. It relies on long-range consequentialism, and perhaps “utility trading” as well (for TDT agents that don’t inherently care about long-range consequences). A TDT agent might argue to itself “If TDT beats CDT (and other theories) at each and every available opportunity, and at least matches them where it can’t beat them, then TDT will come to dominate the agent space as quickly as possible, which will maximize my expected utility. So I’m not going to let CDT beat me here: I’ll two-box”.

Problem 2: Our ever-reliable Omega now presents ten boxes, numbered from 1 to 10, and announces the following. “Exactly one of these boxes contains $1 million; the others contain nothing. You must take exactly one box to win the money; if you try to take more than one, then you won’t be allowed to keep any winnings. Before you entered the room, I ran multiple simulations of this problem as presented to an agent running TDT, and determined the box which the agent was least likely to take. If there were several such boxes tied for equal-lowest probability, then I just selected one of them, the one labelled with the smallest number. I then placed $1 million in the selected box. Please choose your box.”
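Omega's selection rule is simple enough to write down directly. The sketch below assumes Omega knows the simulated agent's exact box probabilities (represented with `Fraction` to avoid rounding); the function names are mine. It shows why an agent whose distribution matches the simulation's wins with probability equal to its own smallest box probability, which can never exceed 1/10, while an agent that simply takes Box 1 against a uniform simulation always wins.

```python
from fractions import Fraction


def omega_pick(sim_probs):
    """1-based label of the least likely box; ties go to the smallest label."""
    return sim_probs.index(min(sim_probs)) + 1


def win_chance(agent_probs, sim_probs):
    """Chance that an agent choosing by agent_probs takes the box Omega filled."""
    return agent_probs[omega_pick(sim_probs) - 1]


uniform = [Fraction(1, 10)] * 10
skewed = [Fraction(1, 2)] + [Fraction(1, 18)] * 9   # favours Box 1
box_one = [Fraction(1)] + [Fraction(0)] * 9         # always takes Box 1

print(win_chance(uniform, uniform))  # 1/10
print(win_chance(skewed, skewed))    # 1/18
print(win_chance(box_one, uniform))  # 1
```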

My original analysis gave TDT no more than 10% chance of winning the $1 million. However, here’s something a bit cleverer. Suppose each particular TDT agent has a favourite number in the set {1, 2} constructed as a balanced function over TDT full source-codes. (One way I imagine doing this is by taking a SHA-256 hash of the full source-code, extracting the first bit, and adding 1.) Each agent chooses the box of their favourite number with probability 1. Since the simulated agent does so as well, by the rules of the game, the $1 million must be in one of boxes 1 or 2. And unless the full source-code of the simulated agent is known, it will not be possible for the choosing agent to tell whether Box 1 or 2 was picked by the sim, so that if the choosing agent picks their own favourite box, they have a 50% chance of winning. And CDT won’t do any better.
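The hash construction in parentheses can be sketched in a few lines of Python using the standard `hashlib` module. I am assuming here that the "full source-code" has been serialised to bytes somehow, and that "first bit" means the most significant bit of the first digest byte; both are illustrative choices.

```python
import hashlib


def favourite_number(full_source: bytes) -> int:
    """Map a full source-code to a favourite box in {1, 2}."""
    digest = hashlib.sha256(full_source).digest()
    first_bit = digest[0] >> 7   # most significant bit of the first byte
    return first_bit + 1


def money_box(sim_source: bytes) -> int:
    """Where Omega puts the money when the sim takes its favourite box for sure.

    The sim's favourite has probability 1 and every other box has probability
    0, so the smallest-numbered box that is NOT the favourite is selected.
    """
    return 2 if favourite_number(sim_source) == 1 else 1


print(favourite_number(b"example TDT agent, serialised"))
```

A choosing agent whose favourite differs from the sim's finds the money in its own favourite box; one whose favourite coincides with the sim's does not. If the function is balanced and the sim's source is unknown, that is the 50% chance described above.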

Omega could prevent this approach by the information it reveals. If it reveals the full source-code for C-sim (and in its simulation, presents this same source-code C-sim to C-sim itself) then TDT shouldn’t try to execute the “favourite number” strategy. A better strategy is to pick each of the ten boxes with equal probability on finding that C-act = C-sim, and to pick Box 1 with certainty on finding that C-act differs from C-sim.

Or, much as for Problem 1, Omega can vary the problem as follows:

“...Before you entered the room, I ran multiple simulations of this problem as presented to different randomly-selected TDT agents. I determined which box they were collectively least likely to take...” (Again this needs a distribution to be specified to become formally precise.)

There doesn’t seem much that TDT agents can do about that, except to give a collective groan, and arrange that TDT collectively selects each of the ten boxes with equal probability. The simplest way to ensure that is for each TDT agent individually to select the boxes with equal probability (so each individual agent at least gets an equal chance at the prize). And any other agent just takes Box 1, laughing all the way to the bank.

Consider a final variant as follows:

“...Before you entered the room, I ran multiple simulations of this problem as presented to different agents, sampled uniformly from different possible future universes according to their relative numbers, with the universes themselves sampled from my best projections of the future. I determined the box which the agents were least likely to take...”

If TDT uber alles is the future, then almost all the sampled agents will be TDT agents, so the problem is essentially as before. And now it doesn’t look like Omega is being unfair at all (nothing discriminatory in the problem description). But TDT is still stuck, and can get beaten by CDT in the present.

One thought is that the TDT collective should vary the box probabilities very very slightly, so that Omega can tell which has the lowest probability, but regular CDT agents can’t; in that case CDT also has at most a 10% chance of winning. Possibly, the computationally-advanced members of the collective toss a logical coin (which only they and Omega can compute) to decide which box to de-weight; the less advanced members (the ones who actually have to compete against rival decision theories) just pick at random. If CDT tries to simulate TDT instances, it will detect equal probabilities, pick Box 1 and most likely get it wrong...
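To make the de-weighting idea a little more concrete, here is a toy sketch. The collective's distribution is uniform except that one target box is lowered by a tiny `eps`, with the difference spread over the other nine. Exact rational arithmetic stands in for Omega, and ordinary floating-point rounding stands in for an observer with limited computational precision; that second stand-in is purely my illustrative assumption, not a claim about how a CDT agent would actually model TDT.

```python
from fractions import Fraction


def deweighted(n: int, target: int, eps: Fraction):
    """Near-uniform distribution over n boxes with box `target` lowered by eps."""
    base = Fraction(1, n)
    bump = eps / (n - 1)   # the removed weight is shared by the other boxes
    return [base - eps if box == target else base + bump
            for box in range(1, n + 1)]


def omega_pick(probs):
    """1-based label of the least likely box; ties go to the smallest label."""
    return probs.index(min(probs)) + 1


collective = deweighted(10, target=7, eps=Fraction(1, 10**30))

exact_view = omega_pick(collective)                       # sees the dip
coarse_view = omega_pick([float(p) for p in collective])  # dip is rounded away

print(exact_view)   # 7
print(coarse_view)  # 1
```

The coarse observer sees ten equal probabilities and falls back on Box 1, which is right only if the de-weighted box happened to be Box 1.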

Edit 2: I’ve clarified the alternative to the “favourite number” strategy if Omega reveals C-sim in Problem 2. We can actually get a range of different problems and strategies by slight variations here. See the comment below from lackofcheese, and my replies.