Decision Theories, Part 3.5: Halt, Melt and Catch Fire

Followup to: Decision Theories: A Semi-Formal Analysis, Part III

UPDATE: As it turns out, rumors of Masquerade’s demise seem to have been greatly exaggerated. See this post for details and proofs!

I had the chance, over the summer, to discuss the decision theory outlined in my April post with a bunch of relevantly awesome people. The sad part is, there turned out to be a fatal flaw once we tried to formalize it properly. I’m laying it out here, not with much hope that there’s a fix, but because sometimes false starts can be productive for others.

Since it’s not appropriate to call this decision theory TDT, I’m going to use a name suggested in one of these sessions and call it “Masquerade”, which might be an intuition pump for how it operates. So let’s first define some simple agents called “masks”, and then define the “Masquerade” agent.

Say that our agent has actions a1, … , an, and the agent it’s facing in this round has actions b1, … , bm. Then for any triple (bi, aj, ak), we can define a simple agent Mask_ijk which takes in its opponent’s source code and outputs an action:

def Mask_ijk(opp_src):
    look for proof that Opp(Mask_ijk) = bi
    if one is found, then output aj
    otherwise, output ak

(This is slightly less general than what I outlined in my post, but it’ll do for our purposes. Note that there’s no need for aj and ak to be distinct, so constant strategies fall under this umbrella as well.)
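
For instance, on the Prisoner’s Dilemma (with actions C and D), taking aj = ak gives constant strategies like the CooperateBot and DefectBot that show up below, written in the same style:

def CooperateBot(opp_src):
    output C no matter what the proof search finds

def DefectBot(opp_src):
    output D no matter what the proof search finds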

A key example of such an agent is what we might call FairBot: on a Prisoner’s Dilemma, FairBot tries to prove that the other agent cooperates against FairBot, and if it finds such a proof, then it immediately cooperates. If FairBot fails to find such a proof, then it defects. (An important point is that if FairBot plays against itself and both have sufficiently strong deductive capacities, then a short proof of one’s cooperation gives a slightly longer proof of the other’s cooperation, and thus in the right circumstances we have mutual cooperation via Löb’s Theorem.)
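
Written out in the same pseudocode style, FairBot is just the mask that searches for a cooperation proof, cooperates if it finds one, and defects otherwise:

def FairBot(opp_src):
    look for proof that Opp(FairBot) = C
    if one is found, then output C
    otherwise, output D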

The agent Masquerade tries to do better than any individual mask (note that FairBot foolishly cooperates against CooperateBot when it could trivially do better by defecting). My original formulation can be qualitatively described as trying on different masks, seeing which one fares the best, and then running a “sanity check” to see if the other agent treats Masquerade the same way it treats that mask. The pseudocode looked like this:

def Masquerade(opp_src):
    for each (i,j,k), look for proofs of the form “Mask_ijk gets utility u against Opp”
    choose (i,j,k) corresponding to the largest such u found
    look for proof that Opp(Masquerade) = Opp(Mask_ijk)
    if one is found, then output the same thing as Mask_ijk(Opp)
    otherwise, output a default action

(The default should be something safe like a Nash equilibrium strategy, of course.)
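
As a quick illustration of the intended behavior, here is roughly how those steps should play out when Masquerade faces DefectBot (assuming the usual Prisoner’s Dilemma payoff ordering, with the temptation payoff above mutual cooperation above mutual defection above the sucker payoff):

Masquerade vs. DefectBot:
    step 1: “the constant defecting mask gets the mutual-defection payoff against DefectBot” has a short proof,
            while a mask that ends up cooperating can only be proven to get the sucker payoff
    step 2: so the largest u that the search turns up belongs to a mask that defects against DefectBot
    step 3: the sanity check, that DefectBot(Masquerade) = DefectBot(Mask_ijk), is trivially provable,
            since DefectBot defects against everything
    result: Masquerade plays its chosen mask and defects, as it should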

Intuitively, when Masquerade plays the Prisoner’s Dilemma against FairBot, Masquerade finds that the best utility against FairBot is achieved by some mask that cooperates, and then Masquerade’s sanity-check is trying to prove that FairBot(Masquerade) = C just as FairBot is trying to prove that Masquerade(FairBot) = C, and the whole Löbian circus goes round again. Furthermore, it’s intuitive that when Masquerade plays against another Masquerade, the first one notices the proof of the above, and finds that the best utility against the other Masquerade is achieved by FairBot; thus both pass to the sanity-check stage trying to imitate FairBot, both seek to prove that the other cooperates against them, and both find the Löbian proof.
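
For readers who want the Löbian step spelled out: write ⊢ for provability in the agents’ shared formal system and □X for “X is provable”, and let A and B stand for “the first agent cooperates with the second” and vice versa. The intended shape of the argument is this schematic sketch, not yet a claim about Masquerade:

    ⊢ □B → A  and  ⊢ □A → B
        together give  ⊢ □(A ∧ B) → (A ∧ B),
        and Löb’s Theorem then gives  ⊢ A ∧ B.

For two FairBots, both premises come straight from their source code; the question is whether Masquerade can supply its half of the circle.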

So what’s wrong with this intuitive reasoning?

Problem: A deductive system can’t count on its own consistency!

Let’s re-examine the argument that Masquerade cooperates with FairBot. In order to set up the Löbian circle, FairBot needs to be able to prove that Masquerade selects a mask that cooperates with FairBot (like CooperateBot or FairBot). There are nice proofs that each of those masks attains the mutual-cooperation payoff against FairBot, but we also need to be sure that some other mask won’t get the very highest (I defect, you cooperate) payoff against FairBot. Now you and I can see that this must be true, because FairBot simply can’t be exploited that way. But crucially, FairBot can’t deduce its own inexploitability without thereby becoming exploitable (for the same Gödelian reason that a formal system can’t prove its own consistency unless it is actually inconsistent)!
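
To spell the obstacle out a little more (a sketch, assuming the masks’ bounded proof searches provably halt, so that what each mask actually plays is itself a provable fact): by construction,

    FairBot(M) = C  only if  the proof system proves “M(FairBot) = C”,
    so  M(FairBot) = D and FairBot(M) = C  would mean the system proves both the false “M(FairBot) = C”
        and the true “M(FairBot) = D”, i.e. the system is inconsistent.

In other words, “no mask gets the (defect, cooperate) payoff against FairBot” is a consequence of the consistency of FairBot’s own proof system, and Gödel’s second incompleteness theorem says that is exactly the premise the system cannot supply about itself.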

Now, the caveats to this are important: if FairBot’s deductive process is sufficiently stronger than the deductive process that’s trying to exploit it (for example, FairBot might have an oracle that can answer questions about Masquerade’s oracle, or FairBot might look for proofs up to length 2N while Masquerade only looks up to length N), then it can prove (by exhaustion if nothing else) that Masquerade will select a cooperative mask after all. But since Masquerade needs to reason about Masquerade at this level, this approach goes nowhere. (At first, I thought that having a weaker oracle for Masquerade’s search through masks, and a stronger oracle both for each mask and for Masquerade’s sanity-check, would solve this. But that doesn’t get off the ground: the agent thus defined attains mutual cooperation with FairBot, but not with itself, because the weaker oracle can’t prove that it attains mutual cooperation with FairBot.)

Another caveat is the following: FairBot may not be able to rule out the provability of some statement we know is false, but (given a large enough deductive capacity) it can prove that a certain result is the first of its kind in a given ordering of proofs. So if our agents act immediately on the first proof they find, then we could make a version of Masquerade work… as long as each search does find a proof, and as long as that fact is provable by the same deduction system. But there’s an issue with this: two masks paired against each other won’t necessarily have provable outcomes!

Let’s consider the following mask agent, which we’ll call AntiFairBot: it searches for a proof that its opponent cooperates against it, and it defects if it finds one; if it doesn’t find such a proof, then it cooperates. This may not be a very good agent, but it has one interesting property: if you pit AntiFairBot against FairBot, and the two of them use equivalent oracles, then it takes an oracle stronger than either to deduce what the two of them will do! Thus, Masquerade can’t be sure that AntiFairBot won’t get the highest payoff against FairBot (which of course it won’t) unless it uses a stronger deduction system for the search through masks than FairBot uses for its proof search (which would mean that FairBot won’t be able to tell what mask Masquerade picks).
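
Written as a mask in the same pseudocode style, AntiFairBot is just FairBot with its two outputs swapped:

def AntiFairBot(opp_src):
    look for proof that Opp(AntiFairBot) = C
    if one is found, then output D
    otherwise, output C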

I tried to fix this by iterating over only some of the masks; after all, there’s no realistic opponent against whom AntiFairBot is superior to both FairBot and DefectBot. Unfortunately, at this point I realized two things: in order to play successfully against a reasonable range of opponents on the Prisoner’s Dilemma, Masquerade needs to be able to imitate at least both FairBot and DefectBot; and FairBot cannot prove that FairBot defects against DefectBot. (There are variants of FairBot that can do so, e.g. a variant that searches both for proofs of cooperation and proofs of defection and plays symmetrically if it finds one, as sketched below; but such a variant is no longer guaranteed to cooperate against itself!)
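
Here is a sketch of such a variant in the same pseudocode style (the name is mine, and acting on whichever proof shows up first in the ordering is one way to make “plays symmetrically” precise):

def SymmetricFairBot(opp_src):
    search, in a fixed ordering, for a proof of Opp(SymmetricFairBot) = C or of Opp(SymmetricFairBot) = D
    if a cooperation proof is found first, then output C
    if a defection proof is found first, then output D
    if neither is found, then output D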

If there are any problems with this reasoning, or an obvious fix that I’ve missed, please point it out; but otherwise, I’ve decided that my approach has failed drastically enough that it’s time to do what Eliezer calls “halt, melt, and catch fire”. The fact that Löbian cooperation works is enough to keep me optimistic about formalizing this side of decision theory in general, but the ideas I was using seem insufficient to succeed. (Some variant of “playing chicken with my deductive system” might be a crucial component.)

Many thanks to all of the excellent people who gave their time and attention to this idea, both on and offline, especially Eliezer, Vladimir Slepnev, Nisan, Paul Christiano, Critch, Alex Altair, Misha Barasz, and Vladimir Nesov. Special kudos to Vladimir Slepnev, whose gut intuition on the problem with this idea was immediate and correct.