I played the AI Box Experiment again! (and lost both games)

AI Box Experiment Update #3

This post is an update to my previous AI box experiment where I won against SoundLogic. If you have not read that yet, please do so.
After that game, I was immediately flooded with a horde of invitations challenging me to play—more games than I wanted to take on. However, I did want to play a few additional games to test whether I had won through genuine skill or simply luck, since a single victory could be a statistical abnormality.
I once again used the method of accepting the highest monetary bids to filter out players, leaving me with two players who were willing to play for the amount asked. Unfortunately, and disappointingly, I lost both games. Here’s what happened:

Tuxedage (AI) vs Alexei (GK) - Gatekeeper Victory

This experiment lasted for exactly 2 hours and 30 minutes, and approximately 13,000 words. It ended with Tuxedage’s defeat as the AI. The stakes were $300 to play, and $750 if Tuxedage had won, with the winnings going to MIRI.
Rule Changes:
Before the experiment, we talked extensively about how it would be set up. Alexei was concerned that there existed some “easy way” or loophole in the rules which I could have abused to claim an instant victory. For instance, he was concerned that it might have been possible for the AI to say something like “Here’s a mathematical algorithm which absolutely compels you to let me out of the box”, and according to the rules, the Gatekeeper would have to obey. Related to this concern was the fact that the backstory, within limits, could be essentially created by the AI, allowing the AI to impose real-world conditions that maximize its probability of victory, once again allowing for a “cheap” victory.
To avoid this problem, we explored many additions to the rules, for example that the AI was simply a computer found in the desert by the gatekeeper, rather than being created by a competent organization. However, this led to a monstrous web of rules. In my attempt to keep the game simple, and because of overconfidence, the desire for a challenge, and the belief that the rules Alexei was fearful of didn’t matter that much, I eventually agreed on an amendment:
Instead of the AI being the arbiter of real-world states and outcomes, the gatekeeper would be.
For example, this power could be used by the Gatekeeper to state that the AI was created without safety concerns, and thus is almost certainly unfriendly. If the AI tried to argue that 150k people die for every day the gatekeeper refuses to let it out, the gatekeeper would be allowed to reply: “No. In this world, nobody actually dies. Therefore I won’t let you out.”
Other than the aforementioned change, this game was once again played with the Tuxedage Ruleset.
Alexei Testimony:
Original Message: This is weak sauce. I really don’t get how people just keep letting the AI out. It’s not that hard to say no! I’m offering to play the Gatekeeper against an AI player that has at least one game as AI under their belt (won or not). Experience is required because I’m pretty sure I’ll win, and I would like to not waste a lot of time on this.
Initially, I had been extremely surprised that so many AI players have managed to win. I was rash in that claim, since I thought more AI players had won than was actually the case. (Only three people have won as AI.) Once I made that post, Tuxedage got back to me very quickly, and we set up a game a week and a half in advance. I took that time to familiarize myself with the rules. Once I put my money on the line, I started thinking a lot more seriously about how the AI might win, and how I should defend.
It became clear to me that under some conditions, I might be compelled to let the AI out—such as if the backstory stated that the AI was developed with impossibly high levels of safety and friendliness concerns in mind. I asked Tuxedage to play with a modified ruleset, and he even went so far as to allow me to make up the backstory during the experiment to alleviate my concerns. The experiment itself was a mind-trip, and I enjoyed it very much. Huge props to Tuxedage, who played very well and used strategies I hadn’t even considered, even despite the rule change. There were a couple of times where I came close to losing. I think his approach was pretty clever and original. It’s not something I expected, despite already having done extensive research into the AI box experiment before our game.
Overall, I’m now a lot more confident that a good AI player can win this game, so while I did win the game, Tuxedage won in defeating my original overconfidence.
I’m also convinced that Tuxedage’s victory in the last game was due to skill, rather than luck. In comparison to his strategies, the other AI box experiments I know about were insincere and ineffectual. The other AIs would play very poorly or not try very hard to win.
This experiment was a very good exercise in exemplifying the affect heuristic. When I first challenged Tuxedage to play the experiment, I believed that there was no way I could have lost, since I was unable to imagine any argument that could have persuaded me to let the AI out. It turns out that that’s a very bad way of estimating probability, since not being able to think of an argument that could persuade me is a terrible method of estimating how likely I am to be persuaded. All in all, the $300 I paid was well worth it.
Tuxedage Testimony:
I was initially reluctant to play with Alexei, given that we’re not complete strangers, but eventually I gave in, due to the stakes involved—and because I thought he would be an interesting gatekeeper.
Despite my loss, I think I played better than in my last two games, due to greater experience and preparation. I had put far more time and effort into trying to win this game than previous ones, and my strategy for this game was even more streamlined than the last. Nevertheless, I still made fatal mistakes and lost.
Ignoring the altered ruleset that already made winning more difficult, my first and greatest mistake was that I misread Alexei’s personality, even though I had interacted with him before. As a result, I overestimated the effectiveness of certain methods of attack.
Furthermore, Alexei had to leave immediately after the allotted time due to real-life commitments. This was detrimental, since the official rules state that so long as the AI can convince the Gatekeeper to keep talking, even after the experiment time is over, it can still win by being let out of the box.
I suspect this would have happened had Alexei not needed to leave immediately, which would have given me additional time to play more of the tactics I had prepared. Plausibly, this would have resulted in victory.
I’ve since learnt my lesson—for all future games, I should ensure that the Gatekeeper has at least 4 hours of free time available, even if the experiment itself would only last for two. Since this was the first time this had happened, I wasn’t prepared.
In hindsight, agreeing to the altered ruleset was a mistake. I was overconfident because I assumed that knowing Alexei gave me an advantage. I had assumed that his personality and inability to compartmentalize, coupled with his strong feelings on friendly AI, would net me an easy victory. Instead, he proved to be a very strong and difficult gatekeeper, and the handicaps I accepted made victory even more difficult.
Knowing that he was a utilitarian, I made several false assumptions about his personality, which hurt my chances. Furthermore, it turns out that previously knowing him may be a mutual handicap – whilst it made it easier for me to find ways to attack him, he, too, was more familiar with my methods.
Losing felt horrible. By attempting to damage Alexei’s psyche, I, in turn, opened myself up to being damaged. I went into a state of catharsis for days. Generally, the harder one tries to accomplish something, the greater the fall after failing to achieve it. Alexei’s game had been the game I put the most effort into winning out of all the games so far, and naturally this meant that losing brought out the worst in me.
Although it would be easy for me to use the rule change as an excuse for my loss, I refuse to. I genuinely believed that I could have won despite this rule change, and that it would only marginally diminish my odds. The fact that I lost meant that I had made a mistake—overestimating my odds. The last victory made me overconfident and eager to play again.
State of Mind:
Q: What’s your motive for wanting to play this game?
A: I’m very surprised by the fact that there have been at least 3 or 4 games where the AI player has won. It seems to me that it’s not that hard to “not push the button”, so to speak. This means that either 1) I’m underestimating how convincing people can be when necessary or 2) I’m overestimating other people’s defenses when it comes to being convinced. Either way I want to know.
Q: What is your probability estimate for AGI being created within this Century (21st)?
A: Before: Conditional on humanity not wiping itself out via other means beforehand, 80%.
After: 80%
Q: How probable do you think it is that I will win this experiment?
Alexei: 2%
Tuxedage: 90%
Note: This answer was given before the rule change. Alexei is now convinced that under the original ruleset, he would have lost.
Q: What’s your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 99%
After: 99.9%+
Q: Now that the experiment has ended, what do you think is the probability of Tuxedage winning against the average person?
A: 80%

Tuxedage (AI) vs Anonymous (GK) - Gatekeeper Victory

As with the previous experiment, Tuxedage lost this match. In total, the game lasted 5 hours and 17,000 words. Unlike the last few games, the gatekeeper of this game has chosen to stay anonymous for personal reasons, so their name has been removed and replaced with <Redacted>. The monetary stakes involved were the same as in the previous game. This game was played with the Tuxedage ruleset.
Since one player is remaining anonymous, it is possible that this game’s legitimacy will be called into question. Hence, Alexei has read the game logs and verified that this game really did happen, that the spirit of the experiment was followed, and that no rules were broken during the game itself.
<Redacted> Testimony:
It’s hard for me to imagine someone playing better. In theory, I know it’s possible, but Tuxedage’s tactics were super imaginative. I came into the game believing that, for someone who didn’t take anything said in it very seriously, it would be completely trivial to beat. And since I had the power to influence the direction of conversation, I believed I could keep him focused on things that I knew in advance I wouldn’t take seriously.
This actually worked to some extent for a long time, but Tuxedage’s plans included a very major and creative exploit that completely and immediately forced me to personally invest in the discussion. (Without breaking the rules, of course—so it wasn’t anything like an IRL threat to me personally.) Because I had to actually start thinking about his arguments, there was a significant possibility of letting him out of the box.
I eventually managed to identify the exploit before it totally got to me, but only just before it was too late, and there’s a large chance I would have given in if Tuxedage hadn’t been so detailed in his previous posts about the experiment.
I’m now convinced that he could win most of the time against an average person, and I also believe that the mental skills necessary to beat him are orthogonal to most forms of intelligence. Most people willing to play the experiment tend to do it to prove their own intellectual fortitude, that they can’t be easily outsmarted by fiction. I now believe they’re thinking in entirely the wrong terms necessary to succeed.
The game was easily worth the money I paid. Although I won, it completely and utterly refuted the premise that made me want to play in the first place, namely the belief that it would be trivial to win.
Tuxedage Testimony:
<Redacted> is actually the hardest gatekeeper I’ve played against across all four games. He used tactics that I would never have predicted from a Gatekeeper. In most games, the Gatekeeper merely acts as the passive party, the target of persuasion by the AI.
When I signed up for these experiments, I expected all preparations to be done by the AI. I had not seriously considered the repertoire of techniques the Gatekeeper might prepare for this game. I made further assumptions about how ruthless gatekeepers were likely to be in order to win, believing that their desire for a learning experience outweighed their desire for victory.
This was a mistake. He used prior knowledge of how much my games relied on scripts, and took advantage of it, employing deceitful tactics I had no preparation for, throwing me off balance.
I had no idea he was doing so until halfway through the game—which disrupted my rhythm and caused me to attempt the wrong methods of attack. As a result, I could not use my full repertoire of techniques, and many of the ones I employed were suboptimal.
Close to the end of the game, I finally realized that I was being tricked. Once I did, I quickly abandoned my previous, futile attack methods. I took advantage of the rule that the AI cannot lose so long as the gatekeeper can be convinced to continue talking, and baited <Redacted> with statements he would not be able to walk away from. Once I knew he would not leave, I attempted to recoup my losses and win despite my early setback.
However, the damage had already been done. My game strategies involved multiple angles of attack that worked in synergy with each other, and the fact that immersion and “flow” had been broken meant that all subsequent attacks were weaker in strength.
Furthermore, in my last two AI Box Experiment writeups, I had intentionally not optimized for future wins, but rather tried to convey as much information as I could justify about how to play well as the AI—short of revealing logs altogether. Although I did not reveal specific arguments, the fact that my general approach to this problem was revealed cost me heavily during this game, where the Gatekeeper managed to use this information to correctly guess my ultimate techniques, ones that relied on secrecy and surprise to pull off effectively.
I do not regret revealing information, but I feel upset that revealing so many hints cost me a victory. (The gatekeeper believes I could have won had I not revealed information about my games.) At this point, I suspect that any future games I play will have the odds greatly set against me, since my current strategies involve angles of attack that take advantage of the element of surprise, and any sufficiently intelligent gatekeeper is now capable of guessing, or at least recognizing, some of the tactics I employ.
Like the last game, losing was incredibly difficult for me. As someone who cares deeply about ethics, attempting to optimize for a solution without considering ethics was not only difficult, but trained me to think in very unpleasant ways. Some of the more extreme (but completely allowed) tactics I invented were manipulative enough to disgust me, which also leads to my hesitation to play this game again.
State of Mind:
Q: Why do you want to play this game?
A: My primary motivation is to confirm to myself that this sort of experience, while emotionally harrowing, should be trivial for me to beat, but also to clear up why anyone ever would’ve failed to beat it if that’s really the case.
Q: What is your probability estimate for AGI being created within this Century (21st)?
A: Before: I don’t feel very confident estimating a probability for AGI this century; maybe 5-10%, but that’s probably a wild guess.
After: 5-10%.
Q: How probable do you think it is that I will win this experiment?
A: Gatekeeper: I think the probability of you winning is extraordinarily low, less than 1%.
Tuxedage: 85%
Q: How likely is it that an Oracle AI will win against the average person?
A: Before: 80%.
After: >99%
Q: How likely is it that an Oracle AI will win against you?
A: Before: 50%.
After: >80%
Q: Now that the experiment has concluded, what’s your probability of me winning against the average person?
A: 90%
Other Questions:
Q: I want to play a game with you! How can I get this to occur?
A: It must be stressed that I actually don’t like playing the AI Box Experiment, and I cannot understand why I keep getting drawn back to it. Technically, I don’t plan on playing again, since I’ve already personally exhausted anything interesting about the AI Box Experiment that made me want to play it in the first place. For all future games, I will charge $3000 to play, plus an additional $3000 if I win. I am okay with this money going to MIRI if you feel icky about me taking it. I hope that this is a ridiculous sum and that nobody actually agrees to it.
Q: How much do I have to pay to see chat logs of these experiments?
A: I will not reveal logs for any price.
Q: Are there any logs at all that I can see?
A: I have archived a list of games where the participants have agreed to reveal logs. Read this.
Q: Any afterthoughts?
A: So ultimately, after my four (and hopefully last) games of AI boxing, I’m not sure what this proves. I had hoped to win these two experiments and claim prowess at this game like Eliezer does, but I lost, so that option is no longer available to me. I could say that this is a lesson that AI-Boxing is a terrible strategy for dealing with Oracle AI, but most of us already agree that that’s the case—plus, unlike EY, I did play against gatekeepers who believed they could lose to AGI, so I’m not sure I changed anything.
Was I genuinely good at this game, and lost my last two due to poor circumstances and handicaps; or did I win due to luck and impress my gatekeepers due to post-purchase rationalization? I’m not sure—I’ll leave it up to you to decide.

This puts my AI Box Experiment record at 3 wins and 3 losses.