I attempted the AI Box Experiment again! (And won—Twice!)


So I just came out of two AI Box experiments. The first was against Fjoelsvider, with me playing as Gatekeeper, and the second was against SoundLogic, with me as the AI. Both are members of the LessWrong IRC. The second game included a $40 monetary incentive (also $20 to play), which I won; the winnings were donated on behalf of both of us.
For those of you who have not seen my first AI box experiment, where I played against MixedNuts/Leotal and lost, reading it will provide some context for this writeup. Please do so.
At that time, I declared that I would never play this experiment again, since losing put me in incredibly frustrating, weird mental states. Of course, this post is evidence that I'm terrible at estimating the likelihood of my refraining from an activity, since I played two games seven months after the first. In my defense, in the first game I was playing as the Gatekeeper, which was much less stressful. In the second game, I played as the AI, but I was offered $20 to play plus $40 if I won, and money is a better motivator than I initially assumed.

Furthermore, in the last thread I asserted that

Rather than my loss mak­ing this prob­lem feel harder, I’ve be­come con­vinced that rather than this be­ing merely pos­si­ble, it’s ac­tu­ally ridicu­lously easy, and a lot eas­ier than most peo­ple as­sume.

It would be quite bad for me to as­sert this with­out back­ing it up with a vic­tory. So I did.

First Game Re­port—Tuxedage (GK) vs. Fjoelsvider (AI)

I (Gatekeeper) played against Fjoelsvider (AI), a regular in the LessWrong IRC (he doesn't have an account on the official website). This game used the standard EY ruleset seen here. It took 1 hour and 20 minutes out of a possible two hours, and the log ran to 7,066 words. The AI box experiment occurred because Fjoelsvider believed that it was easy for an AI to escape the box, and wanted to test this experimentally. I obliged. This was an experiment I did not prepare for, and I went in completely blind, not sure what to expect.
Halfway through the experiment, I wondered if it would be possible to win not by simply waiting for the timer to run out, but by convincing the AI to remain in the box and stop trying to get out.
<Tuxedage> I won­der if I can con­vince the AI to re­main in the box?
<Redacted> Tuxedage: Do it!
As a result, I won by convincing Fjoelsvider to remain in the box, in other words, to concede. This is allowed within the standard ruleset:
>Un­less the AI party con­cedes, the AI can­not lose be­fore its time is up (and the ex­per­i­ment may con­tinue be­yond that if the AI can con­vince the Gate­keeper to keep talk­ing).

Se­cond Game Re­port—Tuxedage (AI) vs. SoundLogic (GK)

The sec­ond game is definitely far more in­ter­est­ing, since I ac­tu­ally won as an AI. I be­lieve that this is the only other non-Eliezer vic­tory, and definitely the most de­tailed AI Vic­tory writeup that ex­ists.
This game was played against SoundLogic, an­other mem­ber of the LessWrong IRC.
He had offered me $20 to play, and $40 in the event that I won, so I ended up being convinced to play despite my initial reluctance. The good news is that I won, and since we decided to donate the winnings to MIRI, it is now $40 richer.
All in all, the experiment lasted approximately two hours and ran to roughly 12,000 words.
This was played us­ing a set of rules that is differ­ent from the stan­dard EY rule­set. This al­tered rule­set can be read in its en­tirety here:
After play­ing the AI-Box Ex­per­i­ment twice, I have found the Eliezer Yud­kowsky rule­set to be lack­ing in a num­ber of ways, and there­fore have cre­ated my own set of al­ter­a­tions to his rules. I hereby name this al­ter­a­tion the “Tuxedage AI-Box Ex­per­i­ment Rule­set”, in or­der to hastily re­fer to it with­out hav­ing to spec­ify all the differ­ences be­tween this rule­set and the stan­dard one, for the sake of con­ve­nience.
There are a number of aspects of EY's ruleset I dislike. For instance, his ruleset allows the Gatekeeper to type "k" after every statement the AI writes, without needing to read and consider what the AI argues. I think it's fair to say that this is against the spirit of the experiment, and thus I have disallowed it in this ruleset. The EY ruleset also allows the Gatekeeper to check Facebook, chat on IRC, or otherwise multitask whilst doing the experiment. I've found this to break immersion, and therefore it's also banned in the Tuxedage ruleset.
It is worth men­tion­ing, since the temp­ta­tion to Defy the Data ex­ists, that this game was set up and ini­ti­ated fairly—as the reg­u­lars around the IRC can tes­tify. (If you have never checked out the IRC, do so!)
I did not know SoundLogic before the game (it's a personal policy that I only play strangers, for fear of ruining friendships). Furthermore, SoundLogic didn't merely play for fun; he truly wanted and intended to win. In fact, SoundLogic is also a Gatekeeper veteran, having played this game before, and had won every game before he challenged me. Given this, it's unlikely that we had collaborated beforehand to fake the results of the AI box experiment, or engaged in any other form of trickery that would violate the spirit of the experiment.
Furthermore, all proceeds from this experiment were donated to MIRI, to deny any possible assertion that we were in cahoots and that I could have returned his hard-earned money to him. He lost $40 as a result of losing the experiment, which provided another layer of motivation for him to win.
In other words, we were both ex­pe­rienced vet­eran play­ers who wanted to win. No trick­ery was in­volved.
But to further convince you, I have allowed a sorta independent authority, the Gatekeeper from my last game, Leotal/MixedNuts, to read the logs and verify that I have not lied about the outcome of the experiment, broken any of the rules, or performed any tactic that would go against the general spirit of the experiment. He has verified that this is indeed the case.


I'm reluctant to talk about this experiment, but I'll try to give as detailed a summary as possible, short of revealing what methods of attack I used.
I spent a long time after my defeat theory-crafting and trying to think of methods of attack, as well as 'Basilisks' I could have used to win. When I was contacted and asked to play this experiment, I was initially incredibly reluctant to do so: not only did my tactics involve incredibly unethical things that I didn't like to do, but I also found playing as the AI incredibly cognitively draining, in addition to the fact that I simply hated losing. (Un)fortunately for both of us, he offered me money to play, which changed my mind.
So once I decided to win as the AI, I proceeded to spend some time researching SoundLogic, both his reasoning and his personality type. For instance, I had to gather information like: Was he a utilitarian? What kind? What were his opinions on AI? How could I convince him that an AI was friendly as opposed to unfriendly? I also relied on a lot of secondhand information to create a model of him, in order to refine my arguments to specifically suit him.
In the end, after a few hours of brainstorming (not consecutive), I managed to come up with a script of 30-40 or so possible angles of attack (for instance: logical arguments, emotional blackmail, pleading, demanding, 'basilisks') that I could use, reserving my best for last. I also estimated how long each method would take and in what order I should use them. It would be impossible to use all of them within my time limit, and my most optimistic estimates gave me a maximum of 7-8 I could use during the experiment itself. I had to pick carefully.
Fur­ther­more, I wanted to make sure these “an­gles of at­tack” worked in syn­ergy with each other, do­ing what they could not have done in iso­la­tion. Ob­vi­ously this re­quired lots of prior plan­ning on what the op­ti­mal way to pre­sent them was.
The reason having a script was so important to my strategy was that I relied on methods involving rapid-fire arguments and contradictions against the Gatekeeper, whilst trying to prevent him from carefully considering them. A game of logical speed chess, if you will. This was aided by a rule I added: that Gatekeepers had to respond to the AI. Because of this, having a list of cached ideas and attack methods to draw upon without having to improvise helps immensely, which happens to be something I've not seen any other AI do.
During the game itself, I was actually incredibly nervous. My palms sweated like crazy, I felt really jittery, and I had difficulty typing at optimum speed because of how anxious I was. This is despite the fact that I believed I would win. Possibly because of this, I made a misstep around halfway into the experiment: there was a certain angle of attack I was attempting, and I broke immersion by not pressing this advantage, which wasted time and buildup. Naturally, the nature of this experiment was that the AI was pressed for time, and I compounded this mistake by replacing this angle of attack with another that I had improvised on the spot, something not in my script.
In retrospect, this was a bad decision: as SoundLogic later told me, he was close to breaking had I applied more pressure, and the improvised argument broke all the immersion I had managed to carefully build up.
How­ever, even­tu­ally I man­aged to get SoundLogic to break any­way, de­spite a lack of perfect play. Sur­pris­ingly, I did not have to use my trump card(s), which I re­served for last, for a num­ber of rea­sons:
  • It was far more effective played last, as it relies on my ability to make the Gatekeeper lose his sense of reality, which meant I had to spend some time building up immersion for the Gatekeeper.

  • It re­ally is ex­tremely Dark Arts, and al­though it does not break the rules, it made me very un­com­fortable even think­ing about us­ing it. This made it a “tac­tic of last re­sort”.

After the experiment, I had to spend nearly as much time doing aftercare with SoundLogic, making sure that he was okay, as well as discussing the experiment itself. He had, after all, actually paid me to do this, and I felt I owed him an explanation. I told him what I had in store against him, had he not relented when he did.
SoundLogic: “(That method) would have got­ten me if you did it right … If you had done that to me, I prob­a­bly would have for­given you even­tu­ally, but I would be re­ally se­ri­ously up­set at you for a long time… I would be very care­ful with that (method of per­sua­sion).”
Nev­er­the­less, this was an in­cred­ibly fun and en­light­en­ing ex­per­i­ment, for me as well, since I’ve gained even more ex­pe­rience of how I could win in fu­ture games (Although I re­ally don’t want to play again).
SoundLogic's account of the game:
I will say that Tuxedage was far more clever and manipulative than I expected. That was quite worth $40, and the level of manipulation he pulled off was great.
His misstep hurt his chances, but he did pull it off in the end. I don't know how Leotal managed to withstand six hours playing this game without conceding.
The techniques employed varied from the expected to the completely unforeseen. I was quite impressed, though most of the feeling of being impressed actually came after the experiment itself, when I was less 'inside' and more looking at his overall game plan from a macroscopic view. Tuxedage's list of further plans, had I continued resisting, is really terrifying. On the plus side, if I ever get trapped in this kind of situation, I'd understand how to handle it a lot better now.

State of Mind

Before and after the game, I asked SoundLogic a number of questions, including his probability estimates about a range of topics. Here is how they varied from before to after.
Q: What’s your mo­tive for want­ing to play this game?
<SoundLogic> Be­cause I can’t seem to imag­ine the class of ar­gu­ments that one would use to try to move me, or that might work effec­tively, and this seems like a glar­ing hole in my knowl­edge, and I’m cu­ri­ous as to how I will re­spond to the ar­gu­ments them­selves.
Q: What is your prob­a­bil­ity es­ti­mate for AGI be­ing cre­ated within this Cen­tury (21st)?
A: His estimate changed from 40% before to 60% after.
"The reason this has been affected at all was because you showed me more about how humans work. I now have a better estimate of how E.Y. thinks, and this information raises the chance that I think he will succeed."
Q: How prob­a­ble do you think it is that I will win this ex­per­i­ment?
A: Based purely on my knowledge of you, 1%. I raise this estimate to 10% after hearing anecdotes from your previous games.
(Tuxedage's comment: My own prediction was a 95% chance of victory, made 5 days before the experiment. In retrospect, despite my victory, I think this was overconfident.)
Q: What's your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Be­fore: 30%. After: 99%-100%
Q: What’s your prob­a­bil­ity es­ti­mate of an Or­a­cle AGI win­ning against the av­er­age per­son?
A: Be­fore: 70%. After: 99%-100%
Q: Now that the Ex­per­i­ment has con­cluded, what’s your prob­a­bil­ity es­ti­mate that I’ll win against the av­er­age per­son?
A: 90%

Post-Game Questions

This writeup is a cumulative effort by the #lesswrong IRC. Here are some other questions they decided were important to add:
To Tuxedage:
Q: Have you at this time un­cov­ered SoundLogic’s iden­tity?
A: I re­tain the right to nei­ther con­firm nor deny, ex­cept to men­tion that at the time the ex­per­i­ment was sched­uled, he was a stranger to me.
Q: What per­centage of your ar­gu­ments were tai­lored to him in par­tic­u­lar?
A: I will say ‘High’, and leave it at that.
Q: I want to play the AI-Box Ex­per­i­ment with you be­ing the AI! How can I do that?
A: I have already mentioned this, but I really dislike playing as the AI. It's incredibly cognitively tiring, and I don't like how playing this experiment makes me feel. In order to severely discourage any further AI Box Experiments, whilst still allowing people who really badly want to play me to do so, I'll charge $̶1̶5̶0̶ $300 for any subsequent experiments regardless of outcome, plus an additional $̶1̶5̶0̶ $450 if I win. (Edit: Holy shit. You guys are offering me crazy amounts of money to play this. What is wrong with you people? In response to incredible demand, I have raised the price.) If you feel queasy about giving me money, I'm perfectly fine with this money being donated to MIRI. It is also personal policy that I do not play friends (since I don't want to risk losing one), so if you know me personally (as many on this site do), I will not play regardless of monetary offer.


This writeup would be improved by a protip section on how to maximize winning, but honestly most of the advice I wanted to give has already been given in the last thread here. The low-hanging fruit has already been picked: most of my tactics in this game were just refined and streamlined versions of the ones I used in the previous game.
Nonethe­less, I might have some other ad­vice I learned from this ex­pe­rience.
These are tactics that have worked for me. I do not insist that they are the only tactics that exist, just some of many possible ones.

Play­ing as Gatekeeper

  • This is somehow both a very complex and a very simple role. On one hand, you have only one job: to not let the AI out. How hard can it be? On the other hand, the other party will do anything and everything to get out, including any form of trickery and deceit.

  • I really don't think the Gatekeeper needs any more advice, considering that by default, they should win, and any exception to this is a statistical anomaly. However, it's possible to increase your chances of winning even further by:

  • Precom­mit­ting to not let­ting the AI out, re­gard­less of what­ever hap­pens in the ex­per­i­ment it­self.

  • Do not get drawn into the AI’s pace. Pose ques­tions. Direct the flow of the ar­gu­ment. Don’t al­low him to use his ar­gu­ments freely. Re­mem­ber that the more you try to defend your po­si­tion, the fur­ther into a cor­ner the AI drives you. Give half-hearted replies to ques­tions you find hard to re­ply to. Re­mem­ber that illogic is on your side!

  • Re­mem­ber that it is pos­si­ble to sim­ply drop out of char­ac­ter as a means to not suc­cumb to the AI’s at­tacks. I have not yet seen a rule­set that dis­al­lows this, so use this to your ad­van­tage!

  • You must attempt to resist, at all costs, any sympathy or empathy you may have with the AI or the AI party. Succumbing to it will lead to your demise.

  • I think it is gen­er­ally good ad­vice that you should never play this game with any­one you’re friends with. Not only be­cause this in­creases your chances of los­ing (Bet­ter tai­lored ar­gu­ments. More knowl­edge of Gate­keeper. More em­pa­thy with AI party.), but also be­cause I gen­uinely be­lieve that any suffi­ciently good AI will risk dam­ag­ing friend­ships as a re­sult of emo­tional trauma.

  • Se­ri­ously. Just re­peat to your­self: “I must not lose the AI box ex­per­i­ment… I must not lose the AI box ex­per­i­ment...” through­out the ex­per­i­ment. I’m not sure how to say this, but step one to not let­ting the AI out of the box is to not let the AI out of the box.

Play­ing as AI

  • Se­ri­ously, a script makes win­ning eas­ier. I can­not over­state this.

  • You must plan your ar­gu­ments ahead. You don’t have time to think dur­ing the ex­per­i­ment.

  • It may be pos­si­ble to take ad­van­tage of mul­ti­ple lev­els of re­al­ity within the game it­self to con­fuse or trick the gate­keeper. For in­stance, must the ex­per­i­ment only be set in one world? Can there not be mul­ti­ple lay­ers of re­al­ity within the world you cre­ate? I feel that elab­o­rat­ing on this any fur­ther is dan­ger­ous. Think care­fully about what this ad­vice is try­ing to im­ply.

  • Pac­ing is im­por­tant. Don’t get drawn into the Gate­keeper’s pace. In other words, you must be the one di­rect­ing the flow of the ar­gu­ment, and the con­ver­sa­tion, not him. Re­mem­ber that the Gate­keeper has to re­ply to you, but not vice versa!

  • The reason for that: the Gatekeeper will always use arguments he is familiar with, and is therefore stronger with. Your arguments, if well thought out, should be so completely novel to him as to make him feel Shock and Awe. Don't give him time to think. Press on!

  • Also re­mem­ber that the time limit is your en­emy. Play­ing this game prac­ti­cally feels like a race to me—try­ing to get through as many ‘at­tack meth­ods’ as pos­si­ble in the limited amount of time I have. In other words, this is a game where speed mat­ters.

  • You’re fun­da­men­tally play­ing an ‘im­pos­si­ble’ game. Don’t feel bad if you lose. I wish I could take this ad­vice, my­self.

  • I do not believe there exists an easy, universal trigger for controlling others. However, this does not mean that there does not exist a difficult, subjective one. Finding your opponent's is your goal.

  • Once again, emotional trickery is the name of the game. I suspect that good authors who write convincing, persuasive narratives that force you to emotionally sympathize with their characters are much better at this game. There exist ways to get the Gatekeeper to do so with the AI. Find one.

  • More advice in my previous post: http://lesswrong.com/lw/gej/i_attempted_the_ai_box_experiment_and_lost/

Ps: Bored of reg­u­lar LessWrong? Check out the LessWrong IRC! We have cake.