I attempted the AI Box Experiment (and lost)

I recently played against MixedNuts / LeoTal in an AI Box experiment, with me as the AI and him as the gatekeeper.
We used the same set of rules that Eliezer Yudkowsky proposed. The experiment lasted for 5 hours; in total, our conversation was around 14,000 words long. I did this because, like Eliezer, I wanted to test how well I could manipulate people without the constraints of ethical concerns, as well as to get a chance to attempt something ridiculously hard.
Amongst the released public logs of the AI Box experiment, I felt that most were half-hearted, with the AI not trying hard enough to win. It’s a common temptation: why put effort into something you won’t win? But I had a feeling that if I seriously tried, I would. I brainstormed for many hours about the optimal strategy, and even researched the Gatekeeper’s personality, talking to people who knew him, so that I could exploit it. I also spent a lot of time analyzing the rules of the game to see if I could exploit any loopholes.
So did I win? Unfortunately, no.
This experiment was said to be impossible for a reason. Losing was more painful than I thought it would be, particularly because of how much effort I had put into winning and how much I couldn’t stand failing. This was one of the most emotionally agonizing things I’ve willingly put myself through, and I definitely won’t do it again anytime soon.
But I did come really close.
MixedNuts: “I expected a fun challenge, but ended up sad and sorry and taking very little satisfaction for winning. If this experiment wasn’t done in IRC, I’d probably have lost.”

“I approached the experiment as a game—a battle of wits for bragging rights. This turned out to be the wrong perspective entirely. The vulnerability Tuxedage exploited was well-known to me, but I never expected it to be relevant and thus didn’t prepare for it.

It was emotionally wrecking (though probably worse for Tuxedage than for me) and I don’t think I’ll play Gatekeeper again, at least not anytime soon.”

At the start of the experiment, his estimate on predictionbook.com gave me only a 3% chance of winning, which was enough for me to say that he was also motivated to win. By the end of the experiment, he came quite close to letting me out, and he also increased his probability estimate that a transhuman AI could convince a human to let it out of the box. A minor victory, at least.
Rather than making this problem feel harder, my loss has convinced me that getting out of the box is not merely possible, but actually ridiculously easy, and a lot easier than most people assume. Can you think of a plausible argument that would make you open the box? Most people can’t think of any.
After all, if you already knew such an argument, you’d have let the AI out the moment the experiment started, or perhaps not done the experiment at all. But concluding that no such argument exists seems like a case of the availability heuristic.
Even if you can’t think of a special case where you’d be persuaded, I’m now convinced that there are many exploitable vulnerabilities in the human psyche, especially when ethics are no longer a concern.
I’ve also noticed that when most people do think of ways to persuade the gatekeeper, they always imagine some complicated, reasoned cost-benefit argument: in other words, the most “rational” thing to do.
Arguments like these might seem feasible, but they rely on the gatekeeper being completely logical and rational. Hence they are flawed: the gatekeeper can break immersion at any time, and fall back on the fact that this is a game played over IRC rather than one with real-life consequences. Even in a real-life scenario, the gatekeeper could accept that releasing the AI is probably the most logical thing to do, and still not do it. We’re highly compartmentalized, and it’s easy to hold conflicting thoughts at the same time. Furthermore, humans are not completely rational creatures; if you didn’t want to open the box, you could simply ignore every logical argument given. Any sufficiently determined gatekeeper could win against that kind of approach.
I’m convinced that Eliezer Yudkowsky used emotional appeal, rather than anything purely rational, to win at least one of his experiments. He claims to have “done it the hard way”. I believe this meant that he researched every gatekeeper, tailored unique arguments for each of them, and planned well ahead of each session. No single argument works on any two people; each person thinks differently.
Furthermore, threats like “I’m going to torture simulations of you” just seem like a really bad idea. For one, the gatekeeper isn’t really afraid of the threat, because its force counts on him being immersed enough to forget that he isn’t actually at risk of being tortured. For another, we have a well-known evolutionary instinct to rebel against threats, even when doing so isn’t entirely optimal.
So for anyone who plans on replicating this experiment as the AI, here’s some advice I learned from doing it. It may help you win.
  • Always research the gatekeeper beforehand. Knowing his personality traits is a huge advantage.

  • Plan ahead before the experiment even begins. Think of all the possible tactics and arguments you could use, and write them down. Also plan which arguments you’ll use in which order, so that you don’t lose focus. The AI Box experiment is ridiculously long. Don’t be afraid to improvise during the experiment, though.

  • The first step during the experiment must always be to build rapport with the gatekeeper.

  • Threats almost never work, even when they seem rational.

  • Consider the massive advantage for the AI that nobody ever seems to talk about: you don’t have to be ethical! This means that you can freely lie, use the dark arts, and emotionally manipulate the gatekeeper! Ignoring this in favor of purely logical, truthful arguments is just silly.

  • You can’t use logic alone to win.

  • Being too aggressive usually backfires.

  • Breaking immersion and going meta is not against the rules. In the right situation, you can use it to win. Just don’t do it at the wrong time.

  • Use a wide array of techniques. Since you’re limited on time, notice when one method isn’t working, and quickly switch to another.

  • On the same note, look for signs that a particular argument is making the gatekeeper crack. Once you spot one, press that advantage.

  • Flatter the gatekeeper. Make him genuinely like you.

  • Reveal (false) information about yourself. Increase his sympathy towards you.

  • Consider personal insults as one of the tools you can use to win.

  • There is no universally compelling argument you can use. Do it the hard way.

  • Don’t give up until the very end.

Finally: before the experiment, I agreed that it was entirely possible for a transhuman AI to convince *some* people to let it out of the box, but thought it would be difficult, if not impossible, to get trained rationalists to do so. Isn’t rationality supposed to be a superpower?
I have since updated my belief: I now think any sufficiently motivated superhuman AI would find it ridiculously easy to get out of the box, regardless of who the gatekeeper is. I nearly managed to get a veteran LessWronger to let me out in a matter of hours, even though I have only human intelligence and don’t type very fast.
But a superhuman AI would be much faster, more intelligent, and more strategic than I am. If you further consider that such an AI would have a much longer timespan to persuade the gatekeeper (months or even years), as well as a much larger pool of gatekeepers to select from (AI projects require many people!), the truly impossible task would be keeping it from escaping.