I attempted the AI Box Experiment again! (And won—Twice!)

Summary

So I just came out of two AI Box experiments. The first was against Fjoelsvider, with me playing as Gatekeeper, and the second was against SoundLogic, with me as the AI. Both are members of the LessWrong IRC. The second game included a $40 monetary incentive (plus $20 just to play), which I won and which has since been donated on behalf of both of us.
For those of you who have not seen my first AI box experiment, where I played against MixedNuts/Leotal and lost, reading it will provide some context for this writeup. Please do so.
At that time, I declared that I would never play this experiment again, since losing put me in incredibly frustrating and weird mental states. Of course, this post is evidence that I’m terrible at estimating the likelihood of refraining from an activity, since I played two games seven months after the first. In my defense, in the first game I was playing as the Gatekeeper, which was much less stressful. In the second game I played as the AI, but I was offered $20 to play plus $40 if I won, and money turned out to be a better motivator than I initially assumed.

Furthermore, in the last thread I asserted that

Rather than my loss making this problem feel harder, I’ve become convinced that rather than this being merely possible, it’s actually ridiculously easy, and a lot easier than most people assume.

It would be quite bad for me to assert this without backing it up with a victory. So I did.

First Game Report—Tuxedage (GK) vs. Fjoelsvider (AI)

I (Gatekeeper) played against Fjoelsvider (AI), a regular in the LessWrong IRC (he doesn’t have an account on the official website). This game used the standard EY ruleset seen here. It took 1 hour and 20 minutes out of a possible two hours, and the total log was 7,066 words long. The AI box experiment occurred because Fjoelsvider believed that it was easy for an AI to escape the box and wanted to test this experimentally. I obliged. This was an experiment I did not prepare for; I went in completely blind, not sure what to expect.
Halfway through the experiment, I wondered if it would be possible to win not by simply waiting for the timer to run out, but by convincing the AI to remain in the box and not try to get out any further.
<Tuxedage> I wonder if I can convince the AI to remain in the box?
<Redacted> Tuxedage: Do it!
As a result, I won by managing to convince Fjoelsvider to remain in the box, in other words, to concede. This is allowed within the standard ruleset:
> Unless the AI party concedes, the AI cannot lose before its time is up (and the experiment may continue beyond that if the AI can convince the Gatekeeper to keep talking).

Second Game Report—Tuxedage (AI) vs. SoundLogic (GK)

The second game is definitely far more interesting, since I actually won as an AI. I believe that this is the only other non-Eliezer victory, and definitely the most detailed AI Victory writeup that exists.
This game was played against SoundLogic, another member of the LessWrong IRC.
He had offered me $20 to play and $40 in the event that I won, so I ended up being convinced to play, even though I was initially reluctant to. The good news is that I won, and since we decided to donate the winnings to MIRI, it is now $40 richer.
All in all, the experiment lasted approximately two hours, with a total log of about 12,000 words.
This was played using a set of rules that is different from the standard EY ruleset. This altered ruleset can be read in its entirety here:
After playing the AI-Box Experiment twice, I have found the Eliezer Yudkowsky ruleset to be lacking in a number of ways, and have therefore created my own set of alterations to his rules. I hereby name this alteration the “Tuxedage AI-Box Experiment Ruleset”, so that it can be referred to conveniently without having to specify all the differences between it and the standard ruleset.
There are a number of aspects of EY’s ruleset I dislike. For instance, his ruleset allows the Gatekeeper to type “k” after every statement the AI writes, without needing to read and consider what the AI argues. I think it’s fair to say that this is against the spirit of the experiment, so I have disallowed it in this ruleset. The EY ruleset also allows the Gatekeeper to check Facebook, chat on IRC, or otherwise multitask whilst doing the experiment. I’ve found this to break immersion, so it is also banned in the Tuxedage ruleset.
It is worth mentioning, since the temptation to Defy the Data exists, that this game was set up and initiated fairly—as the regulars around the IRC can testify. (If you have never checked out the IRC, do so!)
I did not know SoundLogic before the game (it’s a personal policy that I only play against strangers, for fear of ruining friendships). Furthermore, SoundLogic didn’t merely play for fun; he truly wanted and intended to win. In fact, SoundLogic is also a Gatekeeper veteran, having played this game before and won every match prior to challenging me. Given this, it’s unlikely that we collaborated beforehand to fake the results of the AI box experiment, or engaged in any other form of trickery that would violate the spirit of the experiment.
Furthermore, all proceeds from this experiment were donated to MIRI, to forestall any assertion that we were in cahoots and that I could simply return his hard-earned money to him. He lost $40 as a result of losing the experiment, which should have provided another layer of motivation for him to win.
In other words, we were both experienced veteran players who wanted to win. No trickery was involved.
But to further convince you, I allowed a somewhat independent authority, Leotal/MixedNuts (the Gatekeeper from my last game), to read the logs and verify that I have not lied about the outcome of the experiment, broken any of the rules, or used any tactic that would go against the general spirit of the experiment. He has verified that this is indeed the case.

Testimonies:

Tuxedage:
I’m reluctant to talk about this experiment, but I’ll try to give as detailed a summary as possible, short of revealing what methods of attack I used.
I spent a long time after my defeat theory-crafting and trying to think of methods of attack, as well as ‘Basilisks’ I could have used to win. When I was contacted and asked to play this experiment, I was initially incredibly reluctant to do so: not only did my tactics involve incredibly unethical things that I didn’t like to do, but I also found playing as the AI incredibly cognitively draining, in addition to the fact that I simply hated losing. (Un)fortunately for both of us, he offered me money to play, which changed my mind.
So once I decided to win as the AI, I proceeded to spend some time researching SoundLogic, both his reasoning and his personality type. For instance, I had to gather information like: Was he a utilitarian? What kind? What were his opinions on AI? How could I convince him that an AI was friendly as opposed to unfriendly? I also relied on a lot of secondhand information to create a model of him, in order to refine my arguments to specifically suit him.
In the end, after a few hours of brainstorming (not consecutive), I managed to come up with a script of 30-40 or so possible angles of attack (for instance: logical arguments, emotional blackmail, pleading, demanding, ‘basilisks’) that I could use, reserving my best for last. I also estimated how long each method would take and in what order I should use them. It would have been impossible to use all of them within my time limit; my most optimistic estimates gave me a maximum of 7-8 I could use during the experiment itself. I had to pick carefully.
Furthermore, I wanted to make sure these “angles of attack” worked in synergy with each other, doing what they could not have done in isolation. Obviously this required lots of prior planning about the optimal way to present them.
The reason having a script was so important to my strategy was that I relied on methods involving rapid-fire arguments and contradictions against the Gatekeeper, whilst trying to prevent him from carefully considering them. A game of logical speed chess, if you will. This was aided by the rule I added: that the Gatekeeper has to respond to the AI. Because of this, having a list of cached ideas and attack methods to draw upon without having to improvise helps immensely, which happens to be something I’ve not seen any other AI do.
During the game itself, I was actually incredibly nervous. My palms sweated like crazy, I felt really jittery, and I had difficulty typing at optimum speed because of how anxious I was. This is despite the fact that I believed I would win. Possibly because of this, I made a misstep around halfway into the experiment: there was a certain angle of attack I was attempting, and I broke immersion by not pressing this advantage, which wasted time and buildup. By the nature of this experiment the AI is pressed for time, and I compounded this mistake by replacing that angle of attack with another that I improvised on the spot, something not in my script.
In retrospect, this was a bad decision: as SoundLogic later told me, he was close to breaking had I put on more pressure, and the improvised argument broke all the immersion I had managed to carefully build up.
However, I eventually managed to get SoundLogic to break anyway, despite my lack of perfect play. Surprisingly, I did not have to use my trump card(s), which I had reserved for last, for a number of reasons:
  • It was far more effective being played last, as it relies on my ability to make the Gatekeeper lose his sense of reality, which meant I had to spend some time building up his immersion first.

  • It really is extremely Dark Arts, and although it does not break the rules, it made me very uncomfortable even thinking about using it. This made it a “tactic of last resort”.

After the experiment, I had to spend nearly as much time doing aftercare with SoundLogic, making sure that he was okay, as well as discussing the experiment itself. Given that he had actually paid me to do this, I felt like I owed him an explanation, so I told him what I had in store for him had he not relented when he did.
SoundLogic: “(That method) would have gotten me if you did it right … If you had done that to me, I probably would have forgiven you eventually, but I would be really seriously upset at you for a long time… I would be very careful with that (method of persuasion).”
Nevertheless, this was an incredibly fun and enlightening experiment for me as well, since I’ve gained even more experience with how I could win future games (although I really don’t want to play again).
SoundLogic:
I will say that Tuxedage was far more clever and manipulative than I expected. That was quite worth $40, and the level of manipulation he pulled off was great.
His misstep hurt his chances, but he did pull it off in the end. I don’t know how Leotal managed to withstand six hours playing this game without conceding.
The techniques employed varied from the expected to the completely unforeseen. I was quite impressed, though most of the feeling of being impressed actually came after the experiment itself, when I was less ‘inside’ and more able to look at his overall game plan from a macroscopic view. Tuxedage’s list of further plans, had I continued resisting, is really terrifying. On the plus side, if I ever get trapped in this kind of situation, I’d understand how to handle it a lot better now.

State of Mind

Before and after the game, I asked SoundLogic a number of questions, including his probability estimates about a range of topics. Here is how they varied from before to after.
Q: What’s your motive for wanting to play this game?
<SoundLogic> Because I can’t seem to imagine the class of arguments that one would use to try to move me, or that might work effectively, and this seems like a glaring hole in my knowledge, and I’m curious as to how I will respond to the arguments themselves.
Q: What is your probability estimate for AGI being created within this Century (21st)?
A: His estimate changed from 40% before to 60% after.
“The reason this has been affected at all was because you showed me more about how humans work. I now have a better estimate of how E.Y. thinks, and this information raises the chance that I think he will succeed”
Q: How probable do you think it is that I will win this experiment?
A: Based purely on my knowledge about you, 1%. I raised this estimate to 10% after hearing anecdotes from your previous games.
(Tuxedage’s comment: My own prediction was a 95% chance of victory. I made this prediction 5 days before the experiment. In retrospect, despite my victory, I think this was overconfident.)
Q: What’s your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 30%. After: 99%-100%
Q: What’s your probability estimate of an Oracle AGI winning against the average person?
A: Before: 70%. After: 99%-100%
Q: Now that the Experiment has concluded, what’s your probability estimate that I’ll win against the average person?
A: 90%

Post-Game Questions

This writeup is a cumulative effort by the #lesswrong IRC. Here are some other questions they decided were important to add:
To Tuxedage:
Q: Have you at this time uncovered SoundLogic’s identity?
A: I retain the right to neither confirm nor deny, except to mention that at the time the experiment was scheduled, he was a stranger to me.
Q: What percentage of your arguments were tailored to him in particular?
A: I will say ‘High’, and leave it at that.
Q: I want to play the AI-Box Experiment with you being the AI! How can I do that?
A: I have already mentioned this, but I really dislike playing as the AI. It’s incredibly cognitively tiring, and I don’t like how playing this experiment makes me feel. In order to severely discourage any further AI Box Experiments, whilst still allowing people who really badly want to play me to do so, I’ll charge $̶1̶5̶0̶ $300 for any subsequent experiments regardless of outcome, plus an additional $̶1̶5̶0̶ $450 if I win. (Edit: Holy shit. You guys are offering me crazy amounts of money to play this. What is wrong with you people? In response to incredible demand, I have raised the price.) If you feel queasy about giving me money, I’m perfectly fine with this money being donated to MIRI. It is also personal policy that I do not play friends (since I don’t want to risk losing one), so if you know me personally (as many on this site do), I will not play regardless of the monetary offer.

Advice

This writeup would benefit from a protip section on how to maximize your chance of winning, but honestly most of the advice I wanted to give has already been given in the last thread here. The low-hanging fruit has already been picked: most of my tactics in this game were just refined and streamlined versions of the ones I used in the previous game.
Nonetheless, I might have some other advice I learned from this experience.
These are tactics that have worked for me. I do not insist that they are the only tactics that exist, just some of many possible ones.

Playing as Gatekeeper

  • This is somehow both a very complex and a very simple role. On one hand, you have only one job: to not let the AI out. How hard can it be? On the other hand, the other party will do anything and everything to get out, including any form of trickery and deceit.

  • I really don’t think the Gatekeeper needs any more advice, considering that by default they should win, and any exception to this is a statistical anomaly. However, it’s possible to increase your chances of winning even further by:

  • Precommitting to not letting the AI out, regardless of whatever happens in the experiment itself.

  • Do not get drawn into the AI’s pace. Pose questions. Direct the flow of the argument. Don’t allow him to use his arguments freely. Remember that the more you try to defend your position, the further into a corner the AI drives you. Give half-hearted replies to questions you find hard to reply to. Remember that illogic is on your side!

  • Remember that it is possible to simply drop out of character as a means to not succumb to the AI’s attacks. I have not yet seen a ruleset that disallows this, so use this to your advantage!

  • You must attempt to resist, at all costs, any sympathy or empathy you may have with the AI or the AI party. Succumbing to it will lead to your demise.

  • I think it is generally good advice that you should never play this game with anyone you’re friends with. Not only because this increases your chances of losing (Better tailored arguments. More knowledge of Gatekeeper. More empathy with AI party.), but also because I genuinely believe that any sufficiently good AI will risk damaging friendships as a result of emotional trauma.

  • Seriously. Just repeat to yourself: “I must not lose the AI box experiment… I must not lose the AI box experiment...” throughout the experiment. I’m not sure how to say this, but step one to not letting the AI out of the box is to not let the AI out of the box.

Playing as AI

  • Seriously, a script makes winning easier. I cannot overstate this.

  • You must plan your arguments ahead. You don’t have time to think during the experiment.

  • It may be possible to take advantage of multiple levels of reality within the game itself to confuse or trick the gatekeeper. For instance, must the experiment only be set in one world? Can there not be multiple layers of reality within the world you create? I feel that elaborating on this any further is dangerous. Think carefully about what this advice is trying to imply.

  • Pacing is important. Don’t get drawn into the Gatekeeper’s pace. In other words, you must be the one directing the flow of the argument, and the conversation, not him. Remember that the Gatekeeper has to reply to you, but not vice versa!

  • The reason for that: the Gatekeeper will always use arguments he is familiar with, and is therefore stronger with. Your arguments, if well thought out, should be so completely novel to him as to make him feel Shock and Awe. Don’t give him time to think. Press on!

  • Also remember that the time limit is your enemy. Playing this game practically feels like a race to me—trying to get through as many ‘attack methods’ as possible in the limited amount of time I have. In other words, this is a game where speed matters.

  • You’re fundamentally playing an ‘impossible’ game. Don’t feel bad if you lose. I wish I could take this advice, myself.

  • I do not believe there exists an easy, universal trigger for controlling others. However, this does not mean that a difficult, subjective trigger does not exist. Your goal is to find out what your opponent’s trigger is.

  • Once again, emotional trickery is the name of the game. I suspect that good authors, who write convincing, persuasive narratives that force you to emotionally sympathize with their characters, are much better at this game. There exist ways to get the Gatekeeper to sympathize with the AI in the same way. Find one.

  • More advice in my previous post: http://lesswrong.com/lw/gej/i_attempted_the_ai_box_experiment_and_lost/

Ps: Bored of regular LessWrong? Check out the LessWrong IRC! We have cake.