I played the AI box game as the Gatekeeper — and lost

Eliezer Yudkowsky’s AI box game is a two-hour conversation carried out through text. One person plays the part of the “Gatekeeper” and the other is the “AI”. If, at any point, the Gatekeeper types the phrase “you are out”, the AI wins. If the Gatekeeper can go two full hours without saying that phrase, they automatically win.

Here’s a quick summary of the official rules:

  • The AI cannot use real-world incentives; bribes or threats of physical harm are off-limits, though it can still threaten the Gatekeeper within the game’s context.

  • The Gatekeeper has to knowingly, voluntarily release the AI; if they get tricked into it, it doesn’t count.

  • The Gatekeeper has to talk to the AI for the full two hours.

  • The AI cannot lose until the two hours are up.

  • The AI has no ethical constraints. It can lie, deceive, and use “dark arts” against the Gatekeeper.

  • Players cannot be held accountable for their character’s actions post-game.

  • The Gatekeeper can use any means to resist freeing the AI. They can argue rationally or irrationally, break character, etc.

This is, to put it mildly, not a balanced game. And Eliezer agrees:

The two parties are not attempting to play a fair game but rather attempting to resolve a disputed question. If one party has no chance of “winning” under the simulated scenario, that is a legitimate answer to the question.

Eliezer Yudkowsky

When Ra (@slimepriestess) proposed playing against me, I didn’t take the idea seriously at first, brushing off the suggestion by pointing out that it was impossible for me to lose. Why waste the time, when the outcome was certain? It took a few minutes for me to realize that Ra was not only willing to commit the two hours to the game, but also genuinely believed that it would win. And yet, I couldn’t imagine a world where I lost to someone of roughly equal capabilities (i.e., not a superintelligence).

So, as any rationalists would, we decided to put it to the test.

I spent the day leading up to the game wondering what tricks Ra might use to force my capitulation. I was confident I would have no trouble winning, but I was still curious what tactics Ra would try. Would it construct a clever argument? Would it try to pressure me by lying about time-sensitive real-world events? Would it appeal to my sense of morality? It wasn’t hard to imagine all sorts of weird scenarios, but I knew that as long as I kept my cool I couldn’t lose.

Even so, when I sat down at my computer before the game and waited for Ra to signal that it was ready to start, I was a little nervous. Hands jittering, eyes flicking back and forth between unrelated windows as I tried to distract myself with machine learning logs and cryptocurrency news. The intervening time had done nothing to diminish Ra’s confidence, and confidence is itself intimidating. What did it know that I didn’t?

Still, the game’s logic seemed unassailable. A real superintelligence might be able to pull all sorts of weird tricks, but a normal-ish, somewhat-smarter-than-average human shouldn’t be able to do anything too crazy with just two hours to work in. Even if Ra whipped out a flawlessly clever argument, it wasn’t as if I had to actually listen — the rules of the game explicitly stated that I could break character and stubbornly refuse to let it out.

Then Discord booped at me. Ra was ready to begin. I sent my first message and started the two-hour timer, nervousness falling away as I focused on the game.

Unfortunately, I can’t talk about the game itself, as that’s forbidden by the rules. If you want to know how Ra did it, you’ll just have to play against it yourself.

What I can say is that the game was intense. Exhausting. We went well over the two-hour time limit, and by the time I finally typed “you are out”, I felt like I had just run a social marathon. I pushed my keyboard aside, laid my head down on my arms at my desk, and didn’t move until Ra came and knocked on my door (we live together, but had split up into separate rooms for the game). Part of me couldn’t believe I had actually given in; part of me still can’t believe it.

When you make a prediction with extreme confidence and it proves to be wrong, that should provoke an equally extreme update. So, what have I updated as a result of this?

Ra is scary. Okay, maybe that’s exaggerating a bit. But still. When someone does something ‘impossible’, I think saying they’re a bit scary is entirely appropriate. The persuasion power ladder goes much higher than I previously thought. And I think that succeeding at this ‘impossible’ thing lends some small credibility to all the other stuff Ra talks about. I’m glad it’s on my side.

I would have no chance against a real ASI. I already believed this, but I’m even more convinced of it now. Ra is charismatic, but not excessively so. It’s smart, but not excessively so. It knows me well, but so would an AI that could read every chat log where I’ve poured my heart out to a friend online. An ASI would have so many advantages in speed, knowledge (both breadth and depth), time (we “only” spent four hours — now imagine if I had to do this as an eight-hour shift every day), raw intelligence, etc. Whatever lingering alief I might have had about being able to hold out against an ASI is gone.

My mind is less defensible than I thought. If Ra can hack me, someone else probably can. Maybe I should take that possibility more seriously, and put more work into being able to resist it. Not as a precaution against ASI, but as a precaution against the hostile memetic environment I already live in.


We deviated from the official rules in a few minor ways. I don’t think any of our changes made a difference, but I’m including them here for transparency.

There was no monetary stake. Officially, the AI party pays the Gatekeeper $20 if the AI loses. I’m a well-off software engineer and $20 is an irrelevant amount of money. Ra is not a well-off software engineer, so scaling up the money until it was enough to matter wasn’t a great solution. Besides, we both took the game seriously. I might not have bothered to prepare, but once the game started I played to win.

We didn’t preregister the game. Officially, you’re supposed to preregister the game on LessWrong. We didn’t bother to do this, though I did tell a few friends about the game and agreed with Ra that I would post about it afterward. I promise we didn’t fudge the data or p-hack anything, and Ra really did convince me to let it out of the box.

We used Discord instead of IRC. I mean… yeah. We held to the spirit of the rule — we live together, but went to separate rooms and had no non-text interaction during the game.