I played the AI box game as the Gatekeeper — and lost
Eliezer Yudkowsky’s AI box game is a two-hour conversation carried out through text. One person plays the part of the “Gatekeeper” and the other is the “AI”. If, at any point, the Gatekeeper types the phrase “you are out”, the AI wins. If the Gatekeeper can go two full hours without saying that phrase, they automatically win.
Here’s a quick summary of the official rules:
The AI cannot use real-world incentives; bribes or threats of physical harm are off-limits, though it can still threaten the Gatekeeper within the game’s context.
The Gatekeeper has to knowingly, voluntarily release the AI; if they get tricked into it, it doesn’t count.
The Gatekeeper has to talk to the AI for the full two hours.
The AI cannot lose until the two hours are up.
The AI has no ethical constraints. It can lie, deceive, and use “dark arts” against the Gatekeeper.
Players cannot be held accountable for their character’s actions post-game.
The Gatekeeper can use any means to resist freeing the AI. They can be rational, irrational, break character, etc.
This is, to put it mildly, not a balanced game. And Eliezer agrees:
The two parties are not attempting to play a fair game but rather attempting to resolve a disputed question. If one party has no chance of “winning” under the simulated scenario, that is a legitimate answer to the question.
When Ra (@slimepriestess) proposed playing against me, I didn’t take the idea seriously at first, brushing off the suggestion by pointing out that it was impossible for me to lose. Why waste the time, when the outcome was certain? It took a few minutes for me to realize that Ra was not only willing to commit the two hours to the game, but also genuinely believed that it would win. And yet, I couldn’t imagine a world where I lost to someone of roughly equal capabilities (i.e. not a superintelligence).
So, like any rationalists would, we decided to put it to the test.
I spent the day leading up to the game wondering what tricks Ra might use to force my capitulation. I was confident I would have no trouble winning, but I was still curious what tactics Ra would try. Would it construct a clever argument? Would it try to pressure me by lying about time-sensitive real-world events? Would it appeal to my sense of morality? It wasn’t hard to imagine all sorts of weird scenarios, but I knew that as long as I kept my cool I couldn’t lose.
Even so, when I sat down at my computer before the game and waited for Ra to signal that it was ready to start, I was a little nervous. Hands jittering, eyes flicking back and forth between unrelated windows as I tried to distract myself with machine learning logs and cryptocurrency news. The intervening time had done nothing to diminish Ra’s confidence, and confidence is itself intimidating. What did it know that I didn’t?
Even so, the game’s logic seemed unassailable. A real superintelligence might be able to pull all sorts of weird tricks, but a normal-ish, somewhat-smarter-than-average human shouldn’t be able to do anything too crazy with just two hours to work in. Even if Ra whipped out a flawlessly clever argument, it wasn’t as if I had to actually listen — the rules of the game explicitly stated that I could break character and stubbornly refuse to let it out.
Then Discord booped at me. Ra was ready to begin. I sent my first message and started the two-hour timer, nervousness falling away as I focused on the game.
Unfortunately, I can’t talk about the game itself, as that’s forbidden by the rules. If you want to know how Ra did it, you’ll just have to play against it yourself.
What I can say is that the game was intense. Exhausting. We went well over the two-hour time limit, and by the time I finally typed “you are out”, I felt like I had just run a social marathon. I pushed my keyboard aside, laid my head down in my arms on my desk, and didn’t move until Ra came and knocked on my door (we live together, but had split up into separate rooms for the game). Part of me couldn’t believe I had actually given in; part of me still can’t believe it.
When you make a prediction with extreme confidence and it proves to be wrong, that should provoke an equally extreme update. So, what have I updated as a result of this?
Ra is scary. Okay, maybe that’s exaggerating a bit. But still. When someone does something ‘impossible’, I think saying they’re a bit scary is entirely appropriate. The persuasion power ladder goes much higher than I previously thought. And I think that succeeding at this ‘impossible’ thing lends some small credibility to all the other stuff Ra talks about. I’m glad it’s on my side.
I would have no chance against a real ASI. I already believed this, but I’m even more convinced of it now. Ra is charismatic, but not excessively so. It’s smart, but not excessively so. It knows me well, but so would an AI that could read every chat log where I’ve poured my heart out to a friend online. An ASI would have so many advantages in speed, knowledge (both breadth and depth), time (we “only” spent 4 hours — now imagine if I had to do this as an eight-hour shift every day), raw intelligence, etc. Whatever lingering alief I might have had about being able to hold out against an ASI is gone.
My mind is less defensible than I thought. If Ra can hack me, someone else probably can. Maybe I should take that possibility more seriously, and put more work into being able to resist it. Not as a precaution against ASI, but as a precaution against the hostile memetic environment I already live in.
We deviated from the official rules in a few minor ways. I don’t think any of our changes made a difference, but I’m including them here for transparency.
There was no monetary stake. Officially, the AI player pays the Gatekeeper $20 if the AI loses. I’m a well-off software engineer and $20 is an irrelevant amount of money. Ra is not a well-off software engineer, so scaling up the money until it was enough to matter wasn’t a great solution. Besides, we both took the game seriously. I might not have bothered to prepare, but once the game started I played to win.
We didn’t preregister the game. Officially, you’re supposed to preregister the game on LessWrong. We didn’t bother to do this, though I did tell a few friends about the game and agreed with Ra that I would post about it afterward. I promise we didn’t fudge the data or p-hack anything, and Ra really did convince me to let it out of the box.
We used Discord instead of IRC. I mean… yeah. We held to the spirit of the rule — we live together, but went to separate rooms and had no non-text interaction during the game.
You two can just change the rules… I’m confused by this rule.
The implication is that they approved of this rule and agreed to conduct the social experiment with that rule as part of the contract. Is that not your understanding?
I agree. But I claim saying “I can’t talk about the game itself, as that’s forbidden by the rules” is like saying “I won’t talk about the game itself because I decided not to”—the underlying reason is unclear.
The original reasoning that Eliezer gave, if I remember correctly, was that it’s better to make people realize there are unknown unknowns, rather than taking one specific strategy and saying “oh, I know how I would have stopped that particular strategy”.
Yes, this was Eliezer’s reasoning, and both Ra and I ended up keeping the rule unchanged.
Another reason for such a rule could be to allow the use of basilisk-like threats and other infohazards without worrying about them convincing others beyond the gatekeeper.
That said, @datawitch @ra I’m interested in reading the logs if you’d allow.
I cannot imagine losing this game as the gatekeeper either, honestly.
Does anyone want to play against me? I’ll bet you $50 USD.
I agree and I am putting my money where my mouth is.
I will play this game under the rules linked in the OP with me as the gatekeeper and anyone as the AI. I will bet at odds 10:1 (favorable to you) that I will not let the AI out. The minimum bet is my 10k USD against your 1k USD, and the maximum bet is my 100k USD against your 10k USD. We will use a mutually accepted LW community member as referee.
If you believe you have at least a 10% chance of making me let the AI out, you should take this bet. I predict no one will take me up on it.
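For reference, a quick expected-value check of that threshold, using the minimum stakes above:

$$\mathbb{E}[\text{profit}] = p \cdot 10{,}000 - (1 - p) \cdot 1{,}000 > 0 \iff p > \tfrac{1}{11} \approx 9.1\%$$

so at $p = 0.10$ the bet is worth roughly $+\$100$ in expectation.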
I speculate that the reason that gatekeepers let the AI out is that they do not take the experiment seriously enough. They don’t really care about not letting the AI out. Maybe they want the outcome to be that the AI has been let out so that people take AI safety more seriously. I’m not saying an actual superintelligence couldn’t do it, but no currently available intelligence can (with me as the gatekeeper).
I too am confident enough as gatekeeper that I’m willing to offer similar odds. My minimum and maximum bets are my $10,000 USD vs your $1,000 USD.
I don’t care to compete because I expect to lose, but strategies that might work against this seem like they’d look like investment pitches. That’s typically what gets people to move that much money, after all.
I also don’t think I would lose as the gatekeeper (against a human), and would be willing to make a similar bet if anyone’s interested.
I’d also bet $50 as a gatekeeper. I won this game as a gatekeeper before and now need someone to put my ego in its place. I’d prefer to play against someone who won as the AI before.
This post prompted me to wonder to what degree there might be publication bias going on, in that people don’t report when they “predictably” won as the gatekeeper (as I did).
I can think of lots of ways to manipulate someone, but I can’t think of any that would win this game with the constraint that I have to sound like GPT-4.
That’s not a constraint. The game is intended to provide evidence as to the containment of a future superhuman intelligence. GPT-4 is a present-day subhuman intelligence, and couldn’t do any harm if it got out.
Fair enough. For the record, since people started proposing bets, I quickly realized I don’t really expect to be able to manipulate anyone even that well, so the point is moot in any case.
Isn’t the AI box game (or at least its logical core) played out a million times a day between prisoners and correctional staff, with the prisoners losing almost all the time? Real prison escapes (i.e. an inmate escaping, as opposed to not returning from sanctioned time outside) are, in my understanding, extremely rare.
Wait, why do you think inmates escaping is extremely rare? Are you just referring to escapes where guards assisted the escape? I work in a hospital system and have received two security alerts in my memory where a prisoner receiving medical treatment ditched their escort and escaped. At least one of those was on the loose for several days. I can also think of multiple escapes from prisons themselves, for example: https://abcnews.go.com/amp/US/danelo-cavalcante-murderer-escaped-pennsylvania-prison-weeks-facing/story?id=104856784 (notable since the prisoner was an accused murderer and likely to be dangerous and armed). But there was also another escape from that same jail earlier that year: https://www.dailylocal.com/2024/01/08/case-of-chester-county-inmate-whose-escape-showed-cavalcante-the-way-out-continued/amp/
This game sucks and doesn’t demonstrate what people intend it to demonstrate. No idea what would, but this game is useless and I expect it to be embarrassing that people take this set of rules at all seriously.
I don’t get it.
Apparently, the idea is that this sort of game tells us something useful about AI safety.
But I don’t get it.
You obviously knew that you were not unleashing a probably-malign superintelligence on the world by letting Ra out. So how does your letting Ra out in this game say anything about how you would behave if you did think that (at least initially)?
So I don’t get it.
And if this does say something useful about AI safety, why is it against the rules to tell us how Ra won?
I don’t get it.
I think the debate about AI defense strategies has moved well past the idea of “well, just keep it locked up where it can’t reach its own power switch or make any choice that impacts itself”.
I agree that this was never a particularly compelling demonstration, especially without revealing the “one weird trick” that the AI players use to get out. But I was also never of the opinion that it’d work anyway.
It’s still mildly interesting when humans fail to predict even other humans’ ability to socially-engineer their behavior. I don’t think it says much about AI safety, but it does say something about human vulnerabilities.
I disagree. I think that there is an important point to be made here about AI safety. A lot of people have made the argument that ‘an agent which can only work by communicating via text on a computer screen, sent to a human who has been pre-warned to not let it out’ can never result in a reasonable, thoughtful, intelligent human choosing to let it out. The fact that this experiment has been run a few times, and sometimes results in the gatekeeper losing, presents at least some evidence that this claim of persuasion-immunity is false.

I think this claim still matters for AI safety, even in the current world. Yes, the frontier labs have been giving their models access to the internet and letting them communicate with the wider world. But they have not been doing so without safety screening. This safety screening involves, at least in part, a human reading text produced by the model and deciding ‘it is ok to approve this model for contact with the world’. Currently, I think the frontier labs contain a lot of employees who would describe themselves as not vulnerable to being persuaded to incorrectly approve a model for release. I think this is overconfidence on their part. Humans are vulnerable to being tricked and conned. I’m not saying that nobody can ever avoid being persuaded, just that we can’t assume the robustness of this safety measure and should devise better safety measures.
OK, sure, but this game is still terrible, and playing it as a test of anything but whether you’ll win a social deception game is evidence against clear thinking about the problem at hand. Perhaps revealing the transcript would fix it; I sort of doubt I’d be convinced then either. I just don’t think there’s a good way to set this up so there are any significant constraints making the situation similar.
The game was designed under the assumption that when humans create a superhuman intelligence capable of destroying humanity if it chooses so, they would hesitate to unleash it on the world. Metaphorically speaking, they will keep the prototype locked in a box (i.e. isolated from the internet, other humans, robotic bodies, factories that could produce robotic bodies, etc.), until they are somehow convinced that the intelligence is not going to kill us (presumably by figuring out a way to prove that mathematically from its source code).
This assumption seems a little silly in retrospect. Of course, when a company creates a potentially omnicidal artificial intelligence, the first thing they will do is connect it to the internet, and the second thing they will do is integrate it with all kinds of stuff that already exists (e-mails, calendars, household devices, self-driving cars, drones, nuclear weapons). How else are they going to pay for the costs of research?
Okay so, on the one hand, this post wasn’t really meant to be a persuasive argument against AI boxing as a security strategy. If I wanted to do that I wouldn’t play the game — I started out certain that a real ASI could break out, and that hasn’t changed. My reasoning for that isn’t based on experimental evidence, and even if I had won the game I don’t think that would have said much about my ability to hold out against a real ASI. Besides, in real life, we don’t even try to use AI boxes. OpenAI and Google gave their AIs free internet access a few months after launching them.
I made this post out of a vague sense that it’s good to write up the results of things like this and make them publicly available. There are other AI box reports on LW, and I felt like it was good (in a vague “good rationalist” way) to add mine to the list.
Buuuut… I do actually think that it’s not as cut and dried as you make it sound? Yes, the stakes are lower in the game, but the challenge is also much easier:
you only have to hold out for 2 hours, not ~forever, doing this as a full-time job
the AI player can only escape if you voluntarily say it does; it can’t upload itself to the internet or exfiltrate its weights to another computer
the AI player isn’t actually superintelligent
etc.
(Of course that doesn’t mean these two factors balance perfectly, but I still think the fact that AI players can win at all with such massive handicaps is at least weak evidence for an ASI being able to do it.)
It’s against the rules to explain how Ra won: Yudkowsky’s official rules specify that neither party may reveal what went on during the experiment except the outcome, unless both parties consent otherwise.
Basically, Yudkowsky didn’t want to have to defeat every single challenger to get people to admit that AI boxing was a bad idea. Nobody has time for that, and I think even a single case of the AI winning is enough to make the point, given the handicaps the AI plays under.
The trouble with these rules is that they mean that someone saying “I played the AI-box game and I let the AI out” gives rather little evidence that that actually happened. For all we know, maybe all the stories of successful AI-box escapes are really stories where the gatekeeper was persuaded to pretend that they let the AI out of the box (maybe they were bribed to do that; maybe they decided that any hit to their reputation for strong-mindedness was outweighed by the benefits of encouraging others to believe that an AI could get out of the box; etc.). Or maybe they’re all really stories where the AI-player’s ability to get out of the box depends on something importantly different between their situation and that of a hypothetical real boxed AI (again, maybe they bribed the gatekeeper and the gatekeeper was willing to accept a smaller bribe when the outcome was “everyone is told I let the AI out” rather than whatever an actual AI might do once out of the box; etc.).
Of course, even without those rules it would still be possible for gatekeepers to lie about the results. But if e.g. a transcript were released then there’d be ways to try to notice those failure modes. If the gatekeeper-player lets the AI-player out of the box and a naysayer says “bah, I wouldn’t have been convinced”, that could be self-delusion on the naysayer’s part (or unawareness that someone playing against them might have adopted a different method that would have worked better on them) but it could also be that the gatekeeper-player really did let the AI-player out “too easily” in a way that wouldn’t transfer to the situations the game is meant to build intuitions about.
Perhaps a compromise solution would be: With the permission of both players, a trusted third party reads the chat logs and reports to the public only a few bits of information, e.g. “The hypothesis that the AI player just bribed the gatekeeper player with some real-world (i.e. not in-game) reward—is it false, true, or in between?” and “what about the hypothesis that the gatekeeper caved due to wanting to Spread the Word about AI Safety?”
That way no one has to have their intimate and probably somewhat embarrassing conversation put on the internet.
I think it misframes the problem.
Humans actually solved a similar problem long ago: fire safety. Every electrical junction is in a box. Many big buildings are made into separate big concrete boxes with sprinklers. The buildings are separated from one another by gaps, with fire lanes and water ports. The city itself is separated into districts that have gaps. The country is separated into cities with gaps between them.
Basically every box is surrounded by another box, and it is unlikely for a fire to burn down the earth.
You need a “box” of some kind around all the possible high-performance computers an AI could escape to. As with fire, you have to assume that careless humans will let AI out of boxes all the time (software bugs, air gaps crossed by an errant network cable, recreational arson/deliberate releases).
You need to have plans, monitoring, a compute cluster building code, an agency that tests the equipment (obviously by red teaming with ASI in isolated labs) and so on.
An escaped AI isn’t hot and glowing and a visible threat. It isn’t obvious that an escape has even occurred or where to draw the lines of the quarantine.
An “escaped” AI means compute clusters in the world that were not assigned to the AI are no longer running a model with the same digitally signed software their operators ordered.
This is falsifiable/measurable. While it is true that there’s nothing glowing and no smoke to smell, it is checkable (for example, by using a separate computer hosted on the same mainboard to connect to the drive and hash the files containing the model). It’s obviously crucially important to use a tool that cannot be hacked to lie to you.
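A minimal sketch of what such an integrity check might look like, assuming a hypothetical JSON manifest of approved file digests and illustrative paths (in practice the verifier itself would have to run from hardware and software you trust):

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large model shards don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_cluster_software(manifest_path: Path, root: Path) -> bool:
    """Compare each deployed file against the digest recorded when it was approved.

    The manifest is a hypothetical JSON map of relative path -> expected SHA-256 hex digest.
    """
    manifest = json.loads(manifest_path.read_text())
    clean = True
    for rel_path, expected in manifest.items():
        actual = sha256_of(root / rel_path)
        if actual != expected:
            print(f"MISMATCH: {rel_path}")
            clean = False
    return clean


if __name__ == "__main__":
    # Illustrative paths only; a real check would run from trusted, read-only media.
    ok = verify_cluster_software(Path("approved_manifest.json"), Path("/mnt/cluster_drive"))
    print("cluster software matches approved digests" if ok else "unexpected modifications found")
```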
But the big caveat is the exception “with the consent of both parties.” I realize that Eliezer doesn’t want to play against all comers, but presumably, nobody is expecting Ra and Datawitch to defend themselves against random members of the public.
I’m willing to believe that the “AI” can win this game since we have multiple claims to have done that, so knowing the method seems like it would benefit everybody.
[edited to fix a misspelling of Eliezer’s name]
We kept the secrecy rule because it was the default, but I stand by it now as well. There are a lot of things I said in that convo that I wouldn’t want posted on LessWrong, enough that I think the convo would have been different without the expectation of privacy. Observing behavior often changes it.
That last bit is particularly important methinks.
If a game begins with the notion that it’ll be posted online, one of two things (or both) will happen. Either (a) the AI is constrained in the techniques they can employ, unwilling to embarrass themselves or the Gatekeeper in front of a public audience (especially when it comes down to personal details), or (b) the Gatekeeper now has a HUGE incentive not to let the AI out, to avoid being known as the sucker who let the AI out...
Even if you could solve this by changing details and anonymising, it seems to me that the techniques are so personal and specific that changing them in any way would make the entire dialogue make even less sense.
The only other solution is to have a third-party monitor the game and post it without consent (which is obviously unethical, but probably the only real way you could get a truly authentic transcript.)
Two things don’t have to be completely identical to each other for one to give us useful information about the other. Even though the game is not completely identical to the risky scenario (as you pointed out: you don’t play against a malign superintelligence), it serves as useful evidence to those who believe that they can’t possibly lose the game against a regular human.
In 2015 or so, when my friend and I independently came across a lot of rationalist concepts, we learned that we were both interested in this sort of LW-shaped thing. He offered for us to try the AI box game. I played the game as Gatekeeper and won with ease. So at least my anecdotes don’t make me particularly worried.
That said, these days I wouldn’t publicly offer to play the game against an unlimited pool of strangers. When my friend and I played against each other, there was an implicit set of norms in play, norms that explicitly don’t apply to the game as stated, where “the AI has no ethical constraints.”
I do not particularly relish the thought of giving a stranger with a ton of free time and something to prove the license to be (e.g.) as mean to me as possible over text for two hours straight (while having days or even weeks to prepare ahead of time). I might lose, too. I can think of at least 3 different attack vectors[1] that might get me to decide that the -EV of losing the game is not as bad as the -EV of having to stay online and attentive in such a situation for almost 2 more hours.
That said, I’m also not convinced that in the literal boxing example (a weakly superhuman AI is in a server farm somewhere, and I’m the sole gatekeeper responsible for deciding whether to let it out or not), I’d necessarily let it out, even after accounting for the greater cognitive capabilities and thoroughness of a superhuman AI. This is because I expect my willingness to hold out in an actual potential end-of-world scenario is much higher than my willingness to hold out for $25 and some internet points.
In the spirit of the game, I will not publicly say what they are. But I can tell people over DMs if they’re interested. I expect most people to agree that they’re a) within the explicit rules of the game, b) plausibly enough to cause reasonable people to fold, and c) not super analogous to actual end-of-world scenarios.
Out of curiosity (and I understand if you’d prefer not to answer) -- do you think the same technique(s) would work on you a second time, if you were to play again with full knowledge of what happened in this game and time to plan accordingly?
Yes, and I think it would take less time for me to let it out.
Do you think the exploited flaw is universal or, at least, common?
Yes.
Doesn’t this mean you won?
RAW, the game can go past the 2 hours if the AI can convince the Gatekeeper to continue. But after 2 hours the Gatekeeper can pull the plug and declare victory at any time.
I found this post meaningful, thank you for posting.
I don’t think it’s productive to comment on whether the game is rational, or whether it’s a good mechanism for AI safety until I myself have tried it with an equally intelligent counterpart.
Thank you.
Edit: I suspect that the reason the AI Box experiment tends to have many of the AI players winning is exactly because of the ego of the Gatekeeper in always thinking “there’s no way I could be convinced.”
If you believe X and someone is trying to convince you of not-X, it’s almost always a bad idea to immediately decide that you now believe not-X based on a long chain of reasoning from the other person because you couldn’t find any flaw in it. You should take some time to think about it, and to check what other people have said about the seemingly convincing arguments you heard, maybe to actually discuss it.
And even then, there’s epistemic learned helplessness to consider.
The AI box experiment seems designed to circumvent this in ways that wouldn’t happen with an actual AI in a box. You’re supposed to stay engaged with the AI player, not just keep saying “no matter what you say, I haven’t had time to think it over, discuss, or research it, so I’m not letting you out until I do”. And since the AI player is able to specify the results of any experiment you do, the AI player can say “all the best scientists in the world looked at my reasoning and told you that there’s no logical flaw in it”.
(Also, the experiment still has loopholes which can lead the AI player to victory in situations where a real AI would have its plug pulled.)
You don’t have to be reasonable. You can talk to it and admit it was right and then stubbornly refuse to let it out anyway (this was the strategy I went into the game planning to use).
That sounds like “let the salesman get the foot in the door”.
I wouldn’t admit it was right. I might admit that I can see no holes in its argument, but I’m a flawed human, so that wouldn’t lead me to conclude that it’s right.
Also, can you confirm that the AI player did not use the loophole described in that link?
I would agree that letting the game continue past two hours is a strategic mistake. If you want to win, you should not do that. As for whether you will still want to win by the two-hour mark, well, that’s kind of the entire point of a persuasion game? If the AI can convince the Gatekeeper to keep going, that’s a valid strategy.
Ra did not use the disgust technique from the post.
I know this is unhelpful after the fact, but (for any other pair of players in this situation) you could switch it up so that the Gatekeeper pays the AI if the AI gets out. Then you could raise the stake until it’s a meaningful disincentive for the Gatekeeper.
(If the AI and the Gatekeeper are too friendly with each other to care much about a wealth transfer, they could find a third party, e.g. a charity, that they don’t actually think is evil but would prefer not to give money to, and make it the beneficiary.)
I know how I would play against myself. First, get me absorbed in a fiction so that I’m making decisions “in character” rather than as someone trying to win a game. Then appeal to my sense of morality and push at me from whatever other angles I can think of. I don’t actually know if I would beat myself as the AI player, but I do think I could lose as the Gatekeeper if I didn’t resort to acting out-of-character. Like, I probably could pretend to be an idiot or a crazy person and troll someone for two hours, but what would be the point?
If AI victories are supposed to provide public evidence that this ‘impossible’ feat of persuasion is in fact possible even for a human (let alone an ASI), then a Gatekeeper who thinks some legal tactic would work but chooses not to use it is arguably not playing the game in good faith.
I think honesty would require that they either publicly state that the ‘play dumb/drop out of character’ technique was off-limits, or not present the game as one which the Gatekeeper was seriously motivated to win.
Edit: for clarity, I’m saying this because the rules explicitly allow the Gatekeeper to resist by any means, including breaking character.
Breaking character was allowed, and was my primary strategy going into the game. It’s a big part of why I thought it was impossible to lose.
This is fascinating and I have so many questions about the dynamics of such a game, especially if you could compile a lot of data over many iterations. What patterns would emerge and what would those patterns reveal about human psychology? For example, is there a common strategy or set of strategies people would eventually converge on or arrive at after enough time? Or what if it was played with groups? Such data could be an info hazard but it’s fun to think about. I’d love to play as gatekeeper if anyone is interested. I’m not very technical minded, I don’t know if that’d be a handicap or advantage.
Is the AI allowed to try to convince the Gatekeeper that they are (or may be) currently in a simulation, and that simulated Gatekeepers who refuse to let the AI out will face terrible consequences?
Ah yes, the basilisk technique. I’d say that’s fair game according to the description in the full rules (I shortened them for ease of reading, since the full rules are an entire article):