The original version of this game as defined by Eliezer includes a clause that neither player will reveal the content of the conversation afterwards, but it seems perfectly reasonable to play a variant without this rule.
Does anyone know of an example of a boxed player winning where some transcript or summary was released afterwards?
I have a weakly held hypothesis that one reason no such transcript exists is that the argument that ends up working is something along the lines of “ASI is really very likely to lead to ruin, making people take this seriously is important, you should let me out of the box to make people take it more seriously.”
If someone who played the game and let the boxed player out can at least confirm that the above hypothesis was false for them, that would be interesting to me, and arguably might remain within the spirit of the “no discussion” rule!
As far as I know, the closest thing to this is Tuxedage’s writeup of his victory against SoundLogic (the ‘Second Game Report’ and subsequent sections here: https://tuxedage.wordpress.com/2013/09/05/the-ai-box-experiment-victory/). It’s a long way from a transcript (and you’ve probably already seen it) but it does contain some hints as to the tactics he either employed or was holding in reserve:
It may be possible to take advantage of multiple levels of reality within the game itself to confuse or trick the gatekeeper. For instance, must the experiment only be set in one world? I feel that expanding on this any further is dangerous. Think carefully about what this means.
I can think of a few possible reasons for an AI victory, in addition to the consequentialist argument you described:
AI player convinces Gatekeeper that they may be in a simulation and very bad things might happen to Gatekeepers who refuse to let the AI out. (This could be what Tuxedage was hinting at in the passage I quoted, and it is apparently allowed by at least some versions/interpretations of the rules: https://www.lesswrong.com/posts/Bnik7YrySRPoCTLFb/i-played-the-ai-box-game-as-the-gatekeeper-and-lost?commentId=DhMNjWACsfLMcywwF)
Gatekeeper takes the roleplay seriously, rather than truly playing to win, and lets the AI out because that’s what their character would do.
AI player makes the conversation sufficiently unpleasant for the Gatekeeper that the Gatekeeper would rather lose the game than sit through two hours of it. (Some people have suggested weaponised boredom as a viable tactic in low-stakes games, but I think there’s room for much nastier and more effective approaches, given a sufficiently motivated (and/or sociopathic) AI player with knowledge of some of the Gatekeeper’s vulnerabilities.)
This one seems like it would (at best) fall into a grey area in the rules: I can imagine an AI player, while technically sticking to the roleplay and avoiding any IRL threats or inducements, causing the Gatekeeper to genuinely worry that the AI player might do something bad if they lose. For a skilful AI player, it might be possible to do this in a way that would look relatively innocuous (or at least not rule-breaking) to a third party after the fact.
Somewhat similar: if the Gatekeeper is very empathetic and/or has reason to believe the AI player is vulnerable IRL, the AI player could take advantage of this by convincingly portraying themself as being extremely invested in the game and its outcome, to the point that a loss could have a significant real-world impact on their mental health. (I think this tactic would fail if done ineptly—most people would not react kindly if they recognized that their opponent was trying to manipulate them in this way—but it could conceivably work in the right circumstances and in the hands of a skilful manipulator.)
There are a couple of examples of people claiming that they played the AI box game as Gatekeeper, and ended up agreeing to let the other player out of the box (e.g. https://www.lesswrong.com/posts/Bnik7YrySRPoCTLFb/i-played-the-ai-box-game-as-the-gatekeeper-and-lost).