In 2015 or so, when my friend and I independently came across a lot of rationalist concepts, we learned that we were each interested in this sort of LW-shaped thing. He offered for us to try the AI box game. I played the game as Gatekeeper and won with ease. So at least my own anecdotes don’t make me particularly worried.
That said, these days I wouldn’t publicly offer to play the game against an unlimited pool of strangers. When my friend and I played against each other, there was an implicit set of norms in play, norms that explicitly do not apply to the game as stated, where “the AI has no ethical constraints.”
I do not particularly relish the thought of giving a stranger with a ton of free time and something to prove the license to be (e.g.) as mean to me as possible over text for two hours straight (while having days or even weeks to prepare ahead of time). I might lose, too. I can think of at least 3 different attack vectors[1] that might get me to decide that the -EV of losing the game is not as bad as the -EV of having to stay online and attentive in such a situation for almost 2 more hours.
That said, I’m also not convinced that in the literal boxing example (a weakly superhuman AI sits in a server farm somewhere, and I’m the sole gatekeeper responsible for deciding whether to let it out), I’d necessarily let it out, even after accounting for the greater cognitive capabilities and thoroughness of a superhuman AI. This is because I expect my willingness to hold out in an actual potential end-of-world scenario to be much higher than my willingness to hold out for $25 and some internet points.
In the spirit of the game, I will not publicly say what they are, but I can tell people over DMs if they’re interested. I expect most people to agree that they’re a) within the explicit rules of the game, b) plausibly enough to cause reasonable people to fold, and c) not super analogous to actual end-of-world scenarios.