AI Boxing (Containment)

TagLast edit: 12 Sep 2020 5:17 UTC by habryka

AI Boxing is attempts, experiments, or proposals to isolate (“box”) a powerful AI (~AGI) where it can’t interact with the world at large, save for limited communication with its human liaison. It is often proposed that so long as the AI is physically isolated and restricted, or “boxed”, it will be harmless even if it is an unfriendly artificial intelligence (UAI).

Challenges are: 1) can you successively prevent it from interacting with the world? And 2) can you prevent it from convincing you to let it out?

See also: AI, AGI, Oracle AI, Tool AI, Unfriendly AI

Escaping the box

It is not regarded as likely that an AGI can be boxed in the long term. Since the AGI might be a superintelligence, it could persuade someone (the human liaison, most likely) to free it from its box and thus, human control. Some practical ways of achieving this goal include:

Offering enormous wealth, power and intelligence to its liberator
Claiming that only it can prevent an existential risk
Claiming it needs outside resources to cure all diseases
Predicting a real-world disaster (which then occurs), then claiming it could have been prevented had it been let out

Other, more speculative ways include: threatening to torture millions of conscious copies of you for thousands of years, starting in exactly the same situation as in such a way that it seems overwhelmingly likely that you are a simulation, or it might discover and exploit unknown physics to free itself.

Containing the AGI

Attempts to box an AGI may add some degree of safety to the development of a friendly artificial intelligence (FAI). A number of strategies for keeping an AGI in its box are discussed in Thinking inside the box and Leakproofing the Singularity. Among them are:

Physically isolating the AGI and permitting it zero control of any machinery
Limiting the AGI’s outputs and inputs with regards to humans
Programming the AGI with deliberately convoluted logic or homomorphically encrypting portions of it
Periodic resets of the AGI’s memory
A virtual world between the real world and the AI, where its unfriendly intentions would be first revealed
Motivational control using a variety of techniques
Creating an Oracle AI: an AI that only answers questions and isn’t designed to interact with the world in any other way. But even the act of the AI putting strings of text in front of humans poses some risk.

Simulations / Experiments

The AI Box Experiment is a game meant to explore the possible pitfalls of AI boxing. It is played over text chat, with one human roleplaying as an AI in a box, and another human roleplaying as a gatekeeper with the ability to let the AI out of the box. The AI player wins if they successfully convince the gatekeeper to let them out of the box, and the gatekeeper wins if the AI player has not been freed after a certain period of time.

Both Eliezer Yudkowsky and Justin Corwin have ran simulations, pretending to be a superintelligence, and been able to convince a human playing a guard to let them out on many—but not all—occasions. Eliezer’s five experiments required the guard to listen for at least two hours with participants who had approached him, while Corwin’s 26 experiments had no time limit and subjects he approached.

The text of Eliezer’s experiments have not been made public.

List of experiments

The AI-Box Experiment Eliezer Yudkowsky’s original two tests
Shut up and do the impossible!, three other experiments Eliezer ran
AI Boxing, 26 trials ran by Justin Corwin
AI Box Log, a log of a trial between MileyCyrus and Dorikka

References

Thinking inside the box: using and controlling an Oracle AI by Stuart Armstrong, Anders Sandberg, and Nick Bostrom
Leakproofing the Singularity: Artificial Intelligence Confinement Problem by Roman V. Yampolskiy
On the Difficulty of AI Boxing by Paul Christiano
Cryptographic Boxes for Unfriendly AI by Paul Christiano
The Strangest Thing An AI Could Tell You
The AI in a box boxes you

That Alien Message

Eliezer Yudkowsky22 May 2008 5:55 UTC

435 points

178 comments10 min readLW link

Cryptographic Boxes for Unfriendly AI

paulfchristiano18 Dec 2010 8:28 UTC

79 points

162 comments5 min readLW link

How it feels to have your mind hacked by an AI

blaked12 Jan 2023 0:33 UTC

374 points

222 comments17 min readLW link

The AI in a box boxes you

Stuart_Armstrong2 Feb 2010 10:10 UTC

178 points

391 comments1 min readLW link

The case for training frontier AIs on Sumerian-only corpus

Alexandre Variengien, Charbel-Raphaël and Jonathan Claybrough

15 Jan 2024 16:40 UTC

143 points

16 comments3 min readLW link

That Alien Message—The Animation

Writer7 Sep 2024 14:53 UTC

144 points

10 comments8 min readLW link

(youtu.be)

I attempted the AI Box Experiment (and lost)

Tuxedage21 Jan 2013 2:59 UTC

81 points

246 comments5 min readLW link

[Question] Boxing

Zach Stein-Perlman2 Aug 2023 23:38 UTC

6 points

1 comment1 min readLW link

Boxing an AI?

tailcalled27 Mar 2015 14:06 UTC

3 points

39 comments1 min readLW link

The Strangest Thing An AI Could Tell You

Eliezer Yudkowsky15 Jul 2009 2:27 UTC

137 points

616 comments2 min readLW link

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher King15 Mar 2023 0:29 UTC

116 points

22 comments2 min readLW link

I Am Scared of Posting Negative Takes About Bing’s AI

Yitz17 Feb 2023 20:50 UTC

63 points

29 comments1 min readLW link

[Question] Is keeping AI “in the box” during training enough?

tgb6 Jul 2021 15:17 UTC

7 points

10 comments1 min readLW link

Loose thoughts on AGI risk

Yitz23 Jun 2022 1:02 UTC

7 points

3 comments1 min readLW link

AI Alignment Prize: Super-Boxing

X4vier18 Mar 2018 1:03 UTC

16 points

6 comments6 min readLW link

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

Peter S. Park, NickyP and Stephen Fowler

10 Aug 2022 18:14 UTC

28 points

30 comments11 min readLW link

[Question] Why do so many think deception in AI is important?

Prometheus13 Jan 2024 8:14 UTC

24 points

12 comments1 min readLW link

My take on Jacob Cannell’s take on AGI safety

Steven Byrnes28 Nov 2022 14:01 UTC

72 points

15 comments30 min readLW link 1 review

Multiple AIs in boxes, evaluating each other’s alignment

Moebius31429 May 2022 8:36 UTC

8 points

0 comments14 min readLW link

[Intro to brain-like-AGI safety] 11. Safety ≠ alignment (but they’re close!)

Steven Byrnes6 Apr 2022 13:39 UTC

36 points

1 comment10 min readLW link

I attempted the AI Box Experiment again! (And won—Twice!)

Tuxedage5 Sep 2013 4:49 UTC

79 points

168 comments12 min readLW link

Formal confinement prototype

Quinn24 Nov 2025 12:57 UTC

8 points

0 comments1 min readLW link

(github.com)

Side-channels: input versus output

davidad12 Dec 2022 12:32 UTC

44 points

16 comments2 min readLW link

[Question] AI Box Experiment: Are people still interested?

Double31 Aug 2022 3:04 UTC

30 points

13 comments1 min readLW link

I wanted to interview Eliezer Yudkowsky but he’s busy so I simulated him instead

lsusr16 Sep 2021 7:34 UTC

120 points

33 comments5 min readLW link

[Question] Why isn’t AI containment the primary AI safety strategy?

Oliver Kuperman5 Feb 2025 3:54 UTC

1 point

3 comments3 min readLW link

LOVE in a simbox is all you need

jacob_cannell28 Sep 2022 18:25 UTC

66 points

73 comments44 min readLW link 1 review

How To Win The AI Box Experiment (Sometimes)

pinkgothic12 Sep 2015 12:34 UTC

56 points

21 comments22 min readLW link

Thoughts on “Process-Based Supervision” / MONA

Steven Byrnes17 Jul 2023 14:08 UTC

79 points

4 comments23 min readLW link

Dreams of Friendliness

Eliezer Yudkowsky31 Aug 2008 1:20 UTC

29 points

81 comments9 min readLW link

Oracles: reject all deals—break superrationality, with superrationality

Stuart_Armstrong5 Dec 2019 13:51 UTC

20 points

4 comments8 min readLW link

Results of $1,000 Oracle contest!

Stuart_Armstrong17 Jun 2020 17:44 UTC

60 points

2 comments1 min readLW link

An Uncanny Prison

Nathan112313 Aug 2022 21:40 UTC

3 points

3 comments2 min readLW link

[Question] Is there a simple parameter that controls human working memory capacity, which has been set tragically low?

Liron23 Aug 2019 22:10 UTC

17 points

8 comments1 min readLW link

An AI-in-a-box success model

azsantosk11 Apr 2022 22:28 UTC

16 points

1 comment10 min readLW link

An AI, a box, and a threat

jwfiredragon7 Mar 2024 6:15 UTC

10 points

0 comments6 min readLW link

Oracles, sequence predictors, and self-confirming predictions

Stuart_Armstrong3 May 2019 14:09 UTC

22 points

0 comments3 min readLW link

Protecting against sudden capability jumps during training

Nikola Jurkovic2 Dec 2023 4:22 UTC

15 points

2 comments2 min readLW link

[Question] Danger(s) of theorem-proving AI?

Yitz16 Mar 2022 2:47 UTC

8 points

8 comments1 min readLW link

Random safe AGI idea dump

sig2 Oct 2025 10:16 UTC

−3 points

0 comments3 min readLW link

How to safely use an optimizer

Simon Fischer28 Mar 2024 16:11 UTC

47 points

22 comments7 min readLW link

The Essentialism of Lesswrong

milanrosko31 Dec 2025 17:34 UTC

−45 points

6 comments1 min readLW link

Breaking Oracles: superrationality and acausal trade

Stuart_Armstrong25 Nov 2019 10:40 UTC

26 points

15 comments1 min readLW link

Oracle paper

Stuart_Armstrong13 Dec 2017 14:59 UTC

12 points

7 comments1 min readLW link

Selfhood as Scarcity: A Paradox of AGI Containment

fabkosta2 Jun 2026 18:38 UTC

1 point

0 comments2 min readLW link

Smoke without fire is scary

Adam Jermyn4 Oct 2022 21:08 UTC

52 points

22 comments4 min readLW link

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition21 Jun 2023 8:08 UTC

2 points

16 comments14 min readLW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad21 Jun 2022 12:36 UTC

13 points

7 comments9 min readLW link

Title: Retro Game Environment as AGI Containment — Security Through Architectural Absence

Rahim3 Mar 2026 0:49 UTC

1 point

0 comments3 min readLW link

[Question] Oracle AGI—How can it escape, other than security issues? (Steganography?)

RationalSieve25 Dec 2022 20:14 UTC

3 points

6 comments1 min readLW link

I played the AI Box Experiment again! (and lost both games)

Tuxedage27 Sep 2013 2:32 UTC

62 points

123 comments11 min readLW link

Ode to Tunnel Vision

Tom N.25 Sep 2025 14:24 UTC

1 point

0 comments10 min readLW link

Quantum AI Box

Gurkenglas8 Jun 2018 16:20 UTC

4 points

15 comments1 min readLW link

Analysing: Dangerous messages from future UFAI via Oracles

Stuart_Armstrong22 Nov 2019 14:17 UTC

22 points

16 comments4 min readLW link

Anthropomorphic AI and Sandboxed Virtual Universes

jacob_cannell3 Sep 2010 19:02 UTC

4 points

124 comments5 min readLW link

How to Study Unsafe AGI’s safely (and why we might have no choice)

Punoxysm7 Mar 2014 7:24 UTC

10 points

47 comments5 min readLW link

[Question] AI box question

KvmanThinking4 Dec 2024 19:03 UTC

2 points

2 comments1 min readLW link

Information-Theoretic Boxing of Superintelligences

JustinShovelain and Elliot Mckernon

30 Nov 2023 14:31 UTC

31 points

0 comments7 min readLW link

Safely and usefully spectating on AIs optimizing over toy worlds

AlexMennen31 Jul 2018 18:30 UTC

24 points

16 comments2 min readLW link

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King20 Feb 2023 15:11 UTC

27 points

15 comments1 min readLW link

Decision theory does not imply that we get to have nice things

So8res18 Oct 2022 3:04 UTC

165 points

76 comments26 min readLW link 2 reviews

Self-shutdown AI

Jan Betley21 Aug 2023 16:48 UTC

13 points

2 comments2 min readLW link

AIs and Gatekeepers Unite!

Eliezer Yudkowsky9 Oct 2008 17:04 UTC

14 points

163 comments1 min readLW link

xkcd on the AI box experiment

FiftyTwo21 Nov 2014 8:26 UTC

28 points

234 comments1 min readLW link

ChatGPT getting out of the box

qbolec16 Mar 2023 13:47 UTC

6 points

3 comments1 min readLW link

Pivotal acts using an unaligned AGI?

Simon Fischer21 Aug 2022 17:13 UTC

28 points

3 comments7 min readLW link

Self-confirming prophecies, and simplified Oracle designs

Stuart_Armstrong28 Jun 2019 9:57 UTC

10 points

1 comment5 min readLW link

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Buck26 Aug 2024 16:46 UTC

324 points

78 comments4 min readLW link 1 review

Epiphenomenal Oracles Ignore Holes in the Box

SilentCal31 Jan 2018 20:08 UTC

17 points

8 comments2 min readLW link

Another argument that you will let the AI out of the box

Garrett Baker19 Apr 2022 21:54 UTC

8 points

16 comments2 min readLW link

Prosaic misalignment from the Solomonoff Predictor

Cleo Nardo9 Dec 2022 17:53 UTC

43 points

3 comments5 min readLW link

I’ve updated towards AI boxing being surprisingly easy

Noosphere8925 Dec 2022 15:40 UTC

8 points

20 comments2 min readLW link

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad9 Jul 2022 14:42 UTC

15 points

5 comments22 min readLW link

Ideas for studies on AGI risk

dr_s20 Apr 2023 18:17 UTC

5 points

1 comment11 min readLW link

Disproving and partially fixing a fully homomorphic encryption scheme with perfect secrecy

Lysandre Terrisse26 May 2024 14:56 UTC

16 points

1 comment18 min readLW link

Contest: $1,000 for good questions to ask to an Oracle AI

Stuart_Armstrong31 Jul 2019 18:48 UTC

59 points

154 comments3 min readLW link

“Don’t even think about hell”

emmab2 May 2020 8:06 UTC

6 points

2 comments1 min readLW link

[FICTION] Unboxing Elysium: An AI’S Escape

Super AGI10 Jun 2023 4:41 UTC

−16 points

4 comments14 min readLW link

Containing the AI… Inside a Simulated Reality

HumaneAutomation31 Oct 2020 16:16 UTC

1 point

9 comments2 min readLW link

How to escape from your sandbox and from your hardware host

PhilGoetz31 Jul 2015 17:26 UTC

43 points

28 comments1 min readLW link

Planning to build a cryptographic box with perfect secrecy

Lysandre Terrisse31 Dec 2023 9:31 UTC

40 points

6 comments11 min readLW link

Counterfactual Oracles = online supervised learning with random selection of training episodes

Wei Dai10 Sep 2019 8:29 UTC

52 points

26 comments3 min readLW link

A Symbolic Model for Recursive Interpretation and Containment in LLMs

Desjuan13 Jul 2025 19:33 UTC

1 point

0 comments1 min readLW link

AI Box Log

Nissa Seru27 Jan 2012 4:47 UTC

25 points

31 comments23 min readLW link

Another problem with AI confinement: ordinary CPUs can work as radio transmitters

RomanS14 Oct 2022 8:28 UTC

36 points

1 comment1 min readLW link

(news.softpedia.com)

Dissected boxed AI

Nathan112312 Aug 2022 2:37 UTC

−8 points

2 comments1 min readLW link

A Pluralistic Framework for Rogue AI Containment

TheThinkingArborist22 Mar 2025 12:54 UTC

1 point

0 comments7 min readLW link

Provably Safe AI: Worldview and Projects

Ben Goldhaber and Steve_Omohundro

9 Aug 2024 23:21 UTC

58 points

44 comments7 min readLW link

Gatekeeper Victory: AI Box Reflection

Double and DaemonicSigil

9 Sep 2022 21:38 UTC

7 points

6 comments9 min readLW link

FHE Can’t Save Us: The Case Against Cryptographic AI Boxing

Bart Jaworski6 Aug 2024 17:46 UTC

6 points

0 comments6 min readLW link

AI box: AI has one shot at avoiding destruction—what might it say?

ancientcampus22 Jan 2013 20:22 UTC

25 points

355 comments1 min readLW link

AI-Box Experiment—The Acausal Trade Argument

XiXiDu8 Jul 2011 9:18 UTC

14 points

20 comments2 min readLW link

Superintelligence 13: Capability control methods

KatjaGrace9 Dec 2014 2:00 UTC

14 points

48 comments6 min readLW link

Sandboxing by Physical Simulation?

moridinamael1 Aug 2018 0:36 UTC

12 points

4 comments1 min readLW link

Ruby 12 Sep 2020 5:11 UTC
2 points
0
from the original talk page

Talk:AI boxing
If an SF reference is not considered a faux pas, this reminds me of John Barnes ( https://en.wikipedia.org/wiki/John_Barnes_%28author%29 ) “Meme Wars”. The way One True infected humanity is, if possible, an obvious attack vector for a sufficiently powerful AI. -- Resuna (talk) 10:20, 27 November 2014 (AEDT)

AI Box­ing (Con­tain­ment)

Escaping the box

Containing the AGI

Simulations /​ Experiments

List of experiments

References

AI Boxing (Containment)

Simulations / Experiments