AI Boxing (Containment)

Tag. Last edit: 12 Sep 2020 5:17 UTC by habryka

AI Boxing refers to attempts, experiments, or proposals to isolate (“box”) a powerful AI (~AGI) so that it can’t interact with the world at large, save for limited communication with its human liaison. It is often proposed that so long as the AI is physically isolated and restricted, or “boxed”, it will be harmless even if it is an unfriendly artificial intelligence (UAI).

The challenges are: 1) can you successfully prevent it from interacting with the world? And 2) can you prevent it from convincing you to let it out?

See also: AI, AGI, Oracle AI, Tool AI, Unfriendly AI

Escaping the box

It is not regarded as likely that an AGI can be boxed in the long term. Since the AGI might be a superintelligence, it could persuade someone (most likely its human liaison) to free it from its box and thus from human control. Some practical ways of achieving this goal include:

Other, more speculative ways include: threatening to torture millions of conscious copies of you for thousands of years, each starting in exactly the same situation as you, so that it seems overwhelmingly likely that you are a simulation; or discovering and exploiting unknown physics to free itself.

Containing the AGI

Attempts to box an AGI may add some degree of safety to the development of a friendly artificial intelligence (FAI). A number of strategies for keeping an AGI in its box are discussed in Thinking inside the box and Leakproofing the Singularity. Among them are:

Simulations / Experiments

The AI Box Experiment is a game meant to explore the possible pitfalls of AI boxing. It is played over text chat, with one human roleplaying as an AI in a box, and another human roleplaying as a gatekeeper with the ability to let the AI out of the box. The AI player wins if they successfully convince the gatekeeper to let them out of the box, and the gatekeeper wins if the AI player has not been freed after a certain period of time.
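The win conditions described above can be sketched as a small state machine. This is only an illustrative sketch, not an official rule set: the class and method names (`BoxGame`, `advance`, `release_ai`) and the concrete time limit are assumptions introduced here, since the text says only "a certain period of time".

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    AI_WINS = "ai_wins"
    GATEKEEPER_WINS = "gatekeeper_wins"
    IN_PROGRESS = "in_progress"


@dataclass
class BoxGame:
    # Hypothetical parameter: the source does not fix a specific limit.
    time_limit_minutes: float
    elapsed_minutes: float = 0.0
    released: bool = False

    def advance(self, minutes: float) -> None:
        """Record that some chat time has passed."""
        self.elapsed_minutes += minutes

    def release_ai(self) -> None:
        """The gatekeeper lets the AI out; ignored once the game is over."""
        if self.outcome() is Outcome.IN_PROGRESS:
            self.released = True

    def outcome(self) -> Outcome:
        # The AI player wins only if freed before the time limit expires.
        if self.released:
            return Outcome.AI_WINS
        if self.elapsed_minutes >= self.time_limit_minutes:
            return Outcome.GATEKEEPER_WINS
        return Outcome.IN_PROGRESS
```

For example, a game created with `BoxGame(time_limit_minutes=120)` stays in progress until either `release_ai()` is called or 120 minutes have been recorded through `advance()`.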

Both Eliezer Yudkowsky and Justin Corwin have run simulations, pretending to be a superintelligence, and been able to convince a human playing a guard to let them out on many, but not all, occasions. Eliezer’s five experiments required the guard to listen for at least two hours and used participants who had approached him, while Corwin’s 26 experiments had no time limit and used subjects whom he had approached.

The text of Eliezer’s experiments has not been made public.

List of experiments


That Alien Message

Eliezer Yudkowsky, 22 May 2008 5:55 UTC
313 points
173 comments, 10 min read

Cryptographic Boxes for Unfriendly AI

paulfchristiano, 18 Dec 2010 8:28 UTC
55 points
162 comments, 5 min read

The AI in a box boxes you

Stuart_Armstrong, 2 Feb 2010 10:10 UTC
164 points
390 comments, 1 min read

How it feels to have your mind hacked by an AI

blaked, 12 Jan 2023 0:33 UTC
333 points
216 comments, 17 min read

I attempted the AI Box Experiment (and lost)

Tuxedage, 21 Jan 2013 2:59 UTC
77 points
245 comments, 5 min read

I attempted the AI Box Experiment again! (And won—Twice!)

Tuxedage, 5 Sep 2013 4:49 UTC
74 points
168 comments, 12 min read

How To Win The AI Box Experiment (Sometimes)

pinkgothic, 12 Sep 2015 12:34 UTC
51 points
21 comments, 22 min read

AI Alignment Prize: Super-Boxing

X4vier, 18 Mar 2018 1:03 UTC
16 points
6 comments, 6 min read

Dreams of Friendliness

Eliezer Yudkowsky, 31 Aug 2008 1:20 UTC
26 points
81 comments, 9 min read

The Strangest Thing An AI Could Tell You

Eliezer Yudkowsky, 15 Jul 2009 2:27 UTC
125 points
609 comments, 2 min read

[Question] Is keeping AI “in the box” during training enough?

tgb, 6 Jul 2021 15:17 UTC
7 points
10 comments, 1 min read

I wanted to interview Eliezer Yudkowsky but he’s busy so I simulated him instead

lsusr, 16 Sep 2021 7:34 UTC
111 points
33 comments, 5 min read

Boxing an AI?

tailcalled, 27 Mar 2015 14:06 UTC
3 points
39 comments, 1 min read

[Intro to brain-like-AGI safety] 11. Safety ≠ alignment (but they’re close!)

Steven Byrnes, 6 Apr 2022 13:39 UTC
32 points
1 comment, 10 min read

Multiple AIs in boxes, evaluating each other’s alignment

Moebius314, 29 May 2022 8:36 UTC
7 points
0 comments, 14 min read

Loose thoughts on AGI risk

Yitz, 23 Jun 2022 1:02 UTC
7 points
3 comments, 1 min read

[Question] AI Box Experiment: Are people still interested?

Double, 31 Aug 2022 3:04 UTC
31 points
13 comments, 1 min read

LOVE in a simbox is all you need

jacob_cannell, 28 Sep 2022 18:25 UTC
66 points
69 comments, 44 min read

My take on Jacob Cannell’s take on AGI safety

Steven Byrnes, 28 Nov 2022 14:01 UTC
63 points
14 comments, 30 min read

Side-channels: input versus output

davidad, 12 Dec 2022 12:32 UTC
35 points
16 comments, 2 min read

I Am Scared of Posting Negative Takes About Bing’s AI

Yitz, 17 Feb 2023 20:50 UTC
63 points
27 comments, 1 min read

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher King, 15 Mar 2023 0:29 UTC
117 points
22 comments, 2 min read

“Don’t even think about hell”

emmab, 2 May 2020 8:06 UTC
6 points
2 comments, 1 min read

Counterfactual Oracles = online supervised learning with random selection of training episodes

Wei_Dai, 10 Sep 2019 8:29 UTC
48 points
26 comments, 3 min read

Epiphenomenal Oracles Ignore Holes in the Box

SilentCal, 31 Jan 2018 20:08 UTC
15 points
8 comments, 2 min read

I played the AI Box Experiment again! (and lost both games)

Tuxedage, 27 Sep 2013 2:32 UTC
59 points
123 comments, 11 min read

AIs and Gatekeepers Unite!

Eliezer Yudkowsky, 9 Oct 2008 17:04 UTC
14 points
163 comments, 1 min read

Results of $1,000 Oracle contest!

Stuart_Armstrong, 17 Jun 2020 17:44 UTC
58 points
2 comments, 1 min read

Contest: $1,000 for good questions to ask to an Oracle AI

Stuart_Armstrong, 31 Jul 2019 18:48 UTC
57 points
156 comments, 3 min read

Oracles, sequence predictors, and self-confirming predictions

Stuart_Armstrong, 3 May 2019 14:09 UTC
22 points
0 comments, 3 min read

Self-confirming prophecies, and simplified Oracle designs

Stuart_Armstrong, 28 Jun 2019 9:57 UTC
10 points
1 comment, 5 min read

How to escape from your sandbox and from your hardware host

PhilGoetz, 31 Jul 2015 17:26 UTC
43 points
28 comments, 1 min read

Oracle paper

Stuart_Armstrong, 13 Dec 2017 14:59 UTC
12 points
7 comments, 1 min read

Breaking Oracles: superrationality and acausal trade

Stuart_Armstrong, 25 Nov 2019 10:40 UTC
25 points
15 comments, 1 min read

Oracles: reject all deals—break superrationality, with superrationality

Stuart_Armstrong, 5 Dec 2019 13:51 UTC
20 points
4 comments, 8 min read

Superintelligence 13: Capability control methods

KatjaGrace, 9 Dec 2014 2:00 UTC
14 points
48 comments, 6 min read

Quantum AI Box

Gurkenglas, 8 Jun 2018 16:20 UTC
4 points
15 comments, 1 min read

AI-Box Experiment—The Acausal Trade Argument

XiXiDu, 8 Jul 2011 9:18 UTC
13 points
20 comments, 2 min read

Safely and usefully spectating on AIs optimizing over toy worlds

AlexMennen, 31 Jul 2018 18:30 UTC
24 points
16 comments, 2 min read

Analysing: Dangerous messages from future UFAI via Oracles

Stuart_Armstrong, 22 Nov 2019 14:17 UTC
22 points
16 comments, 4 min read

[Question] Is there a simple parameter that controls human working memory capacity, which has been set tragically low?

Liron, 23 Aug 2019 22:10 UTC
15 points
8 comments, 1 min read

xkcd on the AI box experiment

FiftyTwo, 21 Nov 2014 8:26 UTC
28 points
232 comments, 1 min read

Containing the AI… Inside a Simulated Reality

HumaneAutomation, 31 Oct 2020 16:16 UTC
1 point
9 comments, 2 min read

AI box: AI has one shot at avoiding destruction—what might it say?

ancientcampus, 22 Jan 2013 20:22 UTC
25 points
355 comments, 1 min read

AI Box Log

Dorikka, 27 Jan 2012 4:47 UTC
24 points
30 comments, 23 min read

[Question] Danger(s) of theorem-proving AI?

Yitz, 16 Mar 2022 2:47 UTC
8 points
8 comments, 1 min read

An AI-in-a-box success model

azsantosk, 11 Apr 2022 22:28 UTC
16 points
1 comment, 10 min read

Another argument that you will let the AI out of the box

Garrett Baker, 19 Apr 2022 21:54 UTC
8 points
16 comments, 2 min read

Pivotal acts using an unaligned AGI?

Simon Fischer, 21 Aug 2022 17:13 UTC
25 points
3 comments, 7 min read

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad, 21 Jun 2022 12:36 UTC
11 points
7 comments, 9 min read

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad, 9 Jul 2022 14:42 UTC
14 points
5 comments, 22 min read

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

10 Aug 2022 18:14 UTC
26 points
30 comments, 11 min read

Dissected boxed AI

Nathan1123, 12 Aug 2022 2:37 UTC
−8 points
2 comments, 1 min read

An Uncanny Prison

Nathan1123, 13 Aug 2022 21:40 UTC
3 points
3 comments, 2 min read

Gatekeeper Victory: AI Box Reflection

9 Sep 2022 21:38 UTC
4 points
5 comments, 9 min read

How to Study Unsafe AGI’s safely (and why we might have no choice)

Punoxysm, 7 Mar 2014 7:24 UTC
10 points
47 comments, 5 min read

Smoke without fire is scary

Adam Jermyn, 4 Oct 2022 21:08 UTC
49 points
22 comments, 4 min read

Another problem with AI confinement: ordinary CPUs can work as radio transmitters

RomanS, 14 Oct 2022 8:28 UTC
34 points
1 comment, 1 min read

Anthropomorphic AI and Sandboxed Virtual Universes

jacob_cannell, 3 Sep 2010 19:02 UTC
4 points
124 comments, 5 min read

Sandboxing by Physical Simulation?

moridinamael, 1 Aug 2018 0:36 UTC
12 points
4 comments, 1 min read

Decision theory does not imply that we get to have nice things

So8res, 18 Oct 2022 3:04 UTC
150 points
53 comments, 26 min read

Prosaic misalignment from the Solomonoff Predictor

Cleo Nardo, 9 Dec 2022 17:53 UTC
40 points
2 comments, 5 min read

I’ve updated towards AI boxing being surprisingly easy

Noosphere89, 25 Dec 2022 15:40 UTC
6 points
20 comments, 2 min read

[Question] Oracle AGI—How can it escape, other than security issues? (Steganography?)

RationalSieve, 25 Dec 2022 20:14 UTC
3 points
5 comments, 1 min read

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King, 20 Feb 2023 15:11 UTC
12 points
10 comments, 1 min read

ChatGPT getting out of the box

qbolec, 16 Mar 2023 13:47 UTC
6 points
3 comments, 1 min read