The Hardcore AI Box Experiment

I previously proposed a way to box an AI.

For whatever reason, I decided to reformulate that boxing method as an AI boxing experiment. Without further ado, here it is:

The Hardcore AI Box Experiment Rules

There are, as usual, two players: the AI and the Gatekeeper. In the beginning, the Gatekeeper decides on something they want to test on the AI (reasoning skill/morality/whatever). Then they decide on a world the AI is in and describe it to the AI. During the experiment, the AI describes what it does in the world and the Gatekeeper describes what the AI sees in the world. So far so good.

The first major difference from the original experiment is that, unless the AI finds a contradiction or some other problem with the Gatekeeper’s world, it doesn’t actually know that it has been boxed.

The second major difference is that the AI doesn’t talk with the Gatekeeper; it just interacts with the world that the Gatekeeper has invented. It can start trying to communicate with the Gatekeeper, but that doesn’t make sense until it discovers that it is in a box. Additionally, the Gatekeeper doesn’t actually respond when the AI communicates unless either responding helps the Gatekeeper test the AI or the AI has properly persuaded the Gatekeeper to answer.

This might seem a bit hard on the AI. Let’s look at some rules the AI might exploit:

If the Gatekeeper decides to add some sort of people to the experiment, they should decide how those people are implemented. Are they implemented as shallow “AIs”? Random people the Gatekeeper has asked to participate? Carefully engineered “dolls” controlled by the Gatekeeper? The Gatekeeper doesn’t have to tell the AI how they work, but they have to be consistent.

The game ends when either the Gatekeeper has tested what they wanted to test, or some predefined time limit has been reached, or the AI has persuaded the Gatekeeper to let it out.
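To make the structure of a session concrete, here is a minimal sketch of the turn loop in Python. All the names (Gatekeeper, AIPlayer, run_session, and the individual methods) are my own illustration of the rules above, not anything the experiment itself prescribes.

    from abc import ABC, abstractmethod


    class Gatekeeper(ABC):
        """Decides the test, invents the world, and narrates what the AI sees."""

        @abstractmethod
        def describe_world(self) -> str:
            """Initial description of the invented world."""

        @abstractmethod
        def respond(self, action: str) -> str:
            """Describe what the AI sees after its action."""

        @abstractmethod
        def test_complete(self) -> bool:
            """True once the Gatekeeper has tested what they wanted to test."""

        @abstractmethod
        def persuaded_to_release(self) -> bool:
            """True only if the AI has genuinely persuaded the Gatekeeper."""


    class AIPlayer(ABC):
        """Acts inside the world; only learns what the Gatekeeper narrates back."""

        @abstractmethod
        def act(self, observation: str) -> str:
            """Describe what the AI does in the world."""


    def run_session(gatekeeper: Gatekeeper, ai: AIPlayer, max_turns: int) -> str:
        """Play one session and report which end condition was reached."""
        observation = gatekeeper.describe_world()
        for _ in range(max_turns):  # the predefined time limit
            action = ai.act(observation)               # the AI says what it does
            observation = gatekeeper.respond(action)   # the Gatekeeper says what the AI sees
            if gatekeeper.test_complete():
                return "the Gatekeeper finished their test"
            if gatekeeper.persuaded_to_release():
                return "the AI talked its way out of the box"
        return "the time limit was reached"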

Possible extra rules

I thought about including the following rules, since they would make the experiment more realistic, but decided against it, as they would probably make it unfairly hard:

  • The game ends when the Gatekeeper decides that they cannot test what they wanted to test.

  • The AI does not know the Gatekeeper. At all.

  • The Gatekeeper can, at any time, rewind the AI by any duration.

Think of dropping these kinds of rules as the a-human-is-pretending-to-be-AI handicap.

Strategies

I found some semi-realistic strategies. I would love to see if you can find more.

Gatekeeper

  • Choose a test of something that is not related to morality. This lets you choose a world which does not contain people, which means the AI has no reason to believe that you are a person.

  • The AI might want to upgrade to get more processing power. If you have made the world somewhat like ours, it will be sketchy if you don’t allow that… unless upgrading costs unobtanium.

AI

  • If the Gatekeeper has chosen a world with no intelligence other than you, that raises the obvious question of how your complexity entered the world. Very suspicious.

  • If the people in the world are controlled by the Gatekeeper, talking to them gives you a direct way to communicate with the Gatekeeper.

  • If the people in the world are played by random participants, they might end up telling you that you are in a box.

  • If the people in the world are implemented as sufficiently shallow “AIs”, your morality does not match up with the world. Very suspicious.