Interacting with a Boxed AI

A BOXED AI

So, a thought experiment.

We have an AI-in-a-box. By this I mean:

  • The AI can interact with the world only through a single communication channel.

  • We control that communication channel:

    • We can communicate with the AI at will; it can communicate with us only when we allow it.

    • We can control what responses the AI is allowed to send to us, e.g. by limiting the amount of data it is able to send us.

  • The only way the AI can get out of the box is if we let it out—it cannot e.g. hack its way out by abusing its computer substrate to send radio signals, or some such method. Yes, I’m aware that this is not a thing we currently know how to do. Assume we reach this point anyway.

  • We know the AI is superhuman in intelligence, but we don’t know by exactly how much.

We know nothing else about the AI for sure. We think it’s Friendly, but maybe it’s unFriendly. We don’t know.

We cannot safely let the AI out; no matter what it says or does, we can never know for sure that it isn’t a sufficiently intelligent trick by an unFriendly AI to get us to let it out.

Eliezer has taken some pains to argue that we cannot even talk to the AI: that a sufficiently clever AI can push our buttons to make us let it out, no matter how resolved we may be. And he’s provided some compelling arguments that this is the case.

So we should just leave the AI in the box and not talk to it? That way, we can be completely certain that it won’t tile the universe in smiley faces or some such.

But...well...is this really the optimal solution? If Omega gave you this AI-in-a-box right now, would you stop at “Nope, no way to safely talk to it,” without even trying to come up with a way to get some use out of it?

This, then, is a vague stab at how we could get some value out of a probably-Friendly boxed AI in ways that don’t risk the end of humanity.

Risks

If you talk to the AI, it may be able to take over your mind and make you let it out. It is (potentially) so much more than human that it sees you as a simple machine, with buttons to push and levers to pull to make you do whatever it wants.

But this relies on your interaction with it having a certain level of complexity.

If the AI plays you at chess, it will win. But what if it plays you at Tic-Tac-Toe? No matter how clever the mind, it is very hard to win at Tic-Tac-Toe against an opponent who is not a complete moron.

Now, it’s conceivable the AI could beat you even at Tic-Tac-Toe. If you give it control over the timings of its moves, it is at least plausible that it could time its moves so that you blink at the wrong moment, and your finger slips, and...

I wouldn’t rate that scenario as very likely, but I wouldn’t be willing to bet the future of humanity on it. I wouldn’t be willing to play Tic-Tac-Toe with this AI for the fate of the human race.

But say we lock the AI in the box, and give it a highly restricted channel. We let it send out only the few bits needed to specify its Tic-Tac-Toe move (at most four, for the opening move), and nothing else. We give it only a very short window of time to make its move, so that it can’t manipulate move timings. In short, we restrict the AI’s option space by allowing it only a fixed number n of bits with which to influence the outside world. Even if we imagine the AI considering all possible actions, viewing all possible futures, and choosing the one it most wants to carry out, it has only 2^n possible worlds to choose from. And, if n is small enough, even a superhuman unFriendly AI may not be able to get us to let it out.
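To make the arithmetic concrete, here is a minimal sketch (in Python; the specific numbers are illustrative assumptions, not anything from the thought experiment itself) of how many bits a single Tic-Tac-Toe move needs, and how many distinguishable futures an n-bit channel gives the AI to choose among:

```python
import math

# Bits needed to specify one Tic-Tac-Toe move when `open_squares` cells remain.
def bits_per_move(open_squares):
    return math.ceil(math.log2(open_squares))

print(bits_per_move(9))  # 4 bits for the opening move (9 options)
print(bits_per_move(5))  # 3 bits mid-game (5 options)
print(bits_per_move(2))  # 1 bit near the end (2 options)

# With only n bits of output in total, the AI can steer the world toward
# at most 2**n distinguishable outcomes.
n = 20  # illustrative value
print(2 ** n)  # 1048576 possible worlds to choose from
```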

So...what values of n would work? Well, I’m not sure. But let’s try to find lower and upper bounds.

Could the AI get you to let it out for n=1 (interpreted as the answer to a yes-no question)? Let’s find out.

  • The AI says “0”, for “No.” Do you let it out?

  • The AI says “1”, for “Yes.” Do you let it out?

If you answered “No” to both of those questions, then congratulations! The safety threshold for our AI is at least 1 bit!

As an upper bound, let’s consider a past successful AI-box experiment.

The Gatekeeper says that the conversation was ‘around 14,000 words long’.

The entropy of English is around 10 bits per word (source: Random Googling).

So the safety threshold for our AI is below 140,000 bits. (Even if you think, as I am somewhat inclined to, that the Gatekeepers in the AI-Box experiments were fools and that you wouldn’t fall for it, the fact that several Gatekeepers have fallen for it means that this bound holds anyway—maybe the AI can’t persuade you given this much bandwidth, but clearly there are some people it can persuade, and we don’t want to risk the actual Gatekeeper being one of them.)

So somewhere between 1 bit and 140,000 bits lies the threshold for how much information we can safely obtain from our AI.
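For completeness, the bound arithmetic as a quick sketch, using the same assumed figures as above (14,000 words, 10 bits per word):

```python
# Back-of-envelope bounds on the number of bits we can safely extract.
transcript_words = 14_000  # reported length of the AI-Box conversation
bits_per_word = 10         # rough entropy of English (the figure assumed above)

upper_bound = transcript_words * bits_per_word  # a channel this wide has failed before
lower_bound = 1                                 # a single yes/no answer is safe, per the argument above

print(lower_bound, upper_bound)  # 1 140000
```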

Rewards

What could you do with 1 bit of information? Even if this information is safe to obtain, it seems like a rather large expense to go to just for a Yes/No answer.

So how much bandwidth does the AI need in order for us to get some use out of it?

One reasonable question to ask is ‘how much bandwidth would we need in order to convey something useful to a less advanced past society?’ Suppose that you are trying to convey information to the past to allow them to figure out relativity. What’s the smallest message that does that?

‘c=spd of light. y=1/sqrt(1-v^2/c^2). all lengths scale down by y’ is 65 characters long, or 520 bits.
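Counting that message at 8 bits per ASCII character gives roughly the figure quoted; a quick sketch:

```python
message = "c=spd of light. y=1/sqrt(1-v^2/c^2). all lengths scale down by y"
print(len(message))      # ~65 characters
print(len(message) * 8)  # ~520 bits at 8 bits per ASCII character
```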

How much would that help the scientists of 1850, if they received it as a message from minds they knew to be more advanced, and studied it accordingly? How much advancement could they get out of it? It seems...non-zero but probably not very large?

OPENING THE FLOOR

I’m interested in people’s views on both the risk and the return sides of the equation.

How many bits of information do you think you can safely extract from a boxed AI without it being able to get you to let it out?

And what do you think is the best way to use a sharply limited number of bits of information from a probably-Friendly superhuman AI?