# I’ve updated towards AI boxing being surprisingly easy

Specifically, I think that sandboxes like an improved WASM could ensure that, conditional on careful data curation, side channels have 0 probability of giving the AI any information beyond what we give it.

I.e., I would predict 0 probability of the AI learning virology from a dataset if humans didn’t give it that data.

I also believe that this can be done with no performance penalty (unlike a lot of security measures).

My argument follows in 2 steps:

1. First, I’ll use davidad’s argument from the link below on how to create a sandbox (also known as a Cartesian Boundary) that prevents embedded computations from learning things through side channels.

2. Second, I’ll use a No Free Lunch argument to show why the bits that are left are insufficient to break the Cartesian Boundary/sandbox.

# Removing nondeterminism

The big piece of the argument is that we can prevent side channels from leaking information by removing all sources of nondeterminism. This requires us to remove clock access, networking APIs, and randomness.

This is the second condition, but what about the first? Davidad gives us the first condition:

1. We are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI was built by octopus minds made of organosilicon where atoms obey the Bohr model.

From davidad again on an example:

Regarding part 1, self-play in Go is an example par excellence and this may be one reason why people tend to have a strong intuition that arbitrarily strong AlphaZero fails to kill you.
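
The determinism condition above (no clock, no network, no randomness) can be made concrete with a replay test: a computation that touches none of those sources should produce bit-identical output when run again on the same input. The sketch below is only an illustration; `pure_step`, `leaky_step`, and `is_replayable` are hypothetical names, not part of davidad’s proposal, and a real sandbox would remove the nondeterministic APIs outright rather than test for them after the fact.

```python
import hashlib
import os


def pure_step(data: bytes) -> bytes:
    """Depends only on its input: no clock, no network, no randomness."""
    return hashlib.sha256(data).digest()


def leaky_step(data: bytes) -> bytes:
    """Mixes in OS entropy, standing in for any nondeterministic API."""
    return hashlib.sha256(data + os.urandom(16)).digest()


def is_replayable(step, data: bytes, runs: int = 3) -> bool:
    """Return True if repeated runs on the same input give identical bytes."""
    outputs = {step(data) for _ in range(runs)}
    return len(outputs) == 1
```

Passing a replay test like this is at best a necessary condition for determinism, not a sufficient one: a channel that happens to return the same value across a few runs would go undetected.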

# A No Free Lunch argument for why the remaining bits from random errors are incapable, even in theory, of breaking the sandbox/​Cartesian Boundary

Basically, it comes from Steven Byrnes’s explainer on the No Free Lunch theorem: if all data are equally likely, that is, if the world is entirely random, you cannot learn anything. Once we remove the channels of nondeterminism listed above, all that is left is random errors, such as a roughly 1-in-20-billion hardware error, cosmic rays, or simple random failure. But all of these bits are useless for breaking the sandbox, since again they’re random.

Here’s Steven Byrnes’s explainer on the topic:

https://www.lesswrong.com/posts/Dyt2TDdMGHDkXPcpp/the-no-free-lunch-theorem-for-dummies
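
The No Free Lunch claim can be checked exactly on a toy problem. The sketch below enumerates every possible binary labeling of a handful of points, shows a learner the first few labels, and scores it on the rest. The function names and the two toy learners are hypothetical, chosen only for illustration. It assumes, as the argument above does, that every labeling is equally likely; whether the leftover bits in a real system are actually uniform is a separate question that the code does not address.

```python
from itertools import product


def average_offset_accuracy(learner, n_points: int, n_train: int) -> float:
    """Average accuracy on unseen points over every possible labeling.

    Points are the integers 0..n_points-1. The first n_train are shown to
    the learner with their labels, and it is scored on the remaining ones.
    """
    test_points = range(n_train, n_points)
    total_correct = 0
    total_cases = 0
    for labels in product((0, 1), repeat=n_points):  # all 2**n_points worlds
        train = [(x, labels[x]) for x in range(n_train)]
        for x in test_points:
            total_correct += int(learner(train, x) == labels[x])
            total_cases += 1
    return total_correct / total_cases


def majority_learner(train, x):
    """Predict the most common training label (ties go to 1)."""
    ones = sum(label for _, label in train)
    return int(2 * ones >= len(train))


def copy_last_learner(train, x):
    """Predict whatever label was seen last in training."""
    return train[-1][1]
```

Both learners, and any other deterministic learner plugged in here, average exactly 0.5 on the unseen points: for each fixed training set, the unseen labels take both values equally often, so no prediction rule can beat a coin flip.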

# Appendix

Surprisingly, in order to optimize communication complexity, massive machine learning models forego at least some aspects of nondeterminism anyway, which is a good sign.

https://arxiv.org/pdf/1910.02054.pdf

And from davidad on how to make sure everything is deterministic, along with some important information specific to machine learning:

Enforcing determinism for machine learning will require more effort because of the involvement of GPUs. One must ensure that code is executed deterministically on the GPU as well as on the CPU, and that GPU/​CPU concurrency is appropriately synchronized (to enforce deterministic dataflow). But I claim this is still eminently doable, and with no noticeable performance penalty versus contemporary best practices for scalable ML, if an AI lab understood what determinism is and cared about it even a little bit.
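
One general reason ordering and synchronization matter here (a standard floating-point fact, not something the quoted comment spells out) is that float addition is not associative. If concurrent workers accumulate partial sums in whatever order they happen to finish, the result can differ from run to run. The minimal sketch below uses a hypothetical helper, `sum_in_order`, to show the order dependence on the CPU; fixing the reduction order is what restores reproducibility.

```python
def sum_in_order(values, order):
    """Add the values in the given index order using ordinary float addition."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [0.1, 0.2, 0.3]
left_to_right = sum_in_order(values, [0, 1, 2])
right_to_left = sum_in_order(values, [2, 1, 0])

# Same numbers, different order, different bits in the result.
order_matters = left_to_right != right_to_left
```

With the order held fixed, repeated runs agree bit for bit, which is the property a deterministic training pipeline needs from its reductions.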

• The problem is twofold. One, as and to the extent AI proliferates, you will eventually find someone who is less capable and careful about their sandboxing. Two, relatedly and more importantly, for much the same reason that people will not be satisfied with AIs without agency, they will weaken the sandboxing.

The STEM AI proposal referred to above can be used to illustrate this. If you want the AGI to do theoretical math you don’t need to tell it anything about the world. If you want it to cure cancer, you need to give it a lot of information about physics, chemistry and mammalian biology. And if you want it to win the war or the election, then you have to tell it about human society and how it works. And, as it competes with others, whoever has more real time and complete data is likely to win.

• If you want it to cure cancer, you need to give it a lot of information about physics, chemistry and mammalian biology.

This is much, much safer than elections or wars, since we can basically prevent it from learning human models.

And I should have made this explicit, but I believe sandboxing can be done in such a way that it incurs basically no performance penalty.

That is, I believe AI sandboxing is one of the most competitive proposals here that reduces the risk to arguably 0, in the STEM AI case.

• We are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI was built by octopus minds made of organosilicon where atoms obey the Bohr model.

In practice this isn’t going to be nearly as useful as an AI which does have access to the wealth of human knowledge. So whilst this might be useful for some sort of pivotal act, it’s not going to be a practical long term option for AI security.

Can you even think of a pivotal act an AI can help with that satisfies this criterion?

• STEM AI is one such plan that could be carried out with a secure sandbox, as long as we don’t give it data on humans or human models, or at least give it only the minimum amount of such data that is necessary, and as long as sandboxing lets us prevent escalation. Thus we control the data sources.

From Evhub’s post:

STEM AI is a very simple proposal in a similar vein to microscope AI. Whereas the goal of microscope AI was to avoid the potential problems inherent in building agents, the goal of STEM AI is to avoid the potential problems inherent in modeling humans. Specifically, the idea of STEM AI is to train a model purely on abstract science, engineering, and/​or mathematics problems while using transparency tools to ensure that the model isn’t thinking about anything outside its sandbox.

This approach has the potential to produce a powerful AI system—in terms of its ability to solve STEM problems—without relying on any human modeling. Not modeling humans could then have major benefits such as ensuring that the resulting model doesn’t have the ability to trick us to nearly the same extent as if it possessed complex models of human behavior. For a more thorough treatment of why avoiding human modeling could be quite valuable, see Ramana Kumar and Scott Garrabrant’s “Thoughts on Human Models.”

Evhub’s post is here:

And the Thoughts on Human Models post is here:

https://www.lesswrong.com/posts/BKjJJH2cRpJcAnP7T/thoughts-on-human-models

• Note that giving abstract STEM problems is very unlikely to give zero anthropological information to an AI. The very format and range of the problems is likely to reveal information about both human technology and human psychology.

Now I still agree that’s much more secure than giving it all the information it needs, but the claim of zero bits is pushing it.

• IMO neither Evan nor Scott nor anyone else has offered a plausible plan for using a STEM AI (that knows nothing about the existence of humans) to solve the big problem that someone else is going to build an unboxed non-STEM AI the next year.

• The key is that my sandbox (really davidad’s sandbox) requires very little or no performance loss, so even selfish actors would sandbox their AIs.

• Until you want to use the AGI to e.g. improve medicine...

• But all of these bits are useless for breaking the sandbox, since again they’re random.

This isn’t true in principle. Suppose you had floating point numbers, you could add, multiply and compare them, but you weren’t sure how they were represented internally. When you see a cosmic ray bitflip, you learn that only one bit needs to be flipped to produce these results. This is technically information. In practice not much info. But some.
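
To illustrate the point about bit flips revealing representation: in an IEEE 754 double, the observable effect of a single flipped bit depends heavily on which field the bit sits in. The sketch below assumes the usual case where Python floats are 64-bit IEEE 754 doubles; `flip_bit` is a hypothetical helper written for this example.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of x's 64-bit IEEE 754 encoding."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


sign_flip = flip_bit(1.0, 63)      # sign bit: the value is negated
exponent_flip = flip_bit(1.0, 52)  # lowest exponent bit: the value is halved
mantissa_flip = flip_bit(1.0, 0)   # lowest mantissa bit: a tiny nudge upward
```

An observer who sees 1.0 turn into -1.0, 0.5, or a value one ulp away has learned something about the layout of the number format, which is the small but nonzero amount of information described above.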

• no noticeable performance penalty

I think I know what you mean, but we should be crystal-clear that if

• Team A has an AI that understands humans and the human world and human technology and the human economy and human language,

• Team B has an AI that doesn’t know anything about any of those things

…then Team B’s AI is suffering a “performance penalty” in the trivial sense that there are many tasks which Team A’s AI can perform and Team B’s AI can’t. Agree?

Then there are a few schools of thought:

• “Team B’s AI is so useless as to be completely irrelevant—we can just ignore it in our discussions.”

• “Team B’s AI can do important real-world things of the ‘pivotal act’ variety.”

• “Team B’s AI is not directly useful, but can be used for safely testing possible solutions to the Alignment Problem.”

I would mostly agree with the first bullet point, with a bit of the third—see here. I’m very skeptical of the second one. Proof assistants are neat and all, for example, but I don’t see them changing the world very much. There’s no proposed solution to the alignment problem that is stuck on an unproven mathematical conjecture, IMO. So what do we do with the boxed AI?

• I think human technology is a reasonably safe area, similar to how Go is safe even at arbitrary capability levels, and we might be able to get away with the human economy, but the other areas are likely too dangerous to work in.

My big source of optimism is that, while I think politics and social areas aren’t very safe, I do believe that there are actually safe areas that are crucially non-political and not overly reliant on human models, and I think technology/STEM development is exactly this.

Also, it allows us to iterate on the problem of AI Alignment by making sure we can test its alignment safely.

• Davidad proposes “we are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI was built by octopus minds made of organosilicon where atoms obey the Bohr model.” You write “we might be able to get away with [the AI knowing about] the human economy”. These seem very contradictory to me, right? The human economy ≠ the organosilicon octopus mind economy, right? For example, the chemical industry would look rather different if atoms obeyed the Bohr model. The clothing industry would look different if we lived underwater. Etc.

Normally when people say something like “technology/STEM development” in the context of AGI, I would think of something like “come up with a way to make faster CPUs”. The AI would need to know how current fabs work in great detail, and what other technology and tools are easily available to use to attack the problem, and how compilers work, presumably with code samples and so on, and also an English-language description of what would constitute a functional CPU (ideally with English-language back-and-forth discussion to navigate the trade-space). For real AGI, I would expect it to be able to also design the new advanced CPU factories, purchase the land, negotiate with suppliers, apply for permits, etc. etc.

OK, so how much of that can be distilled into tasks which would be exactly the same in an alternate universe with different laws of physics? I think a few extremely narrow slices at best—for example maybe some narrow-AI optimization problems along the lines of this. But meanwhile an unboxed AGI can do the entire thing.

So you really can’t say “no performance penalty”, right?

• So you really can’t say “no performance penalty”, right?

Yeah, that might be a big concern with my approach, but it’s still worth trying to get empirical data on how much we can get done with a boxed AI model.

• Sandboxing an AI is trivial; you can just not run it.

• My claim is roughly the following:

1. Removing non-deterministic APIs and making the AI’s execution fully deterministic is possible with very little or zero performance penalty, thus sandboxing the AI.

2. The bits that remain accessible to the AI aren’t sufficient to break the sandbox, because they’re totally random, and the No Free Lunch theorem prohibits learning on entirely random data.

3. Thus, sandboxing arbitrarily powerful AI is possible, or equivalently there is a positive answer to whether we can prevent the AI from breaking the box, no matter how powerful.

That’s a non-trivial claim, since the AI is still running here.

• It’s a non-trivial way to sandbox it, but I am not sure what value that gives us which the trivial way of sandboxing doesn’t give us.

• I think the key problem is that this sandboxing won’t work for anything with a large language model as a component.

• Not really? Conditional on us filtering out certain information, the real question is how the large language model could break the sandbox so that it can learn other facts. Why would that actually happen?

• Filtering out all the implicit information would be really, really hard.