I’ve updated towards AI boxing being surprisingly easy

Specifically, I think that sandboxes like an improved WASM could ensure that, conditioned on careful data curation, side channels have zero probability of giving the AI any information beyond what we deliberately provide.

I.e., I would predict zero probability of an AI learning virology from a dataset if the humans never put that data in.

I also believe this can be done with no performance penalty (unlike a lot of security measures).

My argument has two steps:

  1. First, I’ll use davidad’s argument from the link below on how to create a sandbox (also known as a Cartesian Boundary) that prevents embedded computations from learning things through side channels.

Link here: https://www.lesswrong.com/posts/bqRD6MS3yCdfM9wRe/side-channels-input-versus-output

  2. Second, I’ll use a No Free Lunch argument to show why the bits that remain are insufficient to break the Cartesian Boundary/sandbox.

Removing nondeterminism

The core of the argument is that we can prevent side channels from leaking information by removing all sources of nondeterminism. Concretely, this requires removing clock access, networking APIs, and hardware randomness.
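To make this concrete, here is a minimal sketch in Python (purely illustrative, and not a real security boundary the way a hardened WASM runtime would be) of the property we want: the guest computation is a pure function of the curated inputs and a seed, with no clock, network, or hardware randomness to read. The names `DeterministicEnv` and `run_guest` are mine, not from davidad’s post.

```python
# Illustrative sketch only -- not a real sandbox. The point is that the guest
# computation is a pure function of (curated inputs, seed): no clock, no
# network, no hardware randomness, so re-running it reproduces the exact same
# trace and there is no nondeterministic side channel to read.
import random

class DeterministicEnv:
    """Everything the guest is allowed to observe."""
    def __init__(self, curated_inputs, seed=0):
        self.inputs = list(curated_inputs)   # carefully curated training data
        self.rng = random.Random(seed)       # seeded PRNG instead of os.urandom / time

def run_guest(guest_fn, env):
    # The guest receives the environment object and nothing else.
    return guest_fn(env)

def example_guest(env):
    # A toy "training step": deterministic given (inputs, seed).
    shuffled = env.inputs[:]
    env.rng.shuffle(shuffled)
    return shuffled

# Same inputs + same seed => bit-identical behavior on every run.
print(run_guest(example_guest, DeterministicEnv([3, 1, 4, 1, 5], seed=42)))
print(run_guest(example_guest, DeterministicEnv([3, 1, 4, 1, 5], seed=42)))
```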

Removing nondeterminism is davidad’s second condition, but what about the first? He gives it as:

  1. We are careful to only provide the training process with inputs that would be just as likely in, say, an alternate universe where AI was built by octopus minds made of organosilicon where atoms obey the Bohr model.

From davidad again, with an example:

Regarding part 1, self-play in Go is an example par excellence and this may be one reason why people tend to have a strong intuition that arbitrarily strong AlphaZero fails to kill you.
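To illustrate what “just as likely in an alternate universe” means in practice, here is a toy sketch (my own, not from davidad): self-play data for a simple game such as Nim depends only on the game’s rules and a seed, so an octopus civilization running the same code would generate bit-identical transcripts.

```python
# Toy sketch: self-play transcripts for the game of Nim, generated only from
# the game rules and a fixed seed. Nothing about our particular universe
# (physics, biology, history) can leak into this dataset.
import random

def self_play_nim(seed, heap=7):
    rng = random.Random(seed)
    moves = []
    player = 0
    while heap > 0:
        take = rng.randint(1, min(3, heap))   # random policy for both players
        moves.append((player, take))
        heap -= take
        player = 1 - player
    return moves  # the full game transcript

dataset = [self_play_nim(seed=i) for i in range(5)]
for game in dataset:
    print(game)
```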

A No Free Lunch argument for why the remaining bits from random errors are incapable, even in theory, of breaking the sandbox/​Cartesian Boundary

Basically, the argument comes from Steven Byrnes’s explainer on the No Free Lunch theorem: if every dataset is equally likely (that is, the world looks entirely random to you), you cannot learn anything from it. Once we remove the channels of nondeterminism described above, all that is left is random error: roughly 1-in-20-billion hardware faults, cosmic-ray bit flips, and other random failures. All of these bits are useless for breaking the sandbox, precisely because they are random.

Here’s Steven Byrnes’s explainer on the topic:

https://www.lesswrong.com/posts/Dyt2TDdMGHDkXPcpp/the-no-free-lunch-theorem-for-dummies
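As a small sanity check of the “random bits are useless” point (my own sketch, not from Byrnes’s post): bits that are statistically independent of a fact about the outside world carry essentially zero mutual information about it, so no learner can extract that fact from them.

```python
# Sketch: uniformly random error bits are independent of any "secret" fact
# about the outside world, so their estimated mutual information with that
# fact is ~0 bits -- there is nothing for a learner to extract from them.
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

rng = random.Random(0)
secret_bits = [rng.randint(0, 1) for _ in range(100_000)]   # facts the AI "wants" to learn
error_bits  = [rng.randint(0, 1) for _ in range(100_000)]   # cosmic rays / hardware faults

# Prints a value near zero (only finite-sample noise remains).
print(mutual_information(list(zip(error_bits, secret_bits))))
```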

Appendix

Surprisingly, in order to optimize communication complexity, massive machine-learning training runs already forgo at least some aspects of nondeterminism, which is a good sign.

Link here: https://arxiv.org/pdf/1910.02054.pdf

And from davidad on how to make sure everything is deterministic, plus some important information specific to machine learning:

Enforcing determinism for machine learning will require more effort because of the involvement of GPUs. One must ensure that code is executed deterministically on the GPU as well as on the CPU, and that GPU/​CPU concurrency is appropriately synchronized (to enforce deterministic dataflow). But I claim this is still eminently doable, and with no noticeable performance penalty versus contemporary best practices for scalable ML, if an AI lab understood what determinism is and cared about it even a little bit.
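For concreteness, here is roughly what the framework-level part of that looks like in PyTorch today (assuming PyTorch, which the quote does not specify). These are PyTorch’s documented reproducibility switches; they remove the common sources of run-to-run GPU nondeterminism, though they do not by themselves give the fully deterministic CPU/GPU dataflow davidad describes.

```python
# Framework-level determinism switches in PyTorch. These pin down seeded RNG
# and force deterministic GPU kernels; they remove the most common sources of
# run-to-run variation but are not a complete deterministic-dataflow story.
import os

# cuBLAS needs this set before CUDA work starts for its GEMMs to be deterministic.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)                       # seeds the CPU and all GPU generators
torch.use_deterministic_algorithms(True)   # error out if an op lacks a deterministic kernel
torch.backends.cudnn.benchmark = False     # no runtime autotuning (it can pick different kernels)

# Data loading: single process and a seeded shuffle for a reproducible batch order.
data = TensorDataset(torch.randn(128, 16), torch.randint(0, 2, (128,)))
loader = DataLoader(data, batch_size=32, shuffle=True,
                    generator=torch.Generator().manual_seed(0), num_workers=0)
```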