Safer sandboxing via collective separation

Epistemic status: speculative brainstorming. Follow-up to this post on AGIs as collectives. Note that I’ve changed the term population AGI to collective AGI for consistency with Bostrom’s use in Superintelligence.

One safety measure we might want to use on AGIs is (sand)boxing: limiting their access to the real world, so they have very few channels by which to influence it. However, we also want to deploy them on real-world tasks without them using that opportunity to affect the world in unintended ways. Realistically, even if we can maintain high security in a few sandbox environments, I doubt we’ll be able to make every single deployment environment secure against the full intellectual powers of a superintelligent AGI. But this problem might be easier in the case of a collective AGI.

Broadly speaking, my hypothesis is that the tasks which require the sort of high-level intelligence and strategic capability we're worried about are also the ones which require the least interaction with the world. Discovering new scientific theories or making major strategic decisions can occur in a sandbox which only allows brief text outputs. Being a good personal assistant or web administrator, by contrast, might require frequent access to the internet, but benefit much less from superhuman capabilities. So if we can easily adjust the level of intelligence an AGI is able to apply to any given task, we might be able to significantly reduce the risks it poses without much reducing its economic usefulness. On tasks which require interacting with the world in flexible ways, we could limit its intelligence greatly; on tasks which require very high intelligence, we could limit its ability to interact with the world.

In general, for most possible AGI architectures, this sort of graduated restriction of intelligence seems quite difficult to achieve. For example, we couldn't just remove parts of a standard neural network without needing to retrain it significantly. Limiting the AGI's thinking time or access to memory might be possible, but that would likely reduce its capabilities on all tasks, since even low-level tasks can require detailed calculations and long-term memory. By contrast, it's easy to reduce the capabilities of a collective AGI: simply deploy only a subset of its members, and bottleneck the communication between different subsets. This makes deployment on a range of real-world tasks more feasible. Even if each individual member of a collective AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols which an individual AI couldn't break, as long as it hadn't been able to communicate with the rest of the collective beforehand.
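To make this concrete, here is a minimal sketch of the kind of deployment policy I have in mind. Everything in it (the Task fields, the member counts, the channel names) is invented for illustration rather than proposed in earnest; the point is just that subset size and interaction channels are two dials we can turn in opposite directions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical description of a task, used to pick deployment limits."""
    name: str
    needs_world_access: bool        # requires flexible interaction with the world?
    needs_high_intelligence: bool   # benefits a lot from the full collective?

def deployment_limits(task: Task) -> dict:
    """Toy policy: trade subset size off against interaction channels."""
    if task.needs_high_intelligence and not task.needs_world_access:
        # e.g. deriving scientific theories: large subset, tiny output channel.
        return {"max_members": 10_000, "channels": ["short_text_output"]}
    if task.needs_world_access and not task.needs_high_intelligence:
        # e.g. personal assistant work: broad channels, but only a few members.
        return {"max_members": 3, "channels": ["internet", "email", "text"]}
    # Remaining cases include the dangerous one (claims to need both
    # high intelligence and broad world access): refuse by default.
    return {"max_members": 0, "channels": []}

if __name__ == "__main__":
    print(deployment_limits(Task("derive new physics", False, True)))
    print(deployment_limits(Task("run a power plant", True, False)))
```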

Given the claims I’ve made so far, we might hope that a small subset of a collective AGI could do nearly as well as the full collective on tasks which require a lot of interaction with the real world, while being significantly safer. In the next section I’ll discuss this proposal in more detail. First, though, I want to be very clear that this is not intended as a primary solution to the safety problem; ideally we’ll have much more reliable ways to ensure AGI is aligned. However, it’s worth thinking about ways to “move the needle” by adding additional layers of security. I think there’s a reasonable chance that hacky solutions like the one above will make a difference (especially because of my uncertainty about what AGI will actually look like).

A sketch of collective separation

In my previous post I outlined two ways we might train a collective AGI to become more intelligent than humanity: either we train a single AGI by some means, then make many copies of it and allow them to communicate; or we start off with a multi-agent collective and train its members in a way which incentivises both competition and cooperation (e.g. each is rewarded for acquiring more of a scarce resource).

Either way, let’s consider how we might actually want to deploy the resulting AGI. For example, suppose we want to use it to solve the world’s energy problems. The entire collective AGI might be responsible for deriving the physical principles which would allow more efficient power sources, in near-total isolation from the world. Different subsets of that collective AGI could then separately work on creating actual designs for new power plants—which would involve testing prototypes and evaluating construction feasibility, in closer contact with humans. Even smaller subsets could then be used to oversee the daily running of those power plants—a task which would require the ability to make autonomous real-time interventions. The key claim is that a small subset of the collective AGI would be nearly as good at managing one power plant as the whole collective, while also being much safer.
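As a toy illustration of this tiered structure, deployment configurations might look something like the following. The fractions and channel names are made up; only the shape of the trade-off matters.

```python
# Hypothetical deployment tiers for the energy example above. Larger fractions
# of the collective get narrower channels and less contact with the world.
DEPLOYMENT_TIERS = {
    "derive_new_physics": {
        "fraction_of_collective": 1.0,          # the whole collective
        "allowed_io": ["short_text_answers"],   # near-total isolation
        "human_contact": "none",
    },
    "design_power_plants": {
        "fraction_of_collective": 0.05,         # a modest subset
        "allowed_io": ["design_documents", "simulation_results"],
        "human_contact": "regular review by engineers",
    },
    "operate_power_plant": {
        "fraction_of_collective": 0.001,        # a handful of members
        "allowed_io": ["plant_sensors", "plant_actuators"],
        "human_contact": "continuous human oversight",
    },
}
```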

More specifically, when I talk about extracting a subset of the collective AGI, I mean making copies of a set of individual AIs which have knowledge about the relevant topic, and then separating those copies from the rest of the collective. The extent to which the copies undergo further training to fine-tune them for their new task will vary. I expect that for tasks like designing power plants, we’ll have too little data to optimise them specifically for that task; members of the subset will need to use their existing general intelligence, like humans do. Perhaps the requirements for overseeing daily operation of a plant will be well-specified enough to gather good training data, though. And as I discuss in my previous post, fine-tuning will likely be much easier for smaller subsets of the collective AGI.

Information flow and version control

One important question is how much information should be allowed to flow between the parts of the collective AGI used for different tasks. This is a crucial point, and a cause for concern. In the power plant example above, we might imagine that the full collective AGI designs a power plant with deliberate security flaws well in advance, then shares the blueprint with all of its members. A small subset of the collective could then recite that design when asked to come up with a real-world implementation; and an even smaller subset could exploit the pre-specified flaws when deployed.

How could we prevent this type of problem? One option is strict version control: the small subsets could be drawn from a version of the collective AGI which hadn’t yet thought about how to design flawed power plant security systems. This seems very difficult to enforce if the AGI has much information about humans (e.g. read-only access to the internet), because it could then predict the most likely ways it will be used, and make plans for how to subvert them. More promising is the idea that the full AGI should never be given detailed information about human society—that it should only ever be used for abstract problems. Subsets could be used for tasks involving real-world data, but then immediately shut down without ever having the chance to report back to the full AGI. This would likely impair performance, but plausibly not by very much, if the subsets are fast learners and inherit the full AGI’s meta-level knowledge about how to solve problems.
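One way to picture the version-control idea: subsets are only ever spawned from a snapshot of the collective taken before it saw real-world data, and their state is thrown away rather than merged back. The sketch below is a loose analogy in Python, with every class and field invented for illustration.

```python
import copy

class CollectiveSnapshot:
    """Hypothetical frozen version of the collective at some point in training."""
    def __init__(self, members, has_seen_real_world_data: bool):
        self.members = members                                # list of member models
        self.has_seen_real_world_data = has_seen_real_world_data

def run_separated_subset(snapshot, member_ids, task):
    """Copy a few members out of a pre-real-world snapshot, run the task,
    then discard the copies without reporting anything back."""
    if snapshot.has_seen_real_world_data:
        raise ValueError("Subsets should only be drawn from versions that never "
                         "saw detailed information about human society.")
    subset = [copy.deepcopy(snapshot.members[i]) for i in member_ids]
    result = task(subset)   # real-world data is only ever seen by the copies
    del subset              # no merge back into the snapshot: one-way information flow
    return result
```

The important design choice is the one-way information flow: real-world data reaches the copies, but nothing the copies learn ever reaches the snapshot they were drawn from.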

Implementation

Could we realistically expect widespread deployment of collective AGI without anyone giving (a copy of) a full collective access to an insecure task, or an individual member unrestricted access to the internet? On the face of it, this does seem unlikely. But it’s less difficult if the source code of the AGI is kept confidential (which aligns well with economic incentives). Widespread deployment might then involve allowing a range of groups to rent access to copies of that AGI running on the cloud.

In this situation, collective boxing could be implemented by limiting the permissible tasks depending on how much of the collective AGI was requested. The full collective could only be rented to do batched tasks with short outputs; smaller subsets could face looser restrictions. And even if several different groups develop AGIs, boxing restrictions could still be enforced in a scalable way as long as those groups coordinate to run their AGIs on similar (or the same) cloud platforms.
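A cloud provider's enforcement of this could, very roughly, take the shape of a policy check like the one below. The thresholds and task categories are invented purely to illustrate the idea that larger requested fractions of the collective buy narrower permissions.

```python
def rental_request_allowed(fraction_requested: float,
                           task_type: str,
                           max_output_bytes: int) -> bool:
    """Toy cloud-side check: the more of the collective a customer requests,
    the narrower the permitted tasks and the shorter the permitted outputs."""
    if fraction_requested > 0.5:
        # Near-full collective: batched, offline tasks with short outputs only.
        return task_type == "batched" and max_output_bytes <= 10_000
    if fraction_requested > 0.05:
        # Mid-sized subsets: interactive use allowed, but only inside a sandbox.
        return task_type in ("batched", "sandboxed_interactive")
    # Small subsets: broad deployment, subject to standard monitoring.
    return True
```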