Competitive safety via gradated curricula

Epistemic status: brainstorming some speculative research directions. Not trying to thoroughly justify the claims I’m making.

One way to think about the AI safety problem: there’s a spectrum of methods which each represent a different tradeoff between safety and ease of training an AGI, and unfortunately the two are anticorrelated. In particular, consider four regimes in which the bulk of training might occur (perhaps with additional fine-tuning afterwards):

  1. Training a language model to answer questions correctly.

  2. Training an RL agent on a range of limited tasks (e.g. games).

  3. Training an RL agent on general tasks in large-scale simulations for long time periods.

  4. Training an RL agent in competitive multi-agent environments.

I claim (but won’t fully defend here) that these are in order from safest but most difficult, to easiest but most dangerous:

  1. Regime 1 will produce a question-answering system which has no experience taking actions in the world, and which may not be goal-directed at all. But many researchers expect that it’ll be much easier to create an AGI which can answer difficult questions by training it to interact with a simulated world, so that its concepts are “grounded” by experience.

  2. Regime 2 is likely to produce an agent whose goals are bounded, and whose concepts are grounded; but which might only do well on the specific tasks it has been trained on. If so, building AGI in this regime would require a very sophisticated curriculum, if it’s possible at all.

  3. Regime 3 provides a rich environment for an agent to learn quite general skills and concepts. However, now the agent will also be rewarded for developing large-scale goals, which might make it dangerous.

  4. Regime 4 additionally provides an “autocurriculum” via competition, the training signal from which could accelerate the development of general intelligence (as it did in humans). However, the agent could learn harmful skills and motivations (such as deception, manipulation or aggression) from competing with other agents, which it might then apply to interactions with humans.

This is a problem—but it’s also an opportunity. If we accept the claims I’ve made about this spectrum, then it might be much easier to train a relatively safe and non-agentic AGI if we start training in less safe regimes, and then gradually transition the training of that AGI into safer regimes. More specifically, I’m proposing a training curriculum in which an agent is trained in regime 4 until it displays a given level of competence; then moved to regime 3 until it again displays a significant amount of progress; then regime 2; then regime 1. The specific regimes used are not vital; some could be removed or replaced by others that I haven’t thought of. Neither is it essential to keep using exactly the same agent; it’d need to be retrained to use different observation and action spaces, and perhaps have its architecture modified during transitions. (In particular, it might be useful to incorporate a pre-trained language model at some point to kick-start its understanding of language.) The main point is that as training progresses, we increasingly use safer training regimes even though we expect it to be much more difficult to train an AGI solely using those regimes.
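To make the shape of this curriculum concrete, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in rather than a real training setup: the `Regime` class, the `train_step` and `competence` callables, the toy "skill" counter and the thresholds are all assumptions chosen only to show the control flow (train in the least safe regime until a competence threshold is reached, then move to the next safer one).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Agent = Dict[str, int]  # toy stand-in for a model; a real agent would be a network


@dataclass
class Regime:
    name: str
    train_step: Callable[[Agent], Agent]  # hypothetical: one unit of training
    competence: Callable[[Agent], int]    # hypothetical: evaluation score
    threshold: int                        # competence required before moving on


def run_curriculum(
    agent: Agent, regimes: List[Regime], max_steps: int = 1000
) -> Tuple[Agent, List[Tuple[str, int]]]:
    """Train in each regime in order, moving on once its threshold is met."""
    history = []
    for regime in regimes:
        steps = 0
        while regime.competence(agent) < regime.threshold and steps < max_steps:
            agent = regime.train_step(agent)
            steps += 1
        history.append((regime.name, steps))
    return agent, history


def make_toy_regime(name: str, threshold: int) -> Regime:
    # Toy assumption: every training step adds one unit of "skill",
    # and competence is just the current skill level.
    return Regime(
        name=name,
        train_step=lambda a: {"skill": a["skill"] + 1},
        competence=lambda a: a["skill"],
        threshold=threshold,
    )


# Ordered from least safe (regime 4) to safest (regime 1), as proposed above.
regimes = [
    make_toy_regime("regime 4: competitive multi-agent", 4),
    make_toy_regime("regime 3: large-scale simulation", 6),
    make_toy_regime("regime 2: limited tasks", 7),
    make_toy_regime("regime 1: question answering", 8),
]

agent, history = run_curriculum({"skill": 0}, regimes)
for name, steps in history:
    print(f"{name}: {steps} steps")
```

The sketch leaves out the hard parts: retraining for new observation and action spaces, modifying the architecture between regimes, and deciding what "competence" should mean in each regime.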

The key hypothesis is that it’s not uniformly harder to train AGIs in the safer regimes—rather, it’s primarily harder to get started in those regimes. Once an AI reaches a given level of intelligence, then transitioning to a safer regime might not slow down the rate at which it gains intelligence very much—but might still decrease the optimisation pressure in favour of that AI being highly agentic and pursuing large-scale goals.

I have some intuitions in favour of this hypothesis (although I’m still pretty uncertain). Here are three:

  1. Language-based tasks or limited-domain tasks can be made almost arbitrarily hard, so we won’t run out of ways to challenge the agent (although this might require more work than continuing training in other regimes).

  2. Exploration is much easier for an agent that’s already quite intelligent, and so reward sparsity matters less. This would also make it easier to manually design tasks, since a wider range of difficulties would be acceptable.

  3. Once an agent has had some amount of interaction with a (simulated or real) world, it can interpret language as referring to objects in that world. Whereas without any interaction, it seems far more difficult to develop a solid understanding of objects and concepts. (Note that although quite a few AI researchers seem to believe that such grounding is important, I haven’t seen much justification for it in the context of modern machine learning. So I wanted to flag this as a particularly intuition-driven claim.)

One example which might help illustrate the overall hypothesis is that of Helen Keller, who developed a sophisticated understanding of the world (and wrote 12 books), despite becoming deaf and blind before the age of 2. Compare two options: either trying to train an AGI (from random initialisation to human-level intelligence) which receives the inputs that a typical human has, or else trying to do the same thing giving it only the inputs that Helen had (primarily touch, and language initially communicated via taps on her hand). I think the latter would be significantly more difficult, because having much less data imposes many fewer constraints on what the structure of the world could be like.* And yet Helen did not learn many times more slowly than most people; in fact she earned a degree from Radcliffe College (Harvard’s affiliated women’s college) in four years. My explanation for this: the learning problem Helen faced was much harder than what most of us face, but because her brain architecture had already been “trained” by evolution, she could make use of those implicit priors to match, and then surpass, most of her contemporaries.

The second crucial hypothesis is that the AGI doesn’t also retain dangerous characteristics from earlier training regimes. Again, I’m very uncertain about this hypothesis—it might be that, once the system has started training in a highly competitive environment, it will continue to have competitive and highly agentic motivations until we actively prevent it from doing so. Yet there are ways to mitigate this. In particular: to the extent that we can identify which part of the AGI is responsible for goal-directed cognition, we can remove that before continuing to train the rest of the network in the new regime. This would rely on interpretability techniques—but techniques which could be much cruder than those required to understand what an AGI is thinking or planning at any given moment. As an illustration, I think we already understand the human brain well enough to identify the parts most responsible for goal-directed cognition and motivation (although I don’t know much neuroscience, so corrections on this point would be very welcome). After removing the analogous sections of an AGI, it’d need to develop a new motivational system in subsequent training regimes—but I’m hoping that those later regimes only incentivise limited goal-directedness, towards bounded goals.

Right now, I think it’s pretty likely that there’ll be some transitions between different training regimes when developing AGI—e.g. for language models it’s already common to start with unsupervised pre-training, and then do supervised or RL fine-tuning. But note that this transition goes in the opposite direction (from more limited tasks to larger-scale tasks) compared with the ones I discussed above. So my proposal is a little counterintuitive; but while I’m not confident that it’ll be useful (perhaps because I’m still quite confused about agency and goal-directedness), I think it’s worth evaluating. One empirical test which could already be done is to see whether pre-training in a simulated environment is advantageous for developing better language models. Another, which might only be viable in a few years, is to investigate how separable goal-directed cognition is from world-modelling and other aspects of intelligence, in deep reinforcement learners with a range of neural architectures.


* Given that I have no experience of being deaf or blind, and have not looked into it very much, my intuitions on this point are not very well-informed; so I wanted to explicitly flag it as quite speculative.