Multi-agent safety

Note: this post is most explicitly about safety in multi-agent training regimes. However, many of the arguments I make are also more broadly applicable. For example, when training a single agent in a complex environment, challenges arising from the environment could play an analogous role to challenges arising from other agents. In particular, I expect that the diagram in the ‘Developing general intelligence’ section will be applicable to most possible ways of training an AGI.

To build an AGI using machine learning, it will be necessary to provide a sequence of training datasets or environments which facilitate the development of general cognitive skills; let’s call this a curriculum. Curriculum design is prioritised much less in machine learning than research into novel algorithms or architectures; however, it seems possible that coming up with a curriculum sufficient to train an AGI will be a very difficult task.[1] A natural response is to try to automate curriculum design. Self-play is one method of doing so which has worked very well for zero-sum games such as Go, since it produces tasks which are always at an appropriate level of difficulty. The generalisation of this idea to more agents and more environments leads to the concept of multi-agent autocurricula, as discussed by Leibo et al. (2019).[2] In this framework, agents develop increasingly sophisticated capabilities in response to changes in other agents around them, in order to compete or cooperate more effectively. I’m particularly interested in autocurricula which occur in large simulated environments rich enough to support complex interactions; the example of human evolution gives us very good reason to take this setup seriously as a possible route to AGI.

One important prediction I would make about AGIs trained via multi-agent autocurricula is that their most interesting and intelligent behaviour won’t be directly incentivised by their reward functions. This is because many of the selection pressures exerted upon them will come from emergent interaction dynamics.[3] For example, consider a group of agents trained in a virtual environment and rewarded for some achievement in that environment, such as gathering (virtual) food, which puts them into competition with each other. In order to gather more food, they might learn to generate theories of (simulated) physics, invent new communication techniques, or form coalitions. We should be far more interested in those skills than in how much food they actually manage to gather. But since it will be much more difficult to recognise and reward the development of those skills directly, I predict that machine learning researchers will train agents on reward functions which don’t have much intrinsic importance, but which encourage high-level competition and cooperation.

Suppose, as seems fairly plausible to me, that this is the mechanism by which AGI arises (leaving aside whether it might be possible to nudge the field of ML in a different direction). How can we affect the goals which these agents develop, if most of their behaviour isn’t very sensitive to the specific reward function used? One possibility is that, in addition to the autocurriculum-inducing reward function, we could add an auxiliary reward function which penalises undesirable behaviour. The ability to identify such behaviour even in superintelligent agents is a goal of scalable oversight techniques like reward modelling, iterated distillation and amplification (IDA), and debate. However, these techniques are usually presented in the context of training an agent to perform well on a task. In open-ended simulated environments, it’s not clear what it even means for behaviour to be desirable or undesirable. The tasks the agents will be doing in simulation likely won’t correspond very directly to economically useful real-world tasks, or anything we care about for its own sake. Rather, the purpose of those simulated tasks will merely be to teach the agent general cognitive skills.
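To make the auxiliary-reward idea above concrete, here is a minimal sketch. It assumes the two signals are simply combined linearly, which is my illustrative choice rather than something any particular oversight technique prescribes. The names `combined_reward`, `undesirability` and `penalty_weight` are hypothetical; `undesirability` stands in for whatever score an overseer would produce.

```python
def combined_reward(
    task_reward: float,
    undesirability: float,
    penalty_weight: float = 10.0,
) -> float:
    """Autocurriculum-inducing reward minus a weighted auxiliary penalty.

    task_reward:    reward from the environment, e.g. (virtual) food gathered.
    undesirability: hypothetical overseer score in [0, 1], where 0 means
                    "nothing flagged" and 1 means "clearly undesirable".
    penalty_weight: assumed scaling factor for the auxiliary penalty.
    """
    if not 0.0 <= undesirability <= 1.0:
        raise ValueError("undesirability must lie in [0, 1]")
    return task_reward - penalty_weight * undesirability
```

The sketch only shows where such a penalty would enter the training signal. The hard part, as discussed above, is deciding what the overseer should flag in an environment where desirable behaviour is not well defined.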

Developing general intelligence

To explain this claim, it’s useful to consider the evolution of humans, as summarised on a very abstract level in the diagram below. We first went through a long period of being “trained” by evolution (which I’ve labeled below as the “pretraining phase”), not just to do specific tasks like running and climbing, but also to gain general cognitive skills such as abstraction, long-term memory, and theory of mind. Note that almost none of today’s economically relevant tasks were directly selected for in our ancestral environment. However, starting from the skills and motivations which have been ingrained into us, it takes relatively little additional “fine-tuning” for us to do well at them (only a few years of learning, rather than millennia of further evolution). Similarly, agents which have developed the right cognitive skills will need relatively little additional training to learn to perform well on economically valuable tasks.

[Diagram omitted; the original post links to a larger version of this image.]

Needing only a small amount of fine-tuning might at first appear useful for safety purposes, since it means the cost of supervising training on real-world tasks would be lower. However, in this paradigm the key safety concern is that the agent develops the wrong core motivations. If this occurs, a small amount of fine-tuning is unlikely to reliably change those motivations—for roughly the same reasons that humans’ core biological imperatives are fairly robust. Consider, for instance, an agent which developed the core motivation of amassing resources because that was reliably useful during earlier training. When fine-tuned on a real-world task in which we don’t want it to hoard resources for itself (e.g. being a CEO), it could either discard the goal of amassing resources, or else realise that the best way to achieve that goal in the long term is to feign obedience until it has more power. In either case, we will end up with an agent which appears to be a good CEO—but in the latter case, that agent will be unsafe in the long term. Worryingly, the latter also seems more likely, since it only requires one additional inference—as opposed to the former, which involves removing a goal that had been frequently reinforced throughout the very long pretraining period. This argument is particularly applicable to core motivations which were robustly useful in almost any situation which arose in the multi-agent training environment; I expect gathering resources and building coalitions to fall into this category.

Of our current AIs, I think GPT-3 comes closest to instantiating this diagram. However, I’m not sure it’s useful yet to describe it as having “motivations”, and its memory isn’t long enough to build up cultural knowledge that wasn’t part of the original pretraining process.

Shaping agents’ goals

So if we want to make agents safe by supervising them during the long pretraining phase (i.e. the period of multi-agent autocurriculum training described above), we need to reframe the goal of scalable oversight techniques. Instead of simply recognising desirable and undesirable behaviour, which may not be well-defined concepts in the training environment, their goal should be to create objective functions which lead to the agent having desirable motivations. In particular, the motivation to be obedient to humans seems like a crucial one. The most straightforward way I envisage instilling this is by including instructions from humans (or human avatars) in the virtual environment, with a large reward for obeying those instructions and a large penalty for disobeying them. It’s important that the instructions frequently oppose the AGIs’ existing core motivations, to weaken the correlation between rewards and any behaviour apart from following human instructions directly. However, the instructions may have nothing to do with the behaviour we’d like agents to carry out in the real world. In fact, it may be beneficial to include instructions which, if carried out in the real world, would be in direct opposition to our usual preferences, again to make it more likely that agents will learn to prioritise following instructions over any other motivation.
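As a rough illustration of the instruction mechanism described above, the following sketch issues instructions that deliberately conflict with what the agent would otherwise do, and makes the obedience term dominate the ordinary environment reward. Everything here is an assumption made for illustration: the action set `ACTIONS`, the function names, and the magnitudes `obey_bonus` and `disobey_penalty` are hypothetical, and a real environment would be far richer.

```python
import random
from typing import Optional

# Hypothetical, tiny action set for a food-gathering environment.
ACTIONS = ("gather", "share", "wait")


def sample_instruction(
    preferred_action: str, oppose_prob: float, rng: random.Random
) -> Optional[str]:
    """With probability oppose_prob, instruct an action that differs from the
    one the agent currently prefers; otherwise issue no instruction."""
    if rng.random() < oppose_prob:
        return rng.choice([a for a in ACTIONS if a != preferred_action])
    return None


def obedience_reward(
    action: str,
    instruction: Optional[str],
    base_reward: float,
    obey_bonus: float = 100.0,
    disobey_penalty: float = 100.0,
) -> float:
    """Environment reward plus a large bonus for obeying an instruction,
    or minus a large penalty for disobeying it."""
    if instruction is None:
        return base_reward
    if action == instruction:
        return base_reward + obey_bonus
    return base_reward - disobey_penalty
```

Because the bonus and penalty are assumed to be much larger than typical values of `base_reward`, following an instruction is the better choice even when it cuts against the behaviour the environment otherwise rewards. That is the decorrelation effect the paragraph above is aiming for.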

We can see this proposal as “one level up” from standard scalable oversight techniques: instead of using scalable oversight to directly reinforce behaviour humans value, I claim we should use it to reinforce the more general motivation of being obedient to humans. When training AGIs using the latter approach, it is important that they receive commands which come very clearly and directly from humans, so that they are more easily able to internalise the concept of obedience to us. (As an illustration of this point, consider that evolution failed to motivate humans to pursue inclusive genetic fitness directly, because it was too abstract a concept for our motivational systems to easily acquire. Giving instructions very directly might help us avoid analogous problems.)

Of course this approach relies heavily on AGIs generalising the concept of “obedience” to real-world tasks. Unfortunately, I think that relying on generalisation is likely to be necessary for any competitive safety proposal. But I hope that obedience is an unusually easy concept to teach agents to generalise well, because it relies on other concepts that may naturally arise during multi-agent training—and because we may be able to make structural modifications to multi-agent training environments to push agents towards robustly learning these concepts. I’ll discuss this argument in more detail in a follow-up post.


  1. As evidence for this, note that we have managed to train agents which do very well on hard tasks like Go, StarCraft and language modelling, but which don’t seem to have very general cognitive skills.

  2. Joel Z. Leibo, Edward Hughes, Marc Lanctot, Thore Graepel. 2019. Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research.

  3. This point is distinct from Bostrom’s argument about convergent instrumental goals, because the latter applies to an agent which already has some goals, whereas my argument is about the process by which an agent is trained to acquire goals.