Eight claims about multi-agent AGI safety

There are quite a few arguments about how interactions between multiple AGIs affect risks from AGI development. I’ve identified at least eight distinct but closely related claims which it seems worthwhile to disambiguate. I’ve split them up into four claims about the process of training AGIs, and four claims about the process of deploying AGIs; after listing them, I go on to explain each in more detail. Note that while I believe that all of these ideas are interesting enough to warrant further investigation, I don’t currently believe that all of them are true as stated. In particular, I think that so far there’s been little compelling explanation of why interactions between many aligned AIs might have catastrophic effects on the world (as discussed in point 7).

Claims about training

1. Multi-agent training is one of the most likely ways we might build AGI.

2. Multi-agent training is one of the most dangerous ways we might build AGI.

3. Multi-agent training is a regime in which standard safety techniques won’t work.

4. Multi-agent training allows us to implement important new safety techniques.

Claims about deployment

5. We should expect the first AGIs to be deployed in a world which already contains many nearly-as-good AIs.

6. We should expect AGIs to be deployed as multi-agent collectives.

7. Lack of coordination between multiple deployed AGIs is a major source of existential risk.

8. Conflict between multiple deployed AGIs risks causing large-scale suffering.

Details and arguments

1. Multi-agent training is one of the most likely ways we might build AGI.

The core argument for this thesis is that multi-agent interaction was a key driver of the evolution of human intelligence, because it promoted both competition and cooperation. Competition between humans provides a series of challenges which are always at roughly the right level of difficulty; Leibo et al. (2019) call this an autocurriculum. Autocurricula were crucial for training sophisticated reinforcement learning agents like AlphaGo and OpenAI Five; it seems plausible that they will also play an important role in training AGIs. Meanwhile, the usefulness of cooperation led to the development of language, which plays a core role in human cognition; and the benefits of cooperatively sharing ideas allowed the accumulation of human cultural skills and knowledge more generally.

2. Multi-agent training is one of the most dangerous ways we might build AGI.

Humans have skills and motivations (such as deception, manipulation and power-hungriness) which would be dangerous in AGIs. It seems plausible that the development of many of these traits was driven by competition with other humans, and that AGIs trained to answer questions or do other limited-scope tasks would be safer and less goal-directed. I briefly make this argument here.

3. Multi-agent training is a regime in which standard safety techniques won’t work.

Most approaches to safety rely on constructing safe reward functions. But Ecoffet et al. (2020) argue that “open-ended” environments give rise to incentives which depend on reward functions in complex and hard-to-predict ways. Open-endedness is closely related to self-play (which was used to train AlphaGo) and multi-agent environments more generally. When a task involves multiple agents, those agents might learn many skills that are not directly related to the task itself, but instead related to competing or cooperating with each other. For example, compare a language model like GPT-3, which was directly trained to output language, with the evolution of language in humans: evolution only selected us for increased genetic fitness, but we developed language skills because they were (indirectly) helpful for that.

Furthermore, as I point out here, it’s not even clear what “good behaviour” would actually look like in such environments, since they don’t necessarily contain tasks corresponding directly to things we’d like AIs to do in the real world. And fine-tuning on real-world tasks may not be sufficient to override dangerous motivations acquired during extensive multi-agent training.

4. Multi-agent training allows us to implement important new safety techniques.

The most central example of a safety technique which relies on multi-agent environments is probably the work done by Gillian Hadfield and others on learning group-level norms. More generally, CHAI’s concept of Assistance Games frames the machine learning training process as an interactive game played between humans and AIs, to better allow humans to guide AI behaviour.

I’ve also written about some tentative ideas for how to select for obedience in multi-agent environments.

5. We should expect the first AGIs to be deployed in a world which already contains many nearly-as-good AIs.

Paul Christiano defends this thesis as follows:

  • Lots of people will be trying to build powerful AI.

  • For most X, it is easier to figure out how to do a slightly worse version of X than to figure out how to do X.

  • The worse version may be more expensive, slower, less reliable, less general… (Usually there is a tradeoff curve, and so you can pick which axes you want the worse version to be worse along.)

  • If many people are trying to do X, and a slightly worse version is easier and almost-as-good, someone will figure out how to do the worse version before anyone figures out how to do the better version.

Robin Hanson also argues that progress in AI will be widely distributed, and not very “lumpy”. He discusses this argument here, in part by summarising the lengthy AI foom debate.

6. We should expect AGIs to be deployed as multi-agent collectives.

I discuss this hypothesis here, building on concepts introduced by Bostrom. Summary: after training AGI, there will be strong incentives to copy it many times to get it to do more useful work. If that work involves generating new knowledge, then putting copies in contact with each other to share that knowledge would also increase efficiency. This would be easier if they had already been trained to collaborate; but even if not, their general intelligence should allow them to learn to work together. And so, one way or another, I expect that we’ll eventually end up dealing with a “collective” of AIs, which we could also think of as a single “collective AGI”.

Arguably, on a large-scale view, this is how we should think of humans. Each individual human is generally intelligent in their own right. Yet from the perspective of chimpanzees, the problem was not that any single human was intelligent enough to take over the world, but rather that millions of humans underwent cultural evolution to make the human collective much more intelligent.

7. Lack of coordination between multiple deployed AGIs is a major source of existential risk.

Critch makes this case here, summarising this more extensive report. He distinguishes between single AIs which are aligned to single humans (single/single delegation), versus the problem of living in a society where many AIs are each used on behalf of many humans (multi/multi delegation):

It might be that future humans would struggle to coordinate on the globally safe use of powerful single/single AI systems, absent additional efforts in advance to prepare technical multi/multi delegation solutions.
For a historical analogy supporting this view, consider the stock market “flash crash” of 6 May 2010, viewed as one of the most dramatic events in the history of financial markets. The flash crash was a consequence of the use of algorithmic stock trading systems by competing stakeholders. If AI technology significantly broadens the scope of action and interaction between algorithms, the impact of unexpected interaction effects could be much greater, and might be difficult to anticipate in detail.

Note that he claims that this may be true even if single/single alignment is solved, and all AGIs involved are aligned to their respective users.

8. Conflict between multiple deployed AGIs risks causing large-scale suffering.

The Center on Long-Term Risk argues for this thesis in this research agenda. Key idea:

Many of the cooperation failures in which we are interested can be understood as mutual defection in a social dilemma. Informally, a social dilemma is a game in which everyone is better off if everyone cooperates, yet individual rationality may lead to defection. … An example of potentially disastrous cooperation failure is extortion (and other compelling threats), and the execution of such threats by powerful agents.

Since threats are designed to be strong disincentives, we should expect that the types of threats made against aligned AIs will be very undesirable by human moral standards. We should therefore try to design AGIs in ways which prevent threats from being carried out by or against them.
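
To make the notion of a social dilemma in the quoted passage concrete, here is a minimal sketch of a two-player game in which mutual cooperation is better for both players, yet each player's individually best response is to defect. The payoff numbers, the action names and the helper functions are all illustrative assumptions of mine (a standard prisoner's dilemma), not something taken from the research agenda.

```python
from itertools import product

# Hypothetical actions and payoffs, chosen only for illustration.
# PAYOFFS[(a, b)] = (payoff to player 0, payoff to player 1)
C, D = "cooperate", "defect"
PAYOFFS = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}


def best_response(player, other_action):
    """Action maximising `player`'s payoff, holding the other player's action fixed."""
    def payoff(action):
        profile = (action, other_action) if player == 0 else (other_action, action)
        return PAYOFFS[profile][player]
    return max((C, D), key=payoff)


def nash_equilibria():
    """Pure-strategy profiles in which each action is a best response to the other."""
    return [
        (a, b)
        for a, b in product((C, D), repeat=2)
        if best_response(0, b) == a and best_response(1, a) == b
    ]


def pareto_dominated(profile):
    """True if some other outcome is at least as good for both players and differs in payoff."""
    u = PAYOFFS[profile]
    return any(v[0] >= u[0] and v[1] >= u[1] and v != u for v in PAYOFFS.values())


if __name__ == "__main__":
    print("Equilibria:", nash_equilibria())
    print("Mutual defection is Pareto-dominated:", pareto_dominated((D, D)))
```

With these assumed payoffs, the only pure-strategy equilibrium is mutual defection, even though both players would prefer mutual cooperation. Threats and extortion involve richer dynamics than this one-shot game captures, so this should be read as a toy picture of the underlying incentive structure rather than a model of the scenarios the agenda is concerned with.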