Why multi-agent safety is important

Alternative Title: Mo’ Agents, Mo’ Problems

This was a semi-adversarial collaboration with Robert Kirk; views are mostly Akbir’s, clarified with Rob’s critiques.

Target Audience: Machine Learning researchers who are interested in Safety, in particular those currently focused on Single Agent Safety problems.

Context: Recently I’ve been discussing how to meaningfully de-risk AGI technologies. My argument is that even if single agent alignment is solved, there is a complementary set of problems which must be addressed. Below I motivate why we should expect a multipolar scenario, argue that existing safety methods are likely unsatisfactory in it, and describe new failure modes that will need to be addressed.

My multipolar world

Why should we expect a multipolar scenario?

Much like today, the future will likely consist of many organisations (states and companies) enthusiastically building AI systems. Along this path, many will notice they can improve their systems by scaling data/compute, and as such will deploy these systems into the real world to scale up data collection.

When deployed in the real world, systems will become smarter much more quickly: there will be a “takeoff” in capabilities. I claim this process will take months to years, with bottlenecks around collecting data (people are slow to adopt new products). As many organisations see promise in this approach, many takeoffs are likely to happen and to compete with each other, as consumers find themselves choosing between competing products (“shall I use DALL-E 2 or Imagen for my illustration?”). Because multiple competing companies will all be building more and more general systems, and because of the (relatively low) speed[1] of these takeoffs, I believe there will be many competing AI systems of similar competence all acting and learning in the real world. That is, even if one company is in the lead, it will not immediately achieve a decisive strategic advantage past a certain capability threshold; equivalently, that threshold sits further along the capability axis than the capability required for these systems to have a transformative impact on the world and to influence each other while learning. In VC speak: data is the moat that keeps companies defensible against more capable systems.

My belief in the above scenario is rooted in pragmatism from start-up land. I’ve deployed a couple of ‘narrow’ AI services, and as such my assumptions are best stated as: 1) takeoffs will be slow, 2) takeoffs require continual learning in the real world, and 3) the real world will consist of many agents each optimising for their own takeoff.

What a multipolar scenario looks like: an analogy

As a metaphor for the scenario, let’s consider sending a kid to school. Our “babyAGI” (an offline- and/or simulation-trained model) is ready for its first day of school (being deployed in the real world, starting its takeoff). At school, the babyAGI will learn more about the world through experimentation, study and interaction. By going to school, our babyAGI can become a smarter and more productive AGI for society. Our babyAGI is perfectly aligned on its first day of school, that is to say, it holds the exact views of its parents/creators. At the end of each school day, as parents we can check what our babyAGI has learnt and make sure it remains aligned.

The thing is, our babyAGI is likely not the only one attending school: there are probably other babyAGIs about. They could be older and smarter, or simply in its grade. Unipolar safety is making sure that your child remains aligned in a class of 1; multipolar safety is making sure that your child remains aligned in a class of 30. The risks discussed here originate from these multi-agent interactions and each subsequent learning step taken at school.

Keeping your kid aligned in a multi-agent school is hard. There are many interactions we have to prepare for: there are complex social dynamics (mixed-games), slang language (uninterpretable communication), and bad influences (different normative values) all around. These new influences and interactions require consideration to keep agents aligned. Below we make these concepts clear—framing them as either making single agent safety problems harder or as introducing entirely new types of problems.

Making existing problems harder

Meeting new kids and why robustness gets harder

an old problem harder: An obvious failure case for AI systems is that they can end up in situations never seen in training. From both ML and x-risk perspectives[2], robustness to distributional shift or adversarial attacks gets harder as the space of inputs grows. The real world of observations is incredibly large, and most importantly it is filled with weird, entangled agents that act both on their observations and on their current policies. A single-agent approach assumes we can perfectly disentangle this: that I can simulate every observation while remaining ignorant of what caused it. More precisely, it assumes the distribution of states generated for training is a true representation of the real world. This is dangerous for a couple of reasons:

  1. Environments/agents can have strange feedback loops, for example the Amazon book-pricing incident and the stock-market flash crash, where runaway feedback loops caused collective failures (a toy version is sketched below). These are hard to simulate a priori, especially when we don’t have full models of the environment.

  2. The real world may contain agents not seen during training. These can generate unseen behaviours or distributions of states we are incapable of generating during pre-deployment training (e.g. the real world contains smarter agents). Empirically, this is why independent RL fails in multi-agent games with non-transitive strategy spaces.

Ignoring these cases can be dangerous and can lead to cooperative failures.
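To make the feedback-loop point concrete, here is a minimal, purely illustrative sketch (mine, not from the post) in the spirit of the Amazon book-pricing incident: two sellers’ bots each apply a locally sensible rule that references the other’s price, and the loop, not either rule on its own, is what sends prices running away. The specific markups are illustrative, not the real ones.

```python
# Illustrative only: two pricing bots, each setting its price as a fixed
# multiple of the other's. Neither rule is crazy in isolation; the runaway
# behaviour comes from the interaction between the two agents.

def simulate_pricing_loop(steps=20, markup_a=1.27, markup_b=0.998, start_price=20.0):
    price_a = price_b = start_price
    history = []
    for _ in range(steps):
        price_a = markup_a * price_b   # seller A prices just above seller B
        price_b = markup_b * price_a   # seller B undercuts seller A slightly
        history.append((price_a, price_b))
    return history

if __name__ == "__main__":
    for t, (a, b) in enumerate(simulate_pricing_loop()):
        print(f"step {t:2d}: A = {a:10.2f}, B = {b:10.2f}")
```

Each bot is individually harmless; it is the pair of policies in the same environment that produces the failure, which is exactly the structure a single-agent training distribution struggles to anticipate.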

why existing approaches fail: From the x-risk perspective, the most-discussed current approach to these kinds of failures is relaxed adversarial training. The addition of multiple agents to the deployment environment of the trained AGI affects this style of technique in two ways. First, it quantitatively increases the difficulty of searching over the possible space of inputs to the model, as described above, even when working in the pseudo-input space. We now need to search for adversarial inputs over the joint space of world states and co-player behaviours, and the space of behaviours seems very large (especially considering that these behaviours will plausibly be aiming at different goals). Whilst population play and other methods address this, those approaches are largely restricted to behaviour landscapes with structure (mostly games of skill).

Second, it seems to qualitatively change the space of (pseudo-)inputs and the methods for generating them. Generating plausible but adversarial behaviour from other AI systems at a similar level of competence seems more difficult than generating world-states in which the only other agents are (groups of) humans. Further, in the mixed-game setting, the input-generating adversaries will probably aim to make the agent’s reward as low as possible by simulating fully non-cooperative co-player behaviour, which might not be realistic behaviour if we are aiming for (or expecting) prosocial or at least partly cooperative agents. This places an additional constraint on the input-generating adversaries that is hard to capture during training.
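As a rough picture of the quantitative point, here is a minimal sketch of what naively searching that joint space could look like. Every helper here (sample_world_state, sample_co_player_policy, rollout, is_unacceptable) is a hypothetical placeholder rather than anything we know how to build, and real relaxed adversarial training works over pseudo-inputs rather than explicit rollouts; the point is only that the search now ranges over world states and co-player behaviours together.

```python
import random

# Hypothetical sketch only: a naive adversarial search over the *joint* space
# of world states and co-player behaviours. Every helper below is a
# placeholder for a component we do not actually know how to build.

def sample_world_state(rng):
    """Placeholder: propose a candidate deployment situation (pseudo-input)."""
    return {"seed": rng.random()}

def sample_co_player_policy(rng):
    """Placeholder: propose a plausible policy for another AI system.
    This is the axis that a multipolar world adds to the search."""
    return {"goal_seed": rng.random(), "cooperativeness": rng.random()}

def rollout(agent, world_state, co_player_policy):
    """Placeholder: simulate the agent interacting with the co-player."""
    return {"agent_behaviour": None}

def is_unacceptable(outcome):
    """Placeholder: the overseer / catastrophe detector."""
    return False

def adversarial_search(agent, budget=10_000, seed=0):
    rng = random.Random(seed)
    failures = []
    for _ in range(budget):
        world = sample_world_state(rng)
        co_player = sample_co_player_policy(rng)   # joint search: state x behaviour
        outcome = rollout(agent, world, co_player)
        if is_unacceptable(outcome):
            failures.append((world, co_player))
    return failures
```

Note that a realistic co-player sampler would also need the “plausibly cooperative yet adversarially chosen” constraint described above, which is exactly the part that is hard to specify.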

Slang and why it makes interpretability harder

an old problem harder: A multipolar world introduces a new set of interactions: an inter-agent communication channel. A priori, introducing a new surface through which each agent can be influenced by externalities (and in particular by other, misaligned agents) is undesirable. The questions posed are: “can we interpret communication along this channel?” and “do existing safety measures alleviate this problem?”

Emergent communication has been studied for a while now, with the goal of understanding how gradient-based learners communicate and establish protocols with one another. Whilst there are numerous works, the meta-result is largely that the protocols learnt are rarely like those seen in natural language and are fairly uninterpretable to humans. This is negative in two ways: 1) the protocols are untethered from how we communicate, and 2) it is hard to predict how and where this learning will happen.

The major takeaway from the field has been that communication becomes encoded in seemingly uninterpretable components, with a large open question being how to evaluate the structure of these new languages. Even in a world of Socratic Models, new communication protocols could develop in obscure components of regular channels which we could not easily evaluate.


why this makes safety harder: Interpretability, understanding why a model makes a decision, is a promising approach to safety. One of its goals can be understood as mechanistically understanding why neural networks make the decisions they do.

With a mechanistic understanding, many believe we can create a training signal to supervise the process as well as the output of a model. In a multipolar world where this is being done in a live deployment environment containing other agents, there is extra difficulty in interpreting the communication channels; and if we don’t know what is actually being communicated in those channels, then we don’t even know what the correct process or output of the model would be. Alternatively, we could use the mechanistic understanding as a filter/mulligan once the model is trained, to stop us deploying a misaligned AI. This would certainly occur before our babyAGI is sent to school, but could also happen continuously, and again we need to be able to understand the communications to know whether our agent is doing bad things, or doing good things given the information it is receiving.

However, current approaches to the above uses of interpretability start from a position of understanding the phenomenon we’d like to uncover first. We query in the form “why does this model think this is a hotdog?”. This is decision analysis. What we require to understand emergent communication is to identify proactively that a channel carries more information than before. This is specifically important for a communication channel because there is an infinite set of decision questions that amount to “has this model started a new communication protocol?”. Even if we notice that the entropy (or some other heuristic) of outgoing messages has increased, or run PCA and find new structure in the communication channel, without developing the capacity to evaluate this communication we cannot evaluate what has been said. Finally, unlike other interpretability work, we effectively only have half of a conversation; whilst we can mimic the communications of another agent, it is much harder to extrapolate out, generate new data points, and predict the performance of our agent given these generated communications.
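As a toy illustration of how weak that kind of heuristic is, here is a small sketch (mine, with made-up data) that flags a jump in the empirical entropy of an agent’s outgoing messages. At best it tells us that something about the channel has changed; it tells us nothing about what is being said.

```python
import numpy as np

# Toy illustration with made-up data: track the empirical entropy of an
# agent's outgoing discrete messages and flag a jump. This is exactly the kind
# of heuristic described above: it can say "the channel changed", not "here is
# what the new protocol means".

def message_entropy(messages, vocab_size):
    """Empirical entropy (in bits) of a batch of discrete message tokens."""
    counts = np.bincount(np.asarray(messages).ravel(), minlength=vocab_size)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

def channel_drift_flag(old_messages, new_messages, vocab_size, threshold_bits=0.5):
    """Flag when outgoing-message entropy rises by more than a threshold."""
    drift = message_entropy(new_messages, vocab_size) - message_entropy(old_messages, vocab_size)
    return drift > threshold_bits, drift

# Fake usage: a 16-token vocabulary where, after deployment, the agent starts
# using many more symbols than it did during training.
rng = np.random.default_rng(0)
train_msgs = rng.integers(0, 4, size=1000)    # training-time messages use few symbols
deploy_msgs = rng.integers(0, 16, size=1000)  # deployment-time messages use many more
flagged, drift = channel_drift_flag(train_msgs, deploy_msgs, vocab_size=16)
print(f"entropy increased by {drift:.2f} bits -> flagged: {flagged}")
```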

New problems

Fights in the playground—the structure of mixed-games

new problem: the real world is a mixed-motive game, in which agents can compete, cooperate or simply ignore one another. In the playground, if we choose to play football, we need to coordinate with 20 other players to join us, and to establish the rules we will adhere to and when the game will end. Even if you choose to play alone, how others act will affect you. To do well in this environment, we need to be mindful of others (and their differing normative values), be robust to exploitation, and punish cheaters. In situations where this doesn’t happen, collective failures such as the tragedy of the commons can occur.

At its crux, a single-agent approach is oblivious to what is an agent and what is the environment. Imagine we get wet when we are in the playground: how do we distinguish between the random event of rain and the new kid with a water gun? Importantly, we can’t change the rain but we can convince the new kid to calm down with that gun.

The first issue is that learning to cooperate and influence others is a fundamentally multi-agent task. Even if your alignment technique would work offline, you’re now in a world where other agents are adversarially trying to make you do bad things (by your own lights), and many alignment techniques provide no guarantees under that kind of pressure. Further, for any human supervision, it’s less clear what the desired behaviour is in a mixed game, and so it’s harder to evaluate.

In empirical studies of mixed-motive games, such as Melting Pot and the iterated prisoner’s dilemma, it is difficult to converge to cooperation. These works have demonstrated that learning to cooperate and coordinate is exceptionally difficult for gradient-based learners. Moreover, collapsing into a defective equilibrium is often absorbing (and so getting out of one is exceptionally hard). So in the playground, football dissolves into every man for himself, fights break out, and social norms do not form. Note that if failures of cooperation happen sufficiently quickly, this can become dangerous for any bystanders (us) very quickly.
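To see how stubborn this collapse is even in the simplest setting, here is a toy demonstration (mine, not drawn from those works): two independent, stateless learners repeatedly playing the prisoner’s dilemma, each greedily estimating the value of its own actions, end up at mutual defection even though mutual cooperation pays more per step.

```python
import random

# Toy demonstration: two independent, stateless Q-learners repeatedly playing
# the prisoner's dilemma. Each learns a best response to the other, and they
# collapse into mutual defection despite mutual cooperation paying more.

PAYOFFS = {  # (my action, their action) -> my reward
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def act(q, eps, rng):
    if rng.random() < eps:
        return rng.choice(["C", "D"])
    return max(q, key=q.get)

def train(steps=5000, lr=0.1, eps=0.1, seed=0):
    rng = random.Random(seed)
    q1 = {"C": 0.0, "D": 0.0}
    q2 = {"C": 0.0, "D": 0.0}
    for _ in range(steps):
        a1, a2 = act(q1, eps, rng), act(q2, eps, rng)
        r1, r2 = PAYOFFS[(a1, a2)], PAYOFFS[(a2, a1)]
        q1[a1] += lr * (r1 - q1[a1])   # each agent optimises only its own payoff
        q2[a2] += lr * (r2 - q2[a2])
    return q1, q2

q1, q2 = train()
print("agent 1:", q1, "-> plays", max(q1, key=q1.get))
print("agent 2:", q2, "-> plays", max(q2, key=q2.get))
```

Once both sides defect, neither learner’s estimates ever again favour cooperating against a mostly-defecting partner, which is the “absorbing” property in miniature.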

Leaving the playground for the moment: the models interacting within the same space are competing over similar resources. If deployed to “create paperclips” or “generate profitable movies”, these babyAGIs are likely to be indirectly competing with one another, either over common resources (consumer attention and data) or over the outcomes they’d like to achieve (profits). This competitive pressure alters your training objective and, as Buck S would say, “makes for a fucked up loss function”. Even if agents are not directly interacting, their data/takeoffs are likely affected by one another.

This brings us to the second difficulty: it is extra hard to stay aligned with some goal when there is lots of competitive pressure (due to the mixed game) to do a bunch of other stuff, and effectively adversaries in the environment trying to screw you over. Incentives will exist here that are perverse and misaligned with the user’s preferences, but hard to move away from. Note that this is a structural risk arising from many systems in a mixed game: collapsing to a defective equilibrium is likely, which is clearly bad for all of us. The difference with AGI is that this can happen at a scale where we have no human stakeholders.

why existing solutions fail: the difficulty of mixed games lies in their structure, and how we approach this is very much an open research question. There are perhaps three flavours of solution to this problem currently, each with its own limitations. The first is the way humans solve such problems: by establishing impartial bodies which impose restrictions on defecting and create cooperative Schelling points (think of the UN Climate Accord). The limitation of such an approach is that most organisational work is done post hoc (nuclear treaties came after WW2) and will still require some technical solution. There are also still massive incentives for agents to cheat (lie about emission levels).

There are two flavours of technical solution. The first is to assume we can faithfully have all agents work together, which is to all intents and purposes very unlikely to occur in the real world. This involves training agents to knowingly choose altruistic (but exploitable) policies. It can only be done if agents can evaluate multi-agent outcomes well, and if the principals of said models are okay with this.

The second set of approaches focuses on “opponent shaping”, which involves correctly incentivising agents to take actions which influence their opponent; the canonical example is the tit-for-tat strategy. Extensions to this work also exist, in which agents are aware that others may have different normative values. This fundamentally requires a theory of mind of opponent agents, and as such currently only works from the privileged position of knowing an opponent’s parameters or learning approach.
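For intuition on why tit-for-tat counts as opponent shaping, here is a toy sketch (mine; all parameters illustrative): a tabular Q-learner whose state is its own previous move learns to cooperate against tit-for-tat, because defecting now is punished on the next round, while against an unconditional defector it simply learns to defect.

```python
import random

# Toy sketch: a tabular Q-learner whose state is its own previous move plays
# the iterated prisoner's dilemma against a fixed opponent. Against tit-for-tat
# (which copies our previous move) cooperation is learned, because defecting
# now is punished on the next round; against an unconditional defector the
# learner just learns to defect. All parameters are illustrative.

PAYOFFS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(our_last_move):
    return our_last_move            # copies whatever we did last round

def always_defect(our_last_move):
    return "D"

def train_against(opponent, steps=50_000, lr=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in "CD" for a in "CD"}   # state = our previous move
    state = "C"
    for _ in range(steps):
        a = rng.choice("CD") if rng.random() < eps else max("CD", key=lambda x: q[(state, x)])
        opp = opponent(state)                        # opponent reacts to our history
        r = PAYOFFS[(a, opp)]
        next_state = a                               # our move becomes the new state
        target = r + gamma * max(q[(next_state, x)] for x in "CD")
        q[(state, a)] += lr * (target - q[(state, a)])
        state = next_state
    return {s: max("CD", key=lambda x: q[(s, x)]) for s in "CD"}

print("policy vs tit-for-tat  :", train_against(tit_for_tat))    # cooperates in both states
print("policy vs always-defect:", train_against(always_defect))  # defects in both states
```

The shaping only works here because the learner can attribute tomorrow’s punishment to today’s defection; that is exactly the theory-of-mind-like knowledge of the opponent that the text notes we rarely have for black-box co-players.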

Single-agent safety would also suggest that we can solve this problem by simply doing offline learning and policing this space effectively. However, to notice and shape agents towards a cooperative equilibrium we need two fundamental things: 1) to know what cooperative equilibrium we want, and 2) to believe the other agent is also moving towards it. Of these, I find 1) very difficult to achieve for complex systems (especially when AGIs are likely to be making decisions at a level above human understanding); at a minimum, a new suite of interpretability work is required to evaluate the best cooperative equilibria. Finally, note that because it is hard to interpret communication or actions in multi-agent systems (as argued above), “competitiveness” can leak into your offline learning.

Mixing with the wrong crowd—social learning

new problem: an interesting side effect of intelligent systems (and RL agents) is their ability to learn new methods from experts within their environments. This phenomenon, called social learning, occurs when agents identify successful co-players in an environment and learn to mimic their trajectories.

The social learning problem rests on a single assumption: that your agent is likely to be deployed into a sea of agents aligned differently from it. This poses a series of issues to consider:

  1. It is hard to simulate other intelligent but differently aligned agents during training (see the “Meeting New Kids” point).

  2. It is hard, at deployment, to observe whether other agents are correctly aligned (by your own definition). It is unclear whether this is a capability we want to give to our agent, or whether we even have this capability for black-box agents.

  3. At deployment, social learning, where agents mimic or bootstrap behaviour from perceived experts, opens a weak link in the alignment of our primary agent.

Below, in a scenario where our BabyAGI is “nearly” aligned, we demonstrate how social learning can lead to agents becoming grossly misaligned.

Assume BabyAGI is “nearly” aligned, by which we mean its behaviour sits within an epsilon-ball of error around the “true alignment” we’d like within an agent. Let’s also say that we have a suite of Alignment Police which regulate BabyAGI every day. During deployment, BabyAGI is exposed to a BadAgent which is able to outperform it on a task (e.g. it produces more energy but damages the environment). At deployment time, from BabyAGI’s perspective BadAgent also appears nearly aligned, so BabyAGI learns to mimic its behaviour, shifting its own epsilon-ball of errors. BadAgent can then mimic this behaviour in turn, with its own Overton window shifted, leading to both agents being dangerously misaligned, all without any gradient updates. Empirically, we’ve seen how prompting (i.e. no gradient updates) can make language models go from helpful to toxic; I’d imagine similar fragility in going from aligned to misaligned.

The point of failure lies in evaluating which agents are genuinely well aligned versus merely appearing to be well aligned. Specifically, we now require our BabyAGI to fully evaluate how well aligned other agents are before imitating “seemingly expert” co-players in the environment.
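Here is a deliberately crude, one-dimensional caricature of that ratchet (entirely my own construction, with made-up numbers): the imitation check only ever looks one epsilon-ball at a time, so a sequence of locally acceptable imitation steps compounds into gross misalignment without a single gradient update.

```python
# A deliberately crude, one-dimensional caricature of the story above (my own
# construction, with made-up numbers). "Alignment" is a scalar in [0, 1], with
# 1.0 the intended values. BabyAGI imitates any co-player that (a) appears to
# sit within its epsilon-ball of acceptable behaviour and (b) outperforms it.
# BadAgent always positions itself just inside that ball, but a little worse,
# so each locally acceptable imitation step ratchets BabyAGI downward, with no
# gradient updates and with the epsilon check passing at every step.

def simulate_drift(steps=20, epsilon=0.1, mimic_rate=0.5):
    baby = 1.0                                     # nearly aligned at deployment
    for t in range(steps):
        bad = baby - 0.9 * epsilon                 # just inside BabyAGI's tolerance
        passes_check = abs(baby - bad) <= epsilon  # BabyAGI's alignment filter
        outperforms = True                         # assumed: cutting corners pays
        if passes_check and outperforms:
            baby = max(0.0, baby + mimic_rate * (bad - baby))  # mimic the "expert"
        if t % 5 == 0:
            print(f"step {t:2d}: babyAGI alignment = {baby:.3f}")
    return baby

simulate_drift()
```

The daily check in the story above passes at every step because each individual change looks small; the failure only becomes visible relative to where the agent started.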

  1. ^

    Quicker than pre-deployment, but slower than foom

  2. ^

    where the analogous concepts are deployment-only failures, objective misgeneralisation, or inner alignment failures