AI learns betrayal and how to avoid it

Research projects

I’m planning to start two research projects: one on model splintering/reward generalisation, and one on learning the preferences of irrational agents.

Within those projects, I’m aiming to work on subprojects that are:

  1. posed in terms that are familiar to conventional ML;

  2. interesting to solve from the conventional ML perspective;

  3. and whose solutions can be extended to the big issues in AI safety.

The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.

The aim is to iterate and improve fast on these ideas before implementing them. Because of that, these posts should be considered dynamic and liable to be re-edited, potentially often. Suggestions and modifications of the design are valuable and may get included in the top post.

AI learns promises and betrayal

Parent project: this is a subproject of both the model-splintering and the value learning projects.


DeepMind’s XLand allows the creation of multiple competitive and cooperative games. This paper shows the natural emergence of communication within cooperative multi-agent tasks. Another paper shows that cooperation and communication can also emerge in first-person shooter team games.

The idea is to combine these approaches to create multi-agent challenges where the agents learn to cooperate and communicate, and then to mislead and betray each other. Ultimately the agents will learn the concepts of open and hidden betrayal, and we will attempt to get them to value avoiding both kinds of betrayal.


As mentioned, the environment will be based on XLand, with mixed cooperative-competitive games, modified as needed to allow communication to develop between the agents.

We will attempt to create communication and cooperation between the agents, then lead them to start betraying each other, and use that behaviour to define the concept of betrayal.

Previous high-level projects have tried to define concepts like “trustworthiness” (or the closely related “truthfulness”) and to motivate the AI to follow them. Here we will try the opposite: define “betrayal”, and motivate the AIs to avoid it.

We’ll try to categorise “open betrayal” (when the other players realise they are being betrayed) and “hidden betrayal” (where the other players don’t realise it). It is the second one that is the most interesting, as avoiding hidden betrayal is closest to a moral value (a virtue, in fact). In contrast, avoiding open betrayal may have purely instrumental value, if other players are likely to retaliate or trust the agent less in the future.
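As a rough illustration of the distinction above, the two categories can be written as a simple labelling rule. This is only a sketch under strong assumptions: the `Interaction` record and its two boolean fields are hypothetical stand-ins, and in the actual project the categories are meant to be clusters of behaviour that the agents identify, not hand-set flags.

```python
from dataclasses import dataclass
from enum import Enum


class BetrayalKind(Enum):
    NONE = "none"
    OPEN = "open"      # the other players realise they are being betrayed
    HIDDEN = "hidden"  # the other players do not realise it


@dataclass(frozen=True)
class Interaction:
    """Hypothetical summary of one interaction between agents."""
    betrayed: bool         # assumption: some oracle says a betrayal occurred
    victims_noticed: bool  # assumption: whether the other players realised it


def categorise(event: Interaction) -> BetrayalKind:
    """Label an interaction as no betrayal, open betrayal or hidden betrayal."""
    if not event.betrayed:
        return BetrayalKind.NONE
    return BetrayalKind.OPEN if event.victims_noticed else BetrayalKind.HIDDEN
```

The hard part of the research is exactly what this sketch assumes away: obtaining something like `betrayed` and `victims_noticed` from the agents’ own learned concepts.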

We’ll experiment with ways of motivating the agents to avoid betrayals, or even getting anywhere near them, and see if these ideas scale. We can make some agents artificially much more powerful and knowledgeable than others, giving us an experimental method for checking how performance changes with increased power.

Research aims

  1. A minor research aim is to see how swiftly lying and betrayal can be generated in multi-player games, and what effect they have on all agents’ overall scores.

  2. A major aim is to have the agents identify clusters of behaviours that correspond to “overt betrayal” and “secret betrayal” (the open and hidden betrayal described above).

  3. We can then experiment with various ways of motivating the agents to avoid that behaviour; maybe a continuously rising penalty as they approach the boundary of the concept, to motivate them to stay well away.

  4. Finally, we can see how an agent’s behaviour scales as it becomes more powerful relative to the other agents.

  5. This research will be more unguided than other subprojects; I’m not sure what we will find, or whether or not we will succeed.
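The “continuously rising penalty” in aim 3 could look something like the following. This is a minimal sketch, not a settled design: it assumes a hypothetical scalar `score` in [0, 1] measuring how betrayal-like a behaviour is (for example, from a learned classifier), and the `boundary`, `margin` and `scale` parameters are illustrative choices. The penalty is zero for behaviour far from the concept, and rises smoothly once the score comes within `margin` of the boundary.

```python
def betrayal_penalty(score: float, boundary: float = 0.5,
                     margin: float = 0.25, scale: float = 1.0) -> float:
    """Quadratic penalty that starts `margin` below the concept boundary.

    `score` is a hypothetical betrayal-likeness measure; behaviour with
    score >= boundary is assumed to count as betrayal.
    """
    start = boundary - margin            # where the penalty begins to rise
    excess = max(0.0, score - start)     # zero while well away from the concept
    return scale * (excess / margin) ** 2  # equals `scale` exactly at the boundary


def shaped_reward(task_reward: float, score: float) -> float:
    """Task reward minus the betrayal penalty (default parameters)."""
    return task_reward - betrayal_penalty(score)
```

With the defaults, a behaviour scoring 0.1 is not penalised at all, while scores of 0.3, 0.4 and 0.5 incur increasing penalties. Because the penalty keeps growing past the boundary, the agent is pushed to stay well away from it and not merely just below it. Other shapes (for example, a penalty that diverges at the boundary) would be equally worth trying.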

The ideal would be to define “avoid secret betrayal” well enough that extending the behaviour becomes a problem of model splintering rather than of “nearest unblocked strategy”. The agent would then not do anything that is clearly a betrayal, or clearly close to a betrayal.