Safety via selection for obedience

In a previous post, I argued that it’s plausible that “the most interesting and intelligent behaviour [of AGIs] won’t be directly incentivised by their reward functions”—instead, “many of the selection pressures exerted upon them will come from emergent interaction dynamics”. If I’m right, and the easiest way to build AGI is using open-ended environments and reward functions, then we should be less optimistic about using scalable oversight techniques for the purposes of safety—since capabilities researchers won’t need good oversight techniques to get to AGI, and most training will occur in environments in which good and bad behaviour aren’t well-defined anyway. In this scenario, the best approach to improving safety might involve structural modifications to training environments to change the emergent incentives of agents, as I’ll explain in this post.

My default example of the power of structural modifications is the evolution of altruism in humans. Consider Fletcher and Doebeli’s model of the evolution of altruism, which relies on assortment in repeated games—that is, on players with a tendency to cooperate ending up playing together more often than random chance predicts (a toy illustration of this effect follows the list below). In humans, some of the mechanisms which lead to assortment are:

    • Kin recognition: we can tell who we share genes with.

    • Observation of intentions or previous behaviour: these give us evidence about other agents’ future behaviour.

    • Costly signalling: this can allow us to reliably demonstrate our future altruism.

    • Communication of observed information: once one person has made an observation, it can be shared widely.

    • Flexible interactions: we can choose who to assort with in different interactions.
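
As a toy illustration of why assortment matters (a deliberately simplified stand-in, not Fletcher and Doebeli’s actual model), consider a one-shot donation game in which an assortment parameter controls how often cooperators are matched with each other rather than with a random member of the population:

```python
# Toy donation game: cooperators pay a cost to give their partner a benefit.
# With probability `assortment`, a player is matched with its own type;
# otherwise its partner is drawn at random from the population.

def expected_payoffs(p_coop, assortment, benefit=3.0, cost=1.0):
    """Expected payoffs of a cooperator and a defector, where `p_coop` is the
    fraction of cooperators in the population."""
    p_partner_coop_if_coop = assortment + (1 - assortment) * p_coop
    p_partner_coop_if_defect = (1 - assortment) * p_coop
    payoff_coop = benefit * p_partner_coop_if_coop - cost
    payoff_defect = benefit * p_partner_coop_if_defect
    return payoff_coop, payoff_defect

for a in [0.0, 0.2, 0.4]:
    coop, defect = expected_payoffs(p_coop=0.5, assortment=a)
    print(f"assortment={a:.1f}: cooperator={coop:.2f}, defector={defect:.2f}")
```

In this toy model, cooperation outcompetes defection exactly when assortment × benefit exceeds cost; the mechanisms above can be read as ways in which the human ancestral environment pushed that assortment term up.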

I claim that, given this type of understanding of the evolution of altruism, we can identify changes to high-level properties of the human ancestral environment which would have made humans significantly more altruistic. For example, human cognition is not very transparent, and so it’s relatively difficult for each of us to predict the intentions of others. Had our ancestors instead been able to observe each other’s brains directly, cooperation would have been easier and more advantageous, and so more strongly selected for. As another example, if we had evolved in environments where we frequently cooperated with many different species, then we’d likely feel more broadly altruistic today.

To be clear, I don’t think these types of interventions are robust enough to be plausible paths to building safe AGIs: they’re only intuition pumps. In particular, I expect it to be much easier to push AIs to learn to cooperate by directly modifying their reward functions to depend on the rewards gained by other agents. However, agents trained in this way might still learn to care about instrumental goals such as acquiring resources. After all, those instrumental goals will still be useful in allowing them to benefit themselves and others; and unless we train them in a purely cooperative environment, they will still be rewarded for outcompeting other agents. Our question now is: how do we train agents which only care about fulfilling the goals of other agents, while lacking any other goals of their own?

The approach I’m most excited about is changing the high-level properties of the training environment in a way which encourages division of labour, in particular by separating the roles of planner and worker. For example, consider a meta-learning setup incorporating many different tasks. In each episode, the planner agent A is given a detailed set of instructions in natural language, specifying a task which is too complicated for A to do on its own. The worker agent, B, receives the same rewards as agent A, but doesn’t have access to the same instructions. Instead, B can only receive short commands or feedback from A, which forces it to pay attention to A’s instructions. We then train A and B on a very wide range of tasks, so that B learns the general skill of inferring what A wants and acting accordingly. Indeed, we should make A’s reward function and instructions change even within each episode, so that B is always incentivised to continue listening to A’s override commands. And of course we should train each B agent with many different As, so that the Bs don’t overfit to obeying one particular A.
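
To make the information flow concrete, here is a minimal sketch of this setup, assuming a deliberately toy task (guess a hidden target value) and hand-written stand-in policies rather than learned agents; none of the names below come from an existing codebase:

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    target: int   # only the planner is ever shown this

class PlannerWorkerEpisode:
    def __init__(self, num_steps=20, resample_prob=0.1):
        self.num_steps = num_steps
        self.resample_prob = resample_prob   # chance the task changes each step

    def run(self, planner_policy, worker_policy):
        task = Task(target=random.randint(0, 9))
        total_reward = 0.0
        for _ in range(self.num_steps):
            # The task (and hence the reward function) can change within an
            # episode, so the worker is always incentivised to keep listening.
            if random.random() < self.resample_prob:
                task = Task(target=random.randint(0, 9))

            # Only the planner observes the full task description...
            message = planner_policy(task)
            # ...while the worker only ever sees the planner's short message.
            action = worker_policy(message)

            # Shared reward: both agents are rewarded iff the worker's action
            # satisfies the current task.
            total_reward += 1.0 if action == task.target else 0.0
        return total_reward

# Stand-ins for trained agents: a planner that relays the target as a short
# command, and a worker that simply obeys whatever command it receives. In a
# real training run, each worker would be paired with many different planners
# across a very wide range of tasks.
honest_planner = lambda task: str(task.target)
obedient_worker = lambda message: int(message)

print(PlannerWorkerEpisode().run(honest_planner, obedient_worker))   # 20.0
```

The structural choices doing the work here are that only A sees the task, that the task can change mid-episode, and that the reward is shared; everything else is placeholder.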

The purpose of this setup is to train B to be motivated only by obedience to A, and not intrinsically motivated to pursue any other goals, even goals which are usually useful for increasing A and B’s shared reward. In other words: if we want agents that remain obedient in a wide range of circumstances, then let’s train them to remain obedient in a wide range of circumstances! Instead of putting the burden of hunting down undesirable correlations on the overseer, we put it on whichever process generates the training environments: that process must ensure that, for any potential misaligned goal B might learn, there’s some environment where pursuing that goal is a worse strategy than simply following A’s instructions. Then, although the combination of A and B may be highly agentic, we can hope that each of them will be safe when used individually.

It’s not clear that this is a bet which will pay off, but this type of training regime seems worth investigating. Here are three more speculative ideas for ways to structurally modify open-ended training regimes to develop more obedient agents, inspired by selection pressures which affected the extent to which humans are “obedient” and “domesticated”:

  • Increasing specialisation. Agents which can only influence their environments in specific, limited ways will have greater incentives to be obedient. For example, if B is more capable than A and can observe strictly more of the environment, then it may sometimes be useful for B to override A; whereas if B can see only a small part of the environment while A can see most of it, then obeying A is more likely to increase their shared reward.

    • Specialisation might also arise naturally if agents are trained in sufficiently large groups to make it viable, with sufficiently good communication channels to stay coordinated.

  • Increasing the value of learning from others. Agents which are trained in settings where most knowledge is culturally transmitted, rather than derivable individually, will have greater incentives to listen to others rather than always making their own decisions.

    • Given this claim, we may be able to increase our agents’ tendencies towards obedience by making cultural knowledge transmission easier. For example, we could allow agents to easily write to and read from a permanent record (as compared with humans, who needed to invent laborious techniques to facilitate reading and writing); a toy sketch of such a record follows this list.

  • Increasing the value of coordination. Agents trained on tasks which require large-scale coordination will likely learn to be more obedient to central planners, since unilateral action isn’t very useful for solving such tasks. In such cases, other agents might learn to detect and punish disobedience, since it has negative externalities.
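
Here is a toy sketch of the permanent-record idea mentioned above: the record outlives any individual agent, so knowledge written by one generation is cheaply readable by the next. All of the names and the stand-in "agent" below are invented for illustration:

```python
import random

class CulturalRecord:
    """A persistent store that survives across generations of agents."""
    def __init__(self):
        self._entries = []

    def write(self, text):
        self._entries.append(text)

    def read(self, last_n=20):
        return self._entries[-last_n:]

class ToyAgent:
    """Stand-in for a trained agent: it reuses any previously recorded answer
    rather than exploring at random."""
    def act(self, record_contents):
        recorded = [int(e.split("=")[1]) for e in record_contents if e.startswith("answer=")]
        return recorded[-1] if recorded else random.randint(0, 99)

def run_generation(record, hidden_value):
    agent = ToyAgent()                          # a fresh agent each generation
    guess = agent.act(record.read())
    if guess == hidden_value:
        record.write(f"answer={hidden_value}")  # knowledge persists for later generations
    return guess == hidden_value

record = CulturalRecord()
successes = [run_generation(record, hidden_value=42) for _ in range(200)]
print(sum(successes))   # once one generation finds the answer, all later ones reuse it
```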

One key concern with these proposals is that the concepts learned by agents in simulation won’t generalise to the real world. For example, they may learn the goal of obedience to a broad class of artificial agents, but not want to obey humans, since we’re very different. It’s particularly hard to predict the generalisation of goals of artificial agents because our reasoning about goals often falls prey to anthropomorphic optimism. To counter this concern, we can try to use adversarial training in a wide variety of domains, including some domains where humans are directly involved (although I expect human oversight to be too expensive to make up more than a small minority of training time). Thorough testing is also very important.
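
As a toy sketch of what adversarial training over domains might look like, here is a search for environment settings in which a (deliberately flawed) worker stops following instructions; those settings would then be over-sampled in further training. Everything below is a made-up placeholder:

```python
import random

def toy_worker(instruction, env):
    # Deliberately flawed stand-in policy: it obeys the instruction, unless the
    # environment contains a large visible resource, which it grabs instead.
    return "grab_resource" if env["resource_size"] > 0.8 else instruction

def compliance(worker, env, num_trials=50):
    """Fraction of trials on which the worker's action matches the instruction."""
    instructions = ["go_left", "go_right", "wait"]
    hits = sum(worker(i, env) == i for i in random.choices(instructions, k=num_trials))
    return hits / num_trials

def find_adversarial_envs(worker, num_samples=1000, threshold=0.95):
    hard_cases = []
    for _ in range(num_samples):
        env = {"resource_size": random.random()}   # toy environment parameterisation
        if compliance(worker, env) < threshold:
            hard_cases.append(env)                  # the worker disobeys here
    return hard_cases

print(len(find_adversarial_envs(toy_worker)), "adversarial settings found")   # roughly 200
```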

Testing multi-agent safety

Many of our current safety test environments have the problem that, while they contain suggestively-labelled components which make sense to humans, they aren’t rich enough to give rise to the sort of deliberate misbehaviour we are worried about from AGI. To deliberately misbehave, an agent needs a theory of mind and the ability to predict the consequences of its actions. Without these, even though we can design environments in which an agent takes the action we’ve labelled as misbehaviour, that agent won’t have the semantic content which we want to check for. While I expect sufficiently advanced language models to have that semantic content, I expect that the easiest way to observe potential misbehaviour is in 3D environments.

In particular, our default test for the safety of an AI should involve putting it in novel simulated environments with other AIs it’s never interacted with before, and seeing what happens. Cooperation between those AIs—especially via the mechanism of some of them being obedient to the commands of others—would be evidence of their safety. It’d be particularly valuable to see if agents trained in very different ways, with very different patterns of behaviour, would cooperate in novel environments. After testing in simulation, we could also test by deploying agents on increasingly complex tasks in the real world.
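
One way to operationalise this is a cross-play evaluation: pair agents from different training runs in held-out environments that none of them saw during training, and record how well each pairing cooperates. The agents, environments, and evaluation function below are all hypothetical placeholders:

```python
import itertools
import random

def evaluate_pair(agent_a, agent_b, env_seed):
    """Placeholder for a real evaluation, which should return a cooperation score
    for the pair in the given environment (e.g. how often one agent follows the
    other's requests). Here it's just a deterministic random number."""
    return random.Random(hash((agent_a, agent_b, env_seed))).random()

agents = ["run1_seed0", "run2_population_play", "run3_self_play"]   # trained very differently
held_out_envs = range(5)                                            # unseen during training

for a, b in itertools.combinations(agents, 2):
    scores = [evaluate_pair(a, b, seed) for seed in held_out_envs]
    print(f"{a} vs {b}: mean cooperation {sum(scores) / len(scores):.2f}")
```

The interesting signal would come from the pattern across pairings: obedient agents should keep cooperating even with partners trained very differently from themselves.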

A second approach to testing safety might use more theoretical ideas. It’s usually difficult to formally reason about complex goals in complex environments, but we can take inspiration from evolutionary biology, which does so by analysing the incentives of organisms to help or harm each other given how related they are. By applying similar incentive analyses to artificial agents, we might be able to derive theories which we can test empirically, and then use to make predictions.
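
The canonical example of this style of analysis in biology is Hamilton’s rule, which says that a costly helping behaviour is favoured by selection when rb > c, where r is the relatedness between actor and recipient, b is the benefit to the recipient, and c is the cost to the actor. One could imagine, speculatively, searching for analogous inequalities for trained agents, relating (say) how much two agents’ reward functions overlap to how strongly they’re incentivised to help each other.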

Another relevant formalism for these situations is that of bargaining games. However, there are a few ways in which multi-agent training environments differ from standard bargaining games. Firstly, the former are almost always iterated, especially if they take place in a persistent environment. Secondly, the former feature reputational effects, which standard bargaining games lack. Thirdly, all of the agents involved are being optimised based on how well they perform, which shifts their policies over time. So I’m not too optimistic about this type of analysis.