Distinguishing claims about training vs deployment

Given the rapid progress in machine learning over the last decade in particular, I think that the core arguments about why AGI might be dangerous should be formulated primarily in terms of concepts from machine learning. One important way to do this is to distinguish between claims about training processes which produce AGIs, versus claims about AGIs themselves, which I’ll call deployment claims. I think many foundational concepts in AI safety are clarified by this distinction. In this post I outline some of them, and state new versions of the orthogonality and instrumental convergence theses which take this distinction into account.

Goal specification

The most important effect of thinking in terms of machine learning concepts is clarity about what it might mean to specify a goal. Early characterisations of how we might specify the goals of AGIs focused on agents which choose between actions on the basis of an objective function hand-coded by humans. Deep Blue is probably the most well-known example of this; AIXI can also be interpreted as doing so. But this isn’t how modern machine learning systems work. So my current default picture of how we will specify goals for AGIs is:

  • At training time, we identify a method for calculating the feedback to give to the agent, which will consist of a mix of human evaluations and automated evaluations. I’ll call this the objective function. I expect that we will use an objective function which rewards the agent for following commands given to it by humans in natural language.

  • At deployment time, we give the trained agent commands in natural language. The objective function is no longer used; hopefully the agent instead has internalised a motivation/​goal to act in ways which humans would approve of, which leads it to follow our commands sensibly and safely.
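This two-phase picture can be sketched in toy code. Everything here is invented for illustration (the weighting, the function names, the command); the point is only that the objective function exists at training time and is not consulted at deployment.

```python
# Hypothetical sketch of a training-time objective function: a weighted
# mix of a human rating and automated checks. Weights and names are
# illustrative assumptions, not a real training setup.

def objective_function(human_rating, automated_checks):
    """Toy training-time reward: a human rating in [0, 1] combined with
    the fraction of automated checks passed."""
    automated_score = sum(automated_checks) / len(automated_checks)
    return 0.7 * human_rating + 0.3 * automated_score

# During training, feedback on each episode comes from this function:
reward = objective_function(human_rating=0.9, automated_checks=[1, 1, 0])

# At deployment, the objective function is no longer consulted; the agent
# just receives natural-language commands:
command = "Summarise this document, and flag anything you're unsure about."
```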

This breakdown makes the inner alignment problem a very natural concept—it’s simply the case where the agent’s learned motivations don’t correspond to the objective function used during training.[1] It also makes ambitious approaches to alignment (in which we try to train an AI to be motivated directly by human values) less appealing: it seems strictly easier to train an agent to obey natural language commands in a common-sense way, in which case we get the benefit of continued flexible control during deployment.[2]

Orthogonality

Consider Bostrom’s orthogonality thesis, which states:

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

As stated, this is a fairly weak claim: it only talks about which minds are logically possible, rather than minds which we are likely to build. So how has this thesis been used to support claims about the likelihood of AI risk? Ben Garfinkel argues that its proponents have relied on an additional separation between the process of making a system intelligent, and the process of giving it goals—for example by talking about “loading a utility function” into a system that’s already intelligent. He calls the assumption that “the process of imbuing a system with capabilities and the process of imbuing a system with goals are orthogonal” the process orthogonality thesis.

It’s a little unclear what “orthogonal” means for processes; here I give a more precise statement. Given a process for developing an intelligent, goal-directed system, my version of the process orthogonality thesis states that:

  • The overall process involves two (possibly simultaneous) subprocesses: one which builds intelligence into the system, and one which builds goals into the system.

  • The former subprocess could vary greatly in how intelligent it makes the system, and the latter subprocess could vary greatly in which goals it specifies, without either significantly affecting the other’s performance.

Unlike the original orthogonality thesis, we should evaluate the process orthogonality thesis separately for each proposed process, rather than as a single unified claim. Which processes might be orthogonalisable in the sense specified above? Consider first a search algorithm such as Monte Carlo tree search. Roughly speaking, we can consider the “intelligence” of this algorithm to be based on the search implementation, and the “goals” of the algorithm to be based on the scores given to possible outcomes. In this case, the process orthogonality thesis seems to be true: we could, for example, flip the sign of the scores given to all outcomes, resulting in a search algorithm which is very good at finding ways to lose games.[3]
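The sign-flipping point can be made concrete with a minimal, purely illustrative search sketch: the search implementation is left untouched, and only the evaluation function changes.

```python
# Minimal illustrative search: pick the action whose outcome the
# evaluation function scores highest. The "game" is a toy lookup table.

def best_action(actions, outcome, evaluate):
    """Generic search over actions; the 'intelligence' lives in the
    search, the 'goals' live in the evaluation function."""
    return max(actions, key=lambda a: evaluate(outcome(a)))

outcomes = {"a": 3, "b": -1, "c": 7}   # toy mapping from actions to scores

# Same search, opposite goals: flipping the evaluation's sign turns a
# win-seeking search into a lose-seeking one.
win_move = best_action(list(outcomes), outcomes.get, lambda s: s)
lose_move = best_action(list(outcomes), outcomes.get, lambda s: -s)
```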

However, this no longer holds for more complex search algorithms. For example, the chess engine Deep Blue searches in a way that is guided by many task-specific heuristics built in by its designers, which would need to be changed in order for it to behave “intelligently” on a different objective.

The process orthogonality thesis seems even less plausible when applied to a typical machine learning training process, in which a system becomes more intelligent via a process of optimisation on a given dataset, towards a given objective function. In this setup, even if the agent learns to pursue the exact objective function (without any inner misalignment), we’re still limited to the space of objective functions which are capable of inducing intelligent behaviour. If the objective function specifies a very simple task, then agents trained on it will never acquire complex cognitive skills. Furthermore, both the system’s intelligence and its goals will be affected by the data source used. In particular, if the data is too limited, it will not be possible to instil some goals.

We can imagine training processes for which the process orthogonality thesis is more true, though. Here are two examples. Firstly, consider a training process which first does large-scale unsupervised training (such as autoregressive training on a big language corpus) to produce a very capable agent, and then uses supervised or reinforcement learning to specify which goals that agent should pursue. It’s an open empirical question how much the first stage of training will shape the goals of the final agent, and how much the second stage will improve its capabilities. But it seems conceivable that the agent’s capabilities are primarily shaped by the former and its goals primarily by the latter, in which case the process orthogonality thesis would be mostly true for this process.
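A rough sketch of this two-stage picture, in toy code rather than a real pipeline (the “capabilities” and “goals” here are deliberately crude stand-ins):

```python
# Toy two-stage training: a large first stage shapes capabilities, a
# small second stage shapes goals. Purely illustrative.

class ToyAgent:
    def __init__(self):
        self.capabilities = {}   # shaped by stage 1
        self.goal = None         # shaped by stage 2

def pretrain(agent, corpus):
    """Stage 1: unsupervised training builds general 'capabilities'
    (here, just word statistics over the corpus)."""
    for doc in corpus:
        for word in doc.split():
            agent.capabilities[word] = agent.capabilities.get(word, 0) + 1
    return agent

def finetune(agent, feedback):
    """Stage 2: supervised/RL feedback selects which goal to pursue,
    leaving the stage-1 capabilities untouched."""
    agent.goal = max(feedback, key=feedback.get)
    return agent

agent = pretrain(ToyAgent(), ["follow commands", "answer questions"])
agent = finetune(agent, {"obey natural-language commands": 0.9,
                         "maximise clicks": 0.2})
```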

Secondly, consider a model-based reinforcement learning agent which is able to plan ahead in a detailed way, and then chooses plans based on evaluations of their outcomes provided by a reward model. If this reward model is trained separately from the main agent, then we might just be able to “plug it in” to an already-intelligent system, making the overall process orthogonalisable. However, for the reward model to evaluate plans, it will need to be able to interpret the planner’s representations of possible outcomes, which suggests that there will be significant advantages from training them together, in which case the process would likely not be orthogonalisable.
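The “plug in a reward model” picture can be sketched as follows (the names, the toy world model, and the reward model are all invented for illustration):

```python
# Toy model-based planner: a world model predicts each plan's outcome,
# and a separately-trained reward model scores those outcomes.

def choose_plan(candidate_plans, predict_outcome, reward_model):
    """Pick the plan whose predicted outcome the reward model rates highest."""
    return max(candidate_plans, key=lambda p: reward_model(predict_outcome(p)))

# Stand-in world model: maps plans to outcome descriptions.
predict = {"help user": "task done safely",
           "seize resources": "resources acquired"}.get

# A separately-trained reward model could in principle be swapped in here,
# as long as it can interpret the planner's outcome representations.
helpful_rm = lambda outcome: 1.0 if "safely" in outcome else 0.0

chosen = choose_plan(["help user", "seize resources"], predict, helpful_rm)
```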

Suppose that the process orthogonality thesis is false for the techniques we’re likely to use to build AGI. What implications does this have for the safety of those AGIs? Not necessarily reassuring ones—it depends on how dangerous the goals that tend to arise in the most effective processes of intelligence-development will be. We could evaluate this by discussing all the potential training regimes which might produce AGI, but this would be lengthy and error-prone. Instead, I’d like to make a more general argument by re-examining another classic thesis:

Instrumental convergence

The original version of this thesis is roughly as follows:

  • Instrumental convergence thesis: a wide range of the final goals which an AGI could have will incentivise it to pursue certain convergent instrumental subgoals (such as self-preservation and acquiring resources).

However, this again only talks about the final goals which are possible, rather than the ones which are likely to arise in systems we build. How can we reason about the latter? Some have proposed thinking in terms of a simplicity prior over the set of all possible utility functions. But in modern machine learning, the utility function of an agent is not specified directly. Rather, an objective function is used to optimise the agent’s parameters, selecting for agents which score highly on that objective function. If the resulting agent is sufficiently sophisticated, it seems reasonable to describe it as “having goals”. So in order to reason about the goals such an agent might acquire, we need to think about how easily those goals can be represented in machine learning models such as neural networks. Yet we know very little about how goals are represented in neural networks, and which types of goals are more or less likely to arise.

How can we reason about possible goals in a more reliable way? One approach is to start, not by characterising all the goals an agent might have, but by characterising all the objective functions on which it might be trained. Such objective functions need to be either specifiable in code, or based on feedback from humans. However, training also depends heavily on the data source used, so we need to think more broadly: following reinforcement learning terminology, I’ll talk about environments which, in response to actions from the agent, produce observations and rewards. And so we might ask: when we train an AGI in a typical environment, can we predict that it will end up pursuing certain goals? This leads us to consider the following thesis:

  • Training goal convergence thesis: a wide range of environments in which we could train an AGI will lead to the development of goal-directed behaviour aimed towards certain convergent goals.

We can distinguish two ways in which these convergent goals might arise:

  • Training instrumental convergence: the AGI develops final goals which incentivise the pursuit of convergent instrumental subgoals.

  • Training final convergence: the AGI develops “convergent final goals”—i.e. a set of final goals which arise in agents trained across many different environments.

The first possibility is roughly analogous to the original instrumental convergence thesis. The second draws on similar ideas, but makes a distinct point. I predict that agents trained in sufficiently rich environments, on a sufficiently difficult objective function, will develop final goals of self-preservation, acquiring resources and knowledge, and gaining power. Note that this wouldn’t depend on agents themselves inferring the usefulness of these convergent final goals. Rather, it’s sufficient that the optimiser finds ways to instil these goals within the agents it trains, because they perform better with these final goals than without—plausibly for many of the same reasons that humans developed similar final goals.

This argument is subject to at least three constraints. Firstly, agents will be trained in environments which are quite different from the real world, and which therefore might incentivise very different goals. This is why I’ve left some convergent instrumental goals off the list of convergent final goals. For example, agents trained in an environment where self-improvement isn’t possible wouldn’t acquire that as a final goal. (To be clear, though, such agents can still acquire the convergent instrumental goal of self-improvement when deployed in the real world, by inferring that it would be valuable for their final goals.) However, it seems likely that environments sophisticated enough to lead to AGI will require agents to act over a sufficiently long time horizon for some influence-seeking actions to have high payoffs, in particular the ones I listed in the previous paragraph.

Secondly, convergent final goals are instilled by the optimiser rather than being a result of AGI reasoning. But search done by optimisers is local and therefore has predictable limitations. For example, reward tampering is sufficiently difficult to stumble across during training that we shouldn’t expect an optimiser to instil that trait directly into agents. Instead, an AGI which had acquired the final goal of increasing its score on the objective function might reason that reward tampering would be a useful way to do so. However, just as it was difficult for evolution to instil the final goal of “increase inclusive genetic fitness” in humans, it may also be difficult for optimisation to instil into AIs the final goal of increasing their score on the objective function; hence it’s an open question whether “doing well on the objective function” is a convergent final goal.

Thirdly, the objective function may be composed not just of easily-specified code or human feedback, but also of feedback from previously-trained AIs. Insofar as those AIs merely model human feedback, we can think of this as a way to make human feedback more scalable. But the possibility of them giving types of feedback which groups of humans couldn’t realistically reproduce makes it hard to characterise the types of environments we might train AGIs in. For now, I think it’s best to explicitly set aside this possibility when discussing training convergence.

Fragility of value

  • Original formulation: losing only a small part of our goals leads to catastrophe.

  • Fragility of value thesis (deployment): a small error in the goals of an AGI will lead it to pursue catastrophic misbehaviour.

  • Fragility of value thesis (training): a small error in the objective function used to train an AGI will lead it to pursue catastrophic misbehaviour.

In the training case, we can (in theory) quantify what a small error is: for instance, the difference between the reward actually given to the agent and the reward we’d give if we were fully informed about which rewards will lead to which outcomes.

In the deployment case, it’s much more difficult to describe what a “small error” is; we don’t really have a good way of reasoning about the distances between different goals as represented in neural networks. But if we interpret a small error as a small perturbation of the neural weights, it seems unlikely that such a perturbation would cause a fully aligned agent to produce catastrophic outcomes. I interpret this claim as roughly equivalent to the claim that there’s a “broad basin of corrigibility”.

Goodhart’s law

  • Original formulation: when a measure becomes a target, it ceases to be a good measure.

  • Goodhart’s law (deployment): when a measure becomes an agent’s goal, it ceases to be a good measure.

  • Goodhart’s law (training): when a measure becomes a training objective function, it ceases to be a good measure.

The distinction between the deployment and training versions is similar to the distinction between the two training convergence theses above. In the deployment case, an agent reasons about ways to optimise for the measure; in the training case, an agent may simply be optimised towards scoring well on that measure, without deliberately making plans to drive it to extreme values. These two failure modes have different implications for when the measure might fail, and in which ways.
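The training-side case can be illustrated with a toy example (the functions below are invented for illustration): optimising hard against a proxy that merely correlates with the true target selects exactly the points where the two come apart.

```python
# Goodhart toy: the proxy tracks the true value near typical points, but
# its optimum lies elsewhere, so optimising the proxy degrades
# performance on the true target.

true_value = lambda x: -(x - 3) ** 2        # true target peaks at x = 3
proxy = lambda x: -(x - 3) ** 2 + 2 * x     # correlated proxy, peak shifted

best_for_true = max(range(10), key=true_value)
best_for_proxy = max(range(10), key=proxy)

# The proxy's optimum scores worse on the true target than the true optimum.
gap = true_value(best_for_true) - true_value(best_for_proxy)
```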


1. These days I’m confused about why it took me so long to understand this outer/​inner alignment distinction, but I guess that’s a good lesson about hindsight bias.

2. Of course this will require the agent to internalise human values to some extent, but plausibly in a much less demanding way. Some also argue that continued flexible control is not in fact a benefit, since they’re worried about how AI will affect geopolitics. But I think that, as much as possible, we should separate the problem of AGI safety from the problem of AGI governance—that is, we should produce safety techniques which can be used by anyone, not just “the right people”.

3. It’s also true for AIXI, because its intelligence comes from simulating all possible futures, which can be used to pursue any reward function.