Tabooing 'Agent' for Prosaic Alignment

This post is an attempt to sketch a presentation of the alignment problem while tabooing words like agency, goals or optimization as core parts of the ontology.^[1] This is not a critique of frameworks which treat these topics as fundamental, in fact I end up concluding that this is likely justified. This is not a ‘new’ framework in any sense, but I am writing it down with my own emphasis in case it helps others who feel sort of uneasy about standard views on agency. Any good ideas here probably grew out of Risks From Learned Optimization or my subsequent discussions with Chris, Joar and Evan.

Epistemic State: Likely many errors both in facts and emphasis, I would be very happy to find out where they are.

Prosaic AI Alignment as a generalization problem

I think the current and near-future state of AI development is well-described by us having:

1. very little understanding of intelligence (here defined as something like “generally powerful problem solvers”), but

2. a lot of ‘dumb’ compute.

Prosaic AGI development is, in my view, about using 2 to get around 1. A very simplistic model of how to do this involves three components:

A parametrized model-space large enough that we think it contains the sorts of generalized problem solvers we want.
A search criteria which we can test for.
A search process which uses massive compute to find parameters in the model-space satisfying the search criteria.

Much of ML is about designing all these parts to make the search process as compute-efficient as possible, for instance by making everything differentiable and using gradient-descent. For the purposes of this discussion I will consider an even simpler model where the search process simply samples random models from some prior over the model-space until it finds one satisfying some (boolean) search criteria.

While we are generally ignoring computational and practical concerns it is important that the search criteria is limited—you can only check that the model acts correctly on a small fraction of the possible situations we want to use this model in. We might talk about a search criteria being feasible if it is both possible to gather the data required to specify it and reasonable to expect to actually find a model fulfilling it with the amount of compute you have. The goal then is to pick a model-space, prior and feasible search criteria such that the model does what we want in any situation it might end up in. Following Paul Christiano (in Techniques for optimizing worst case performance) we might broadly distinguish two ways that this could be false, which we will refer to as ‘generalization failures’:

Benign failure: It does something essentially random or stupid, which has little unexpected effect on the world.
Malign Failure: It generalizes ‘competently’ but not in the way we want, affecting the world in potentially catastrophic and unexpected ways.

Benign failures are seen a lot in current ML and often labelled robustness problems, where distributional shifts or adversarial examples lead to failures of generalization. These failures don’t seem very dangerous, and can usually be solved eventually through iteration and fine-tuning. This is not the case for malign generalization failure, which have far great risks. (Slightly breaking the taboo: the classic stories of training a deceptively aligned expected utility maximizer which only does what we want because it realizes it is being tested is a malign generalization failure, though in this framework this is just an example, and whether this is central or not is an empirical claim which will be explored in the second section.)

A contrived story which doesn’t rely on superintelligence but also demonstrates a malign generalization failure is:

We are searching for a good design for a robot to clean up garbage dumps, so we run a bunch of simulations until we find one which passes our selection criteria of clearing all the garbage. We happily release this robot in the nearest garbage dump, and find that it does indeed clear the garbage, but alarmingly it does this by manufacturing self-replicating garbage-eating nano-bots. These nano-both quickly consume the earth. The robot itself knows nothing other than how to construct this precise set of nano-bots from materials found in a garbage-dumb, which is impressive but not generally intelligent.

Another class of examples are things which are generally intelligent, but in a very messy way with many blind spots and weird quirks, some of which eventually lead to catastrophic outcomes. I think a better understanding of which kinds of malign generalization errors we might be missing could be potentially very important.

So how do we shape the way the model generalizes? I think the key question is understanding how the inputs to the search process (the model-space, the prior and the search criteria) affect the output, which I think is best understood as a posterior distribution in the model-space gained by conditioning on the model clearing the search criteria.

Some examples of how the inputs might relate to the outputs:

With an empty search criteria the posterior is equal to the prior and unless the prior is already very specific you should expect sampling from the posterior to give models acting randomly and getting benign failures everywhere.
If our search criteria requires unreasonable performance on some small test-set, and our prior doesn’t give a significant enough bias toward simple/general models then we should expect benign generalization failures due to ‘overfitting’.
If our search criteria, prior and model-space only focus on a limited task but are set up to correctly identify general solutions to this task then we might expect little generalization error within this task, while getting almost entirely benign errors outside the task. This seems to be where current ML systems are situated when they work well.
If we pick search criteria which require good performance on increasingly general tasks, and we make sure that the prior is increasingly weighted toward the right kind of simple/general solutions then we might expect to see less generalization failure overall in a broad domain, but we also risk malign generalization errors appearing.

Summing up, I think a reasonable definition of the prosaic AI alignment project is to prevent malign generalization error from ever happening, even as we try to eliminate benign errors. This seems difficult mostly because moving toward robust generalization and toward malign generalization seem very similar, and you need some way to differentially advantage the first. Some approaches to this include:

Design a model-space and prior which advantages the sort of ‘intended generalization’ that we want, or which are transparent enough that we can use really powerful search criteria.
Design search criteria which effectively shift the distribution to one containing mostly robustly generalizing models. Such criteria would likely involve a lot of inspection to see how the model actually works and generalizes internally. Paul Christiano’s approach of adversarial training seems like a plausibly good way to do this.
Find ways of making stronger search criteria feasible. An example of this is that the search criterion might be bottle-necked by human judgement and oversight, and amplification is a scheme which might remove this bottleneck and make more detailed search criteria more feasible.
Consider whether we can replace the idea of one big system which should generalize to any situation with many specialized systems which are allowed to have benign generalization failures in many domains. This might correspond to a more ‘comprehensive services’-style solution, as described by Eric Drexler and summarized by Rohin Shah.

Reintroducing agency

Now I will try to give an account of why the concepts of rational agency/goals/optimizers might be useful in this picture, even though they aren’t explicitly part of the problem statement nor the mentioned solutions. This is based on a hand-wavy hypothesis:

$H$ : If you have a prior and model-space which sufficiently advantages descriptive simplicity and a selection criteria which tests performance in a sufficiently general set of situations, then your posterior distribution on the model-space will contain a large measure of models internally implementing highly effective expected utility optimization for some utility function.

There are several arguments supporting this hypothesis, such as those presented by Eliezer in Sufficiently Optimized Agents Appear Coherent and the simplicity/effectiveness of simple optimization algorithms.

If $H$ is true then it provides a good reason to study and understand goals, agency and optimization as describing properties of a particular cluster of models which will play a very important role once we start using these methods to solve very general classes of problems.

As a slight aside, this also gives a framing for the much discussed mesa optimization problem in the Risks from Learned Optimization paper, which points out that there is no a priori reason to expect the utility function to be the one you might have used to grade the model as part of the selection criteria, and that most of the measure might in fact taken up by pseudo-aligned or deceptively-aligned models, which represent a particular example of malign generalization error. In fact, if $H$ is true, avoiding malign generalization errors largely comes down to avoiding misaligned mesa optimizers.

I think the world where $H$ is true is a good world, because it’s a world where we are much closer to understanding and predicting how sophisticated models generalize. If we are dealing with a model doing expected utility maximization we can ‘just’ try to understand whether we agree with its goal, and then essentially trust that it will correctly and stably generalize to almost any situation.

If you agree that understanding how an expected utility maximizer generalizes could be easier than for many other classes of minds, then studying this cluster of model-space could be useful even if $H$ is false, as long as the weaker hypothesis $H'$ still holds.

$H'$ : We will be able to find some model-space, prior and feasible selection criteria such that the posterior distribution on the model-space contains a large measure of models internally implementing highly effective expected utility maximization for some utility function.

In the world where $H'$ holds we can then restrict ourselves to this way of searching, and can thus use the kinds of methods and assumptions which we could in the world where $H$ was true.

In either of these cases I think current models of AI Alignment which treat optimizers with goals as the central problem are justified. However, I think there are reasons to believe $H$ and possibly even $H'$ might be false, which essentially come down to embedded agency and bounded rationality concerns pushing away from elegant agent frameworks. I hope to write more about this at some point in the future. I also feel generally uncomfortable resting the safety of humanity on assumptions like this, and would like a much better understanding of how generalization works in other clusters or parts of various model-spaces.

Summary

I have tried to present a version of the prosaic AI alignment project which doesn’t make important reference to the concept of agency, instead viewing it as a generalization problem where you are trying to avoid finding models which fail disastrously when presented with new situations. Agency then reappears as a potentially important cluster of the space of possible models, which under certain empirical hypotheses justifies it as the central topic, though I still wish we had more understanding of other parts of various model-spaces.

↩︎
Thanks to Evan Hubinger, Lukas Finnveden and Adele Lopez for their feedback on this post!

Tabooing ‘Agent’ for Prosaic Alignment

Prosaic AI Alignment as a generalization problem

Reintroducing agency

Summary