## Tabooing ‘Agent’ for Prosaic Alignment

This post is an attempt to sketch a presentation of the alignment problem while tabooing words like agency, goals or optimization as core parts of the ontology.[1] This is not a critique of frameworks which treat these topics as fundamental; in fact, I end up concluding that this is likely justified. This is not a ‘new’ framework in any sense, but I am writing it down with my own emphasis in case it helps others who feel sort of uneasy about standard views on agency. Any good ideas here probably grew out of Risks From Learned Optimization or my subsequent discussions with Chris, Joar and Evan.

Epistemic State: Likely many errors both in facts and emphasis; I would be very happy to find out where they are.

## Prosaic AI Alignment as a generalization problem

I think the current and near-future state of AI development is well-described by us having:

1. very little understanding of intelligence (here defined as something like “generally powerful problem solvers”), but

2. a lot of ‘dumb’ compute.

Prosaic AGI development is, in my view, about using 2 to get around 1. A very simplistic model of how to do this involves three components:

1. A parametrized model-space large enough that we think it contains the sorts of generalized problem solvers we want.

2. A search criteria which we can test for.

3. A search process which uses massive compute to find parameters in the model-space satisfying the search criteria.

Much of ML is about designing all these parts to make the search process as compute-efficient as possible, for instance by making everything differentiable and using gradient-descent. For the purposes of this discussion I will consider an even simpler model where the search process simply samples random models from some prior over the model-space until it finds one satisfying some (boolean) search criteria.
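The simplified search process described above is essentially rejection sampling. A minimal sketch follows, under assumptions that are mine rather than part of the original framing: the model-space is the set of boolean functions on 3-bit inputs, the prior is uniform, and the search criteria only checks four of the eight possible inputs. All names (`sample_prior`, `make_criteria`, `search`) are hypothetical.

```python
import random
from typing import Callable, Dict, Optional, Tuple

# Toy model-space (an assumption for illustration): a "model" is the truth
# table of a boolean function on 3-bit inputs, i.e. a tuple of 8 output bits.
Model = Tuple[int, ...]


def sample_prior(rng: random.Random) -> Model:
    """Draw one model from a uniform prior over all 2**8 truth tables."""
    return tuple(rng.randint(0, 1) for _ in range(8))


def make_criteria(checked: Dict[int, int]) -> Callable[[Model], bool]:
    """Boolean search criteria: the model must give the wanted output on
    the few inputs we are actually able to check."""
    def criteria(model: Model) -> bool:
        return all(model[x] == y for x, y in checked.items())
    return criteria


def search(criteria: Callable[[Model], bool],
           rng: random.Random,
           max_samples: int = 100_000) -> Optional[Model]:
    """Sample from the prior until a model passes, or the budget runs out."""
    for _ in range(max_samples):
        model = sample_prior(rng)
        if criteria(model):
            return model
    return None  # criteria was not feasible within this compute budget


rng = random.Random(0)
checked = {0: 0, 1: 1, 2: 1, 3: 0}  # only 4 of the 8 inputs can be tested
criteria = make_criteria(checked)
found = search(criteria, rng)
```

In this toy setup each draw passes with probability 1/16, so the search needs about 16 draws on average. The behaviour of `found` on the four unchecked inputs is left entirely to the prior.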

While we are generally ignoring computational and practical concerns it is important that the search criteria is limited—you can only check that the model acts correctly on a small fraction of the possible situations we want to use this model in. We might talk about a search criteria being feasible if it is both possible to gather the data required to specify it and reasonable to expect to actually find a model fulfilling it with the amount of compute you have. The goal then is to pick a model-space, prior and feasible search criteria such that the model does what we want in any situation it might end up in. Following Paul Christiano (in Techniques for optimizing worst case performance) we might broadly distinguish two ways that this could be false, which we will refer to as ‘generalization failures’:

Benign failure: It does something essentially random or stupid, which has little unexpected effect on the world.

Malign failure: It generalizes ‘competently’ but not in the way we want, affecting the world in potentially catastrophic and unexpected ways.

Benign failures are seen a lot in current ML and are often labelled robustness problems, where distributional shifts or adversarial examples lead to failures of generalization. These failures don’t seem very dangerous, and can usually be solved eventually through iteration and fine-tuning. This is not the case for malign generalization failures, which carry far greater risks. (Slightly breaking the taboo: the classic story of training a deceptively aligned expected utility maximizer which only does what we want because it realizes it is being tested is a malign generalization failure, though in this framework it is just one example, and whether it is central or not is an empirical claim which will be explored in the second section.)

A contrived story which doesn’t rely on superintelligence but also demonstrates a malign generalization failure is:

We are searching for a good design for a robot to clean up garbage dumps, so we run a bunch of simulations until we find one which passes our selection criteria of clearing all the garbage. We happily release this robot in the nearest garbage dump, and find that it does indeed clear the garbage, but alarmingly it does this by manufacturing self-replicating garbage-eating nano-bots. These nano-bots quickly consume the earth. The robot itself knows nothing other than how to construct this precise set of nano-bots from materials found in a garbage dump, which is impressive but not generally intelligent.

Another class of examples consists of systems which are generally intelligent, but in a very messy way with many blind spots and weird quirks, some of which eventually lead to catastrophic outcomes. I think a better understanding of which kinds of malign generalization errors we might be missing could be very important.

So how do we shape the way the model generalizes? I think the key question is understanding how the inputs to the search process (the model-space, the prior and the search criteria) affect the output, which I think is best understood as a posterior distribution in the model-space gained by conditioning on the model clearing the search criteria.
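Under the simplified sampling model, this posterior is the prior restricted to the models that pass the criteria and then renormalized. The notation below is introduced here for illustration and is not from the original framing: $M$ is the model-space, $P$ the prior, and $C$ the boolean search criteria.

```latex
P(m \mid C) \;=\; \frac{P(m)\,\mathbf{1}[C(m)]}{\sum_{m' \in M} P(m')\,\mathbf{1}[C(m')]}
```

The denominator is the prior probability that a single random draw passes the criteria. Its reciprocal is the expected number of draws the search needs, which is one way to read the compute half of the requirement that the criteria be feasible.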

Some examples of how the inputs might relate to the outputs:

With an empty search criteria the posterior is equal to the prior and unless the prior is already very specific you should expect sampling from the posterior to give models acting randomly and getting benign failures everywhere.

If our search criteria requires unreasonable performance on some small test-set, and our prior doesn’t give a significant enough bias toward simple/general models then we should expect benign generalization failures due to ‘overfitting’.

If our search criteria, prior and model-space only focus on a limited task but are set up to correctly identify general solutions to this task then we might expect little generalization error within this task, while getting almost entirely benign errors outside the task. This seems to be where current ML systems are situated when they work well.

If we pick search criteria which require good performance on increasingly general tasks, and we make sure that the prior is increasingly weighted toward the right kind of simple/general solutions then we might expect to see less generalization failure overall in a broad domain, but we also risk malign generalization errors appearing.
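The first two examples can be checked exactly on a model-space small enough to enumerate. The sketch below is a toy under my own assumptions, not a claim about real ML systems: models are truth tables of boolean functions on 3-bit inputs, the ‘intended’ behaviour is a hypothetical `TARGET` that outputs the lowest input bit, and ‘simplicity’ is arbitrarily taken to mean depending on few input bits. In this toy every failure is just a wrong output bit, so it cannot distinguish benign from malign failures; it only shows how the prior decides what happens on unchecked inputs.

```python
from fractions import Fraction
from itertools import product

# Toy model-space (an assumption for illustration): every boolean function
# on 3-bit inputs, stored as its truth table of 8 output bits.
MODELS = list(product([0, 1], repeat=8))

# Hypothetical "intended" behaviour: output the lowest input bit.
TARGET = tuple(x & 1 for x in range(8))


def normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def n_relevant_bits(model):
    """How many of the 3 input bits the truth table actually depends on."""
    return sum(
        any(model[x] != model[x ^ (1 << i)] for x in range(8))
        for i in range(3)
    )


# Two priors: a uniform one, and one that (by assumption) treats a model as
# simpler the fewer input bits it depends on.
uniform_prior = normalize({m: Fraction(1) for m in MODELS})
simplicity_prior = normalize(
    {m: Fraction(1, 4 ** n_relevant_bits(m)) for m in MODELS}
)


def posterior(prior, criteria):
    """Condition the prior on the model passing the boolean search criteria."""
    passing = {m: p for m, p in prior.items() if criteria(m)}
    if not passing:
        raise ValueError("no model in the model-space passes the criteria")
    return normalize(passing)


def agrees_on(inputs):
    """Search criteria: match TARGET on the inputs we are able to check."""
    inputs = list(inputs)
    return lambda m: all(m[x] == TARGET[x] for x in inputs)


# Empty criteria: the posterior is just the prior.
assert posterior(uniform_prior, lambda m: True) == uniform_prior

# Check only inputs 0..3, then ask how much posterior mass also behaves
# as intended on the unchecked inputs 4..7.
checked = agrees_on(range(4))
mass_uniform = posterior(uniform_prior, checked)[TARGET]    # Fraction(1, 16)
mass_simple = posterior(simplicity_prior, checked)[TARGET]  # Fraction(2, 5)
```

With the uniform prior, the 16 models that pass are equally likely and only one of them is the intended function, which is the ‘overfitting’ case. The simplicity-weighted prior moves the mass on the intended function from 1/16 to 2/5 without changing the criteria at all.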

Summing up, I think a reasonable definition of the prosaic AI alignment project is to prevent malign generalization error from ever happening, even as we try to eliminate benign errors. This seems difficult mostly because moving toward robust generalization and toward malign generalization seem very similar, and you need some way to differentially advantage the first. Some approaches to this include:

Design a model-space and prior which advantage the sort of ‘intended generalization’ that we want, or which are transparent enough that we can use really powerful search criteria.

Design search criteria which effectively shift the distribution to one containing mostly robustly generalizing models. Such criteria would likely involve a lot of inspection to see how the model actually works and generalizes internally. Paul Christiano’s approach of adversarial training seems like a plausibly good way to do this.

Find ways of making stronger search criteria feasible. For example, the search criterion might be bottlenecked by human judgement and oversight, and amplification is a scheme which might remove this bottleneck and make more detailed search criteria feasible.

Consider whether we can replace the idea of one big system which should generalize to any situation with many specialized systems which are allowed to have benign generalization failures in many domains. This might correspond to a more ‘comprehensive services’-style solution, as described by Eric Drexler and summarized by Rohin Shah.

## Reintroducing agency

Now I will try to give an account of why the concepts of rational agency/goals/optimizers might be useful in this picture, even though they aren’t explicitly part of either the problem statement or the solutions mentioned above. This is based on a hand-wavy hypothesis:

H: If you have a prior and model-space which sufficiently advantages descriptive simplicity and a selection criteria which tests performance in a sufficiently general set of situations, then your posterior distribution on the model-space will contain a large measure of models internally implementing highly effective expected utility optimization for some utility function.

There are several arguments supporting this hypothesis, such as those presented by Eliezer in Sufficiently Optimized Agents Appear Coherent and the simplicity/effectiveness of simple optimization algorithms.

If H is true then it provides a good reason to study and understand goals, agency and optimization as describing properties of a particular cluster of models which will play a very important role once we start using these methods to solve very general classes of problems.

As a slight aside, this also gives a framing for the much-discussed mesa optimization problem in the Risks from Learned Optimization paper, which points out that there is no a priori reason to expect the utility function to be the one you might have used to grade the model as part of the selection criteria, and that most of the measure might in fact be taken up by pseudo-aligned or deceptively-aligned models, which represent a particular example of malign generalization error. In fact, if H is true, avoiding malign generalization errors largely comes down to avoiding misaligned mesa optimizers.

I think the world where H is true is a good world, because it’s a world where we are much closer to understanding and predicting how sophisticated models generalize. If we are dealing with a model doing expected utility maximization we can ‘just’ try to understand whether we agree with its goal, and then essentially trust that it will correctly and stably generalize to almost any situation.

If you agree that understanding how an expected utility maximizer generalizes could be easier than for many other classes of minds, then studying this cluster of model-space could be useful even if H is false, as long as the weaker hypothesis H' still holds.

H': We will be able to find some model-space, prior and feasible selection criteria such that the posterior distribution on the model-space contains a large measure of models internally implementing highly effective expected utility maximization for some utility function.

In the world where H' holds we can then restrict ourselves to this way of searching, and can thus use the kinds of methods and assumptions which we could in the world where H was true.

In either of these cases I think current models of AI Alignment which treat optimizers with goals as the central problem are justified. However, I think there are reasons to believe H and possibly even H' might be false, which essentially come down to embedded agency and bounded rationality concerns pushing away from elegant agent frameworks. I hope to write more about this at some point in the future. I also feel generally uncomfortable resting the safety of humanity on assumptions like this, and would like a much better understanding of how generalization works in other clusters or parts of various model-spaces.

## Summary

I have tried to present a version of the prosaic AI alignment project which doesn’t make important reference to the concept of agency, instead viewing it as a generalization problem where you are trying to avoid finding models which fail disastrously when presented with new situations. Agency then reappears as a potentially important cluster of the space of possible models, which under certain empirical hypotheses justifies it as the central topic, though I still wish we had more understanding of other parts of various model-spaces.

[1] Thanks to Evan Hubinger, Lukas Finnveden and Adele Lopez for their feedback on this post!

This explanation seems clear and helpful to me. I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those). For example you cite Paul Christiano a number of times, but I don’t think he intends to end up with a model “implementing highly effective expected utility maximization for some utility function”.

> If we are dealing with a model doing expected utility maximization we can ‘just’ try to understand whether we agree with its goal, and then essentially trust that it will correctly and stably generalize to almost any situation.

This seems too optimistic/trusting. See Ontology identification problem, Modeling distant superintelligences, and more recently The “Commitment Races” problem.

> I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those).

I’m quite confused about what a non-agentic approach actually looks like, and I agree that extending this to give a proper account would be really interesting. A possible argument for actively avoiding ‘agentic’ models from this framework is:

1. Models which generalize very competently also seem more likely to have malign failures, so we might want to avoid them.

2. If we believe H then things which generalize very competently are likely to have agent-like internal architecture.

3. Having a selection criteria or model-space/prior which actively pushes away from such agent-like architectures could then help push away from things which generalize too broadly.

I think my main problem with this argument is that step 3 might make step 2 invalid: it might be that if you actively punish agent-like architecture in your search then you will break the conditions that made ‘too broad generalization’ imply ‘agent-like architecture’, and thus end up with things that still generalize very broadly (with all the downsides of this) but just look a lot weirder.

Thanks for the links, I definitely agree that I was drastically oversimplifying this problem. I still think this task might be much simpler than the task of trying to understand the generalization of some strange model whose internal working we don’t even have a vocabulary to describe.

(black) hat tip to johnswentworth for the notion that the choice of boundary for the agent is arbitrary in the sense that you can think of a thermostat optimizing the environment or think of the environment as optimizing the thermostat. Collapsing sensor/control duality for at least some types of feedback circuits.

These sorts of problems are what caused me to want a presentation which didn’t assume well-defined agents and boundaries in the ontology, but I’m not sure how it applies to the above—I am not looking for optimization as a behavioral pattern but as a concrete type of computation, which involves storing world-models and goals and doing active search for actions which further the goals. Neither a thermostat nor the world outside seem to do this from what I can see? I think I’m likely missing your point.

Motivating example: consider a primitive bacterium with a thermostat, a light sensor, and a salinity detector, each of which has functional mappings to some movement pattern. You could say this system has a 3 dimensional map and appears to search over it.

In light of this exchange, it seems like it would be interesting to analyze how much arguments for problematic properties of superintelligent utility-maximizing agents (like instrumental convergence) actually generalize to more general well-generalizing systems.

I endorse this. I like the framing, and it’s very much in line with how I think about the problem. One point I’d make is: I’d replace the word “model” with “algorithm”, to be even more agnostic. “Model” seems for many people already to carry an implicit intuitive interpretation of what the learned algorithm is doing, namely “trying to faithfully represent the problem”, or something similar.

> I think the world where H is true is a good world, because it’s a world where we are much closer to understanding and predicting how sophisticated models generalize.

This seemed like a really surprising sentence to me. If the model is an agent, doesn’t that pull in all the classic concerns related to treacherous turns and so on? Whereas a non-agent probably won’t have an incentive to deceive you?

Even if the model is an agent, you still need to be able to understand its goals based on their internal representation, which could mean, for example, understanding what a deep neural network is doing. That doesn’t appear to be much easier than the original task of “understand what a model, for example a deep neural network, is doing”.


Strongly agree with this, I think this seems very important.
