# Selection Theorems: A Program For Understanding Agents

What’s the type signature of an agent?

For instance, what kind-of-thing is a “goal”? What data structures can represent “goals”? Utility functions are a common choice among theorists, but they don’t seem quite right. And what are the inputs to “goals”? Even when using utility functions, different models use different inputs—Coherence Theorems imply that utilities take in predefined “bet outcomes”, whereas AI researchers often define utilities over “world states” or “world state trajectories”, and human goals seem to be over latent variables in humans’ world models.

And that’s just goals. What about “world models”? Or “agents” in general? What data structures can represent these things, how do they interface with each other and the world, and how do they embed in their low-level world? These are all questions about the type signatures of agents.

One general strategy for answering these sorts of questions is to look for what I’ll call Selection Theorems. Roughly speaking, **a Selection Theorem tells us something about what agent type signatures will be selected for (by e.g. natural selection or ML training or economic profitability) in some broad class of environments**. In inner/outer agency terms, it tells us what kind of inner agents will be selected by outer optimization processes.

We already have many Selection Theorems: Coherence and Dutch Book theorems, Good Regulator and Gooder Regulator, the Kelly Criterion, etc. These theorems generally seem to point in a similar direction—suggesting deep unifying principles exist—but they have various holes and don’t answer all the questions we want. We need better Selection Theorems if they are to be a foundation for understanding human values, inner agents, value drift, and other core issues of AI alignment.

**The quest for better Selection Theorems has a lot of “surface area”**—lots of different angles for different researchers to make progress, within a unified framework, but without redundancy. It also requires *relatively* little ramp-up; I don’t think someone needs to read the entire giant corpus of work on alignment to contribute useful new Selection Theorems. At the same time, **better Selection Theorems directly tackle the core conceptual problems of alignment and agency**; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment. Overall, I think they’re a good angle for people who want to make useful progress on the theory of alignment and agency, and have strong theoretical/conceptual skills.

Outline of this post:

- More detail on what “type signatures” and “Selection Theorems” are
- Examples of existing Selection Theorems and what they prove (or assume) about agent type signatures
- Aspects which I expect/want from future Selection Theorems
- How to work on Selection Theorems

## What’s A Type Signature Of An Agent?

We’ll view the “type signature of an agent” as an answer to three main questions:

Representation: What “data structure” represents the agent—i.e. what are its high-level components, and how can they be represented?

Interfaces: What are the “inputs” and “outputs” between the components—i.e. how do they interface with each other and with the environment?

Embedding: How does the abstract “data structure” representation relate to the low-level system in which the agent is implemented?

A selection theorem typically assumes some parts of the type signature (often implicitly), and derives others.

For example, coherence theorems show that any non-dominated strategy is equivalent to maximization of Bayesian expected utility.

Representation: utility function and probability distribution.

Interfaces: both the utility function and distribution take in “bet outcomes”, assumed to be specified as part of the environment. The outputs of the agent are “actions” which maximize expected utility under the distribution; the inputs are “observations” which update the distribution via Bayes’ Rule.

Embedding: “agent” must interact with “environment” only via the specified “bets”. Utility function and distribution relate to low-level agent implementation via behavioral equivalence.
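To make that type signature concrete, here is a minimal Python sketch of an agent with exactly those parts: a utility function and a probability distribution as the representation, Bayes’ Rule on the input side, and expected-utility maximization on the output side. All names (`ExpectedUtilityAgent`, `observe`, `act`) and the toy bets are illustrative assumptions rather than anything the coherence theorems specify. In particular, splitting a bet into “events” and “prizes” is a simplification chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical names throughout; none of this comes from the theorems themselves.
Event = str    # which way a bet resolves, e.g. "rain" or "sun"
Action = str   # label of a bet the agent can take
Prize = float  # what a bet pays out under a given event


@dataclass
class ExpectedUtilityAgent:
    # Representation: a utility function plus a probability distribution.
    utility: Callable[[Prize], float]
    belief: Dict[Event, float]

    def observe(self, likelihood: Dict[Event, float]) -> None:
        """Input interface: Bayes' Rule, given P(observation | event)."""
        unnormalised = {e: p * likelihood[e] for e, p in self.belief.items()}
        total = sum(unnormalised.values())
        self.belief = {e: p / total for e, p in unnormalised.items()}

    def expected_utility(self, bet: Dict[Event, Prize]) -> float:
        return sum(p * self.utility(bet[e]) for e, p in self.belief.items())

    def act(self, bets: Dict[Action, Dict[Event, Prize]]) -> Action:
        """Output interface: choose the bet with maximal expected utility."""
        return max(bets, key=lambda a: self.expected_utility(bets[a]))


agent = ExpectedUtilityAgent(utility=lambda x: x, belief={"rain": 0.5, "sun": 0.5})
bets = {
    "abstain": {"rain": 0.0, "sun": 0.0},
    "bet_on_rain": {"rain": 10.0, "sun": -10.0},
    "bet_on_sun": {"rain": -10.0, "sun": 10.0},
}
agent.observe({"rain": 0.8, "sun": 0.2})  # an observation that is likelier before rain
choice = agent.act(bets)
```

Note how much the sketch has to assume up front: the set of events, the menu of bets, and the clean separation between agent and environment are all handed in from outside.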

Coherence theorems fall short of what we ultimately want in a lot of ways: neither the assumptions nor the type signature are quite the right form for real-world agents. (More on that later.) But they’re a good illustration of what a selection theorem is, and how it tells us about the type signature of agents.

Here are some examples of “type signature” questions for specific aspects of agents:

- World models
  - Does the agent have a world model or models?
  - What data structure can represent an agent’s world model? (Probability distributions are the most common choice.)
  - How does the agent’s world model correspond to the world? (For instance, which physical things do the random variables in a probability distribution correspond to, if any?)
  - What’s the relationship between the abstract “world model” and the physical stuff from which the world model is built?
- Goals
  - Does the agent have a goal or goals?
  - What data structure can represent an agent’s goal? (Utility functions are the most common choice.)
  - How does the goal correspond to the world, especially if it’s evaluated within the world model?
- Agents
  - Does the agent have well-defined goals or world models or other components?
  - Does the agent perform search/optimization within the world model, or in the world directly?
  - What are the agent’s “inputs” and “outputs”, e.g. actions and observations?
  - Does agent-like behavior imply agent-like internal architecture?
  - What’s the relationship between the abstract “agent” and the physical stuff from which the agent is built?

## What’s A Selection Theorem?

A Selection Theorem tells us something about what agent type signatures will be selected for in some broad class of environments. Two important points:

The theorem need not directly talk about selection—e.g. it could state some general property of optima, of “broad” optima, of “most” optima, or of optima under a particular kind of selection pressure (like natural selection or financial profitability).

Any given theorem need not address *every* question about agent type signatures; it just needs to tell us *something* about agent type signatures.

For instance, the subagents argument says that, when our “agents” have internal state in a coherence-theorem-like setup, the “goals” will be pareto optimality over multiple utilities, rather than optimality of a single utility function. This says very little about embeddedness or world models or internal architecture; it addresses only one narrow aspect of agent type signatures. And, like the coherence theorems, it doesn’t *directly* talk about selection; it just says that any strategy which doesn’t fit the pareto-optimal form is strictly dominated by some other strategy (and therefore we’d expect that other strategy to be selected, all else equal).
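As a rough illustration of what “Pareto optimality over multiple utilities” means operationally, the following Python sketch filters a set of strategies down to those not dominated under several utility functions at once. The strategy names and payoff numbers are made up for the example, and `dominates` implements ordinary Pareto dominance (at least as good on every utility, strictly better on at least one), which is an assumption about the intended notion of dominance.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if payoff vector a is at least as good as b on every utility
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_optimal(payoffs: Dict[str, Sequence[float]]) -> List[str]:
    """Return the strategies that no other strategy dominates."""
    return [
        name
        for name, vec in payoffs.items()
        if not any(dominates(other, vec) for o, other in payoffs.items() if o != name)
    ]


# Hypothetical strategies, each scored by two subagents' utility functions.
payoffs = {
    "A": (3.0, 1.0),
    "B": (1.0, 3.0),
    "C": (2.0, 2.0),
    "D": (1.0, 1.0),  # worse than C on both utilities
}
survivors = pareto_optimal(payoffs)
```

Here several strategies survive rather than a single maximizer, which is the qualitative difference from the single-utility case: only the dominated strategy is ruled out.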

Most Selection Theorems, in the short-to-medium term, will probably be like that: they’ll each address just one particular aspect of agent type signatures. That’s fine. As long as the assumptions are general enough and realistic enough, we can use lots of theorems together to narrow down the space of possible types.

Eventually, I do expect that most of the core ideas of Selection Theorems will be unified into a small number of Fundamental Theorems of Agency—perhaps even a single theorem. But that’s not a necessary assumption for the usefulness of this program, and regardless, I expect a lot of progress on theorems addressing specific aspects of agent type signatures before then.

## How to work on Selection Theorems

### New Theorems

The most open-ended way to work on the Selection Theorems program is, of course, to come up with new Selection Theorems.

If you’re relatively new to this sort of work and wondering how one comes up with useful new theorems, here are some possible starting points:

Study examples of evolved agents to see what kind of type signatures they develop under what conditions. I recommend coming from as many different angles as possible—i.e. ML, economics, and biology—to build intuitions.

Once you have some intuition or empirical fact from some specific examples or a particular field, try to expand it to more general agents and selection processes.

Bio example: sessile (i.e. immobile) organisms don’t usually cephalize (i.e. develop brains). Can we turn this into a general theorem about agents?

Pick a frame, and try to apply it. For example, I’ve been getting a surprising amount of mileage out of the comparative advantage frame lately; it turns out to give some neat variants of the Coherence Theorems.

Start from agent type signatures—what type signature makes sense intuitively, based on how humans work? What selection processes would give rise to that type signature, and can you prove it?

Start from selection processes. What type signatures seem intuitively likely to be selected? Can you prove it?

Also, take a look at What’s So Bad About Ad-Hoc Mathematical Definitions? to help build some useful aesthetic intuitions.

### Incremental Work

This is work which starts from one or more existing selection theorem(s), and improves on them somehow.

Some starting points with examples where I’ve personally found them useful before:

Take an existing selection theorem, try to apply it to some real-world agency system or some system under selection pressure, see what goes wrong, and fix it. For instance, the subagents idea started from trying (and failing) to apply Coherence Theorems to financial markets.

Take some existing theorem and strengthen it. For instance, the original logical inductors piece showed the existence of *a* logical inductor implemented as a market; I extended that to show that *any* logical inductor is behaviorally equivalent to a market.

Take some existing theorem with a giant gaping hole and fix the hole. Gooder Regulator was basically that.

A couple other approaches for which I don’t have a great example from my own work, but which I expect to be similarly fruitful:

Take two existing selection theorems and unify them.

Take a selection theorem mainly designed for a particular setting (e.g. financial/betting markets) and back out the exact requirements needed to apply it in more general settings.

Empirical verification, i.e. check that an existing theorem works as expected on some real system. This is most useful when it fails, but success still helps us be sure our theorems aren’t missing anything, and the process of empirical testing forces us to better understand the theorems and their assumptions.

## Up Next

I currently have two follow-up posts planned:

One post with some existing Selection Theorems, which is already written and should go up later this week. [Edit: post is up.]

One post on agent type signatures for which I expect/want Selection Theorems—in other words, conjectures. This one is not yet written, and I expect it will go up early next week. [Edit: post is up.]

These are explicitly intended to help people come up with ways to contribute to the Selection Theorems program.


## Epistemic Status

I am an aspiring selection theorist and I have thoughts.

## Why Selection Theorems?

Learning about selection theorems was very exciting. It’s one of those concepts that felt so obviously right. A missing component in my alignment ontology that just clicked and made everything stronger.

## Selection Theorems as a Compelling Agent Foundations Paradigm

There are many reasons to be sympathetic to agent foundations style safety research as it most directly engages the hard problems/core confusions of alignment/safety. However, one concern with agent foundations research is that we might build sky high abstraction ladders that grow increasingly disconnected from reality. Abstractions that don’t quite describe the AI systems we deal with in practice.

I think that in presenting this post, Wentworth successfully sidestepped the problem. He presented an intuitive story for why the Selection Theorems paradigm would be fruitful; it’s general enough to describe many paradigms of AI system development, yet concrete enough to say nontrivial/interesting things about the properties of AI systems (including properties that bear on their safety). Wentworth presents a few examples of extant selection theorems (most notably the coherence theorems) and later argues that selection theorems have a lot of research “surface area” and new researchers could be onboarded (relatively) quickly. He also outlined concrete steps people interested in selection theorems could take to contribute to the program.

Overall, I found this presentation of the case for selection theorems research convincing. I think that selection theorems provide a solid framework with which to formulate (and prove) safety desiderata/guarantees for AI systems that are robust to arbitrary capability amplification. Furthermore, selection theorems seem to be very robust to paradigm shifts in the development of artificial intelligence. That is, regardless of what changes in architecture or training methodology subsequent paradigms may bring, I expect selection theoretic results to still apply[1].

I currently consider selection theorems to be the most promising agent foundations flavoured research paradigm.

## Digression: Asymptotic Analysis and Complexity Theory

My preferred analogy for selection theorems is asymptotic complexity in computer science. Using asymptotic analysis we can make highly non-trivial statements about the performance of particular (or arbitrary!) algorithms that abstract away the underlying architecture, hardware, and other implementation details. As long as the implementation of the algorithm is amenable to our (very general) models of computation, the asymptotic/complexity theoretic guarantee will generally still apply.

For example, we have a very robust proof that no comparison-based sorting algorithm can attain a worst-case time complexity better than $O(n \log n)$: every such algorithm needs $\Omega(n \log n)$ comparisons in the worst case. (This happens to be a very tight lower bound, as extant algorithms [e.g. mergesort] attain it.) The model behind the lower bound of comparison sorting is very minimal and general:

- Data operations
  - Comparing two elements
  - Moving elements (copying or swapping)
- Cost: number of such operations

Any algorithm that performs sorting by directly comparing elements to determine their order conforms to this model. The $\Omega(n \log n)$ lower bound obtains because we can model the execution of the sorting algorithm by a binary decision tree:

Nodes: individual comparisons between elements

Edges: different outputs of comparisons (≤ and >)

Leaf nodes: unique permutation of the input array that corresponds to that particular root to leaf path of the tree

The number of comparisons the sorting algorithm performs on any given input permutation is the number of edges between the root node and the corresponding leaf. The worst case running time of the algorithm is therefore given by the height of the tree. Because there are $n!$ possible permutations of the input array, the tree must have at least $n!$ leaves, and a binary tree with that many leaves has height at least $\lg(n!) = n \lg n - n \lg e + O(\log n)$, which is in $\Theta(n \log n)$.
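As a sanity check on the decision-tree argument, the sketch below computes the bound $\lceil \lg(n!) \rceil$ exactly and compares it with the worst-case number of comparisons made by a plain top-down merge sort, found by brute force over all permutations of small inputs. The function names are my own, and merge sort is just one convenient comparison sort to test against; nothing here depends on that choice.

```python
import math
from itertools import permutations


def comparison_lower_bound(n: int) -> int:
    """ceil(log2(n!)): minimum height of a binary tree with n! leaves.
    Uses integer arithmetic to avoid floating-point rounding."""
    return (math.factorial(n) - 1).bit_length()


def merge_sort_count(items):
    """Top-down merge sort. Returns (sorted_list, number_of_comparisons)."""
    if len(items) <= 1:
        return list(items), 0
    mid = len(items) // 2
    left, left_count = merge_sort_count(items[:mid])
    right, right_count = merge_sort_count(items[mid:])
    merged = []
    count = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        count += 1  # one element-to-element comparison
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, count


def worst_case_comparisons(n: int) -> int:
    """Worst case over every permutation of n distinct elements."""
    return max(merge_sort_count(p)[1] for p in permutations(range(n)))


for n in range(1, 7):
    print(n, comparison_lower_bound(n), worst_case_comparisons(n))
```

For every input size tried, the measured worst case sits at or above the bound, as the theorem requires. The brute-force check is only feasible for very small `n`, since it enumerates all `n!` permutations.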

I reiterate that this is a very powerful result. Here we’ve set up very minimal assumptions about our model (comparisons are made between pairs of elements to determine order, the algorithm can copy or swap elements) and we’ve obtained a ridiculously strong impossibility result[2].

## Selection Theorems as a Complexity Theoretic Analogue

Selection theorems present a minimal model of an intelligent system as an agent situated in an environment. The agents are assumed to be the product of some optimisation process selecting for performance on a given metric (e.g. inexploitability in multi-agent environments, for the coherence theorems).

The exact optimisation process performing the selection is abstracted away [only the performance metric/objective function(s) of optimisation matters], and the hope is to do the same for the environment [that is, selection theoretic results should apply to a broad class of environments (e.g. for the coherence theorems, the only constraint imposed on the environment is that it contains other agents)].

Using the above model, selection theorems try to derive[3] agent “type signatures” (the representation [data structures], interfaces [inputs & outputs] and embedding [in an underlying physical (or other low level) system] of the agent and specific agent aspects [world models, goals, etc.]). It’s through these type signatures that safety relevant properties of agents can be concretely formulated (and hopefully proven).

For example, the proposed anti-naturalness of corrigibility to expected utility maximisation can be seen as an “impossibility result”[4] of a safety property (corrigibility) derived from a selection theorem (the coherence theorems).

While this is a negative result, I expect no fundamental difficulty in obtaining positive selection theoretic guarantees of safety properties.

I see the promise of selection theorems as doing for AI safety what complexity theory does for algorithm performance.

## The Power of Selection Theorems

I expect that we will be able to provide selection theoretic guarantees of nontrivial safety properties/desiderata.

In particular, I think selection theorems naturally lend themselves to proving properties that are selected for/emerge in the limit[5] of optimisation for particular objectives (convergence theorems?). I find the potential of asymptotic guarantees exhilarating.

Properties proven to emerge in the limit become more robust with (increasing) scale. I think that’s an incredibly powerful result. Furthermore, asymptotic complexity analysis suggests that it’s often easier to make statements about what holds in the limit than about what holds at particular intermediate levels. (We can very easily talk about how the performance of two algorithms compares on a particular problem as input size tends towards infinity, without considering implementation details or underlying hardware and ignoring all constant factors. To talk about the performance of two algorithms on input sets of a particular fixed size, we’d need to consider all the aforementioned details.)

The combination of:

“Properties that are selected for in the limit become more robust with (increasing) scale” and

“It is much easier to describe the limit of a process than particular intermediate states”

is immensely powerful[6]. It makes selection theorems a hugely compelling (perhaps the one I find most personally compelling) AI safety research paradigm.

## Reservations

While I am quite enamoured with Wentworth’s selection theorems, I find myself somewhat dissatisfied. As Wentworth framed it, I think they are a bit off.

A major limitation of the coherence theorems is that they constrain agents to an archetype that does not necessarily describe real agents (or other intelligent systems) well. In particular, the coherence theorems assume agent preferences are:

Static (do not change with time)

Path independent (the exact course of action taken to get somewhere does not affect the agent’s preferences; alternatively, agents are assumed not to have internal states that factor into their preferences)

Complete (for any two options, the agent prefers one of them or is indifferent. It doesn’t permit a notion of “incomplete preferences”)

These assumptions turn out not to be very realistic: they don’t describe real world agents (e.g. humans) or some (relatively) inexploitable systems (e.g. financial markets) well.

The failure of coherence theorems to carve reality at the joints is a valuable lesson re: choosing the right preconditions for our theorems (if our preconditions are too restrictive/strong, they might describe systems that don’t matter in the real world [“spherical cows”]). And it’s a mistake I worry that the paradigm of “agent type signatures” might be making.

To be precise, I am quite unconvinced that “agent” is the “true name” of the relevant intelligent systems. There are powerful artifacts (e.g. the base versions of large language models) that do not match the agent archetype as traditionally conceived. I do not know that the artifacts that ultimately matter would necessarily conform to the agent archetype[7]. Theorems that are exclusively about the properties of agents may end up not being very applicable to important systems of interest (if e.g. the first AGIs are created by a [mostly] self-supervised training process).

Agent selection theorems are IMO ultimately too restrictive (their preconditions are too strong to describe all intelligent systems of interest/they implicitly preclude from analysis some intelligent systems we’ll be interested in), and the selection theorem agenda should be generalised to optimisation processes and the kind of constructs they select for.

That is, regardless of paradigm, intelligent systems (e.g. humans, trained ML models and expert systems) are the products of optimisation processes (e.g. natural selection, stochastic gradient descent, and human design[8] respectively).

So, a theory based solely on optimisation processes seems general enough to describe all intelligent systems of interest (while being targeted enough to say nontrivial/interesting things about such systems) and minimal (we can’t relax the preconditions anymore while still obtaining nontrivial results about intelligent systems).

The agent type signature paradigm is insufficiently general.

In the remainder of this post, I would like to slightly adjust the concept of selection theorems to better reflect what I think they should be[9].

## Types of Selection Theorems

There are two broad classes of theorems that seem valuable:

### Constructor Theorems

Fundamentally, they ask the question:

For a given (collection of) objective(s), and underlying configuration space, what type[10] of artifacts are produced by constructive optimisation processes (e.g. natural selection, stochastic gradient descent and human design) that select for performance on said objective(s)?

The aforementioned “convergence theorems” would be a particular kind of constructor theorem.

### Artifact Theorems

Artifact theorems are the dual of constructor theorems. If constructor theorems seek to identify the artifact type produced by a particular constructive optimisation process, then artifact theorems seek to identify the constructive optimisation process that produced particular artifacts (the human brain, trained ML models and the quicksort algorithm respectively).

That is:

- Can we also specify a type for the objectives? I.e. describe the class of problems/domains/tasks the objectives belong to.
  - What properties do its members have?
  - What is its parent type?
  - What are the interesting child types?
- Which properties are necessary to select for that artifact type?
- Which properties are sufficient?

I suspect that e.g. investigating general intelligence artifact theorems would be a promising research agenda for robust safety of arbitrarily capable general systems.

[1] Provided we use sufficiently general agent/system models as the foundation for our selection theoretic results.

[2] I should point out that this impossibility result is somewhat atypical; for many interesting problems we don’t regularly obtain (non-trivial [e.g. the size of the input or output]) tight lower bounds on complexity.

[3] Usually, some parts of the type signatures are assumed (implicitly or explicitly) by the theorem.

[4] Jessica Taylor told me that she thinks the anti-naturalness of corrigibility is more of a “research intuition” than a theorem.

[5] I’m under the impression that it was when thinking about what emerges in the limit that I first drew the relationship between selection theorems and complexity theory. However, this may be a false memory (or otherwise not a particularly reliable recollection of events).

[6] It feels almost too good to be true, like we’re cheating in the mileage we get out of selection theorems.

[7] While any physical system can be constituted as an agent situated in an environment, the agent archetype is not illuminating for all of them. Viewing a calculator as an agent does not really offer any missing insight into the operations of the calculator. It does not allow you to better predict its behaviour.

[8] Insomuch as one accepts that design is a kind of optimisation process. And I would insist that you should, but I’ve not gotten around to writing up my thoughts on what exactly qualifies as an optimisation process in a form that I would endorse. Eliezer’s “Measuring Optimisation Power” is a fine enough first approximation.

[9] The quickest gloss is that:

- “Agent” should be replaced with “artifact” (a general term for any object that is the product of an optimisation process).

Some sample artifacts and the optimisation process that produced them:

* The human brain: natural selection

* Trained ML models: stochastic gradient descent

* 1.41421356237: Newton’s method (approximation for the square root of 2)

* The quicksort algorithm: human design

[10] Among other things, a type should specify a set of properties that all members of the type share. If those properties are necessary and sufficient for an artifact to belong to a particular type, the type could simply be identified with its collection of properties.

Types can exist at different levels of abstraction (allowing them to specify artifact properties at different levels of detail).

An artifact can belong to multiple types (e.g. I might belong to the types: “human”, “male”, “Nigerian”).

Rather than identifying the optimisation process in detail, only the objective function of the optimisation process is considered. Any other particulars/specifics of the optimisation process are abstracted away (the same way implementation details are abstracted away in asymptotic analysis).

The motivation is that I think that any two optimisation processes with the same objective functions on the same configuration space with the same “optimisation power” are identical for our purposes. And for convergence theorems, even the optimisation power is abstracted away.

Quick review of the review: this could indeed make a very good top-level post.

@Raemon: here’s the review I mentioned wanting to write.

I’m wiped for the current writing session, but may extend it further later in the day over the coming week?

[When does the review session end?]

I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems.