Intuitively, the real world seems to contain agenty systems (e.g. humans), non-agenty systems (e.g. rocks), and ambiguous cases which display some agent-like behavior sometimes (bacteria, neural nets, financial markets, thermostats, etc). There’s a vague idea that agenty systems pursue consistent goals in a wide variety of environments, and that various characteristics are necessary for this flexible goal-oriented behavior.

But once we get into the nitty-gritty, it turns out we don’t really have a full mathematical formalization of these intuitions. We lack a characterization of agents.

To date, the closest we’ve come to characterizing agents in general are the coherence theorems underlying Bayesian inference and utility maximization. A wide variety of theorems with a wide variety of different assumptions all point towards agents which perform Bayesian inference and choose their actions to maximize expected utility. In this framework, an agent is characterized by two pieces:

A probabilistic world-model

A utility function

The Bayesian utility characterization of agency neatly captures many of our intuitions of agency: the importance of accurate beliefs about the environment, the difference between things which do and don’t consistently pursue a goal (or approximately pursue a goal, or sometimes pursue a goal…), the importance of updating on new information, etc.

Sadly, for purposes of AGI alignment, the standard Bayesian utility characterization is incomplete at best. Some example issues include:

The need for a cartesian boundary—a clear separation between “agent” and “environment”, with well-defined input/output channels between the two

Logical omniscience—the assumption that agents can fully compute all of the implications of the information available to them, and track every possible state of the world

One way to view agent foundations research is that it seeks a characterization of agents which resolves problems like the first two above. We want the same sort of benefits offered by the Bayesian utility characterization, but in a wider and more realistic range of agenty systems.

Characterizing Real-World Agents

We want to characterize agency. We have a bunch of real-world systems which display agency to varying degrees. One obvious strategy is to go study and characterize those real-world agenty systems.

Concretely, what would this look like?

Well, let’s set aside the shortcomings of the standard Bayesian utility characterization for a moment, and imagine applying it to a real-world system—a financial market, for instance. We have various coherence theorems saying that agenty systems must implement Bayesian utility maximization, or else allow arbitrage. We have a strong prior that financial markets don’t allow arbitrage (except perhaps very small arbitrage on very short timescales). So, financial markets should have a Bayesian utility function, right? Obvious next step: pick an actual market and try to figure out its world-model and utility function.

I tried this, and it didn’t work. Turns out markets don’t have a utility function, in general (in this context, it’s called a “representative agent”).

Ok, but markets are still inexploitable and still seem agenty, so where did it go wrong? Can we generalize Bayesian utility to characterize systems which are agenty like markets? This was the line of inquiry which led to “Why Subagents?”. The upshot: for systems with internal state (including markets), the standard utility maximization characterization generalizes to a multi-agent committee characterization.

This is an example of a general strategy:

Start with some characterization of agency—don’t worry if it’s not perfect yet

Apply it to a real-world agenty system—specifically, try to back out the characterizing properties, e.g. the probabilistic world-model and utility function in the case of a Bayesian utility characterization

If successful, great! We’ve gained a useful theoretical tool for an interesting real-world system.

If unsuccessful, first check whether the failure corresponds to a situation where the system actually doesn’t act very agenty—if so, then that actually supports our characterization of agency, and again tells us something interesting about a real-world system.

Otherwise, we’ve found a real-world case where our characterization of agency fails. Look at the system’s actual internal behavior to see where it differs from the assumptions of our characterization, and then generalize the characterization to handle this kind of system.

Note that the last step, generalizing the characterization, still needs to maintain the structure of a characterization of agency. For example, prospect theory does a fine job predicting the choices of humans, but it isn’t a general characterization of effective goal-seeking behavior. There’s no reason to expect prospect-theory-like behavior to be universal for effective goal-seeking systems. The coherence theorems of Bayesian utility, on the other hand, provide fairly general conditions under which Bayesian induction and expected utility maximization are an optimal goal-seeking strategy—and therefore “universal”, at least within the conditions assumed. Although the Bayesian utility framework is incomplete at best, that’s still the kind of thing we’re looking for: a characterization which should apply to all effective goal-seeking systems.

Some examples of (hypothetical) projects which follow this general strategy:

Look up the kinetic equations governing chemotaxis in e-coli. Either extract an approximate probabilistic world-model and utility function from the equations, find a suboptimality in the bacteria’s behavior, or identify a loophole and expand the characterization of agency.

Pick a financial market. Using whatever data you can obtain, either extract (not necessarily unique) utility functions and world models of the component agents, find an arbitrage opportunity, or identify a new loophole and expand the characterization of agency.

Start with the weights from a neural network trained on a task in the openai gym. Either extract a probabilistic world model and utility function from the weights, find a strategy which dominates the NN’s strategy, or identify a loophole and expand the characterization of agency

… and so forth.

Why Would We Want to Do This?

Characterization of real-world agenty systems has a lot of advantages as a general research strategy.

First and foremost: when working on mathematical theory, it’s easy to get lost in abstraction and lose contact with the real world. One can end up pushing symbols around ad nauseum, without any idea which way is forward. The easiest counter to this failure mode is to stay grounded in real-world applications. Just as a rationalist lets reality guide beliefs, a theorist lets the problems, properties and intuitions of the real world guide the theory.

Second, when attempting to characterize real-world agenty systems, one is very likely to make some kind of forward progress. If the characterization works, then we’ve learned something useful about an interesting real-world system. If it fails, then we’ve identified a hole in our characterization of agency—and we have an example on hand to guide the construction of a new characterization.

Third, characterization of real-world agenty systems is directly relevant to alignment: the alignment problem itself basically amounts to characterizing the wants and ontologies of humans. This isn’t the only problem relevant to FAI—tiling and stability and subagent alignment and the like are separate—but it is basically the whole “alignment with humans” part. Characterizing e.g. the wants and ontology of an e-coli seems like a natural stepping-stone.

One could object that real-world agenty systems lack some properties which are crucial to the design of aligned AGI—most notably reflection and planned self-modification. A theory developed by looking only at real-world agents will therefore likely be incomplete. On the other hand, you don’t figure out general relativity without figuring out Newtonian gravitation first. Our understanding of agency is currently so woefully poor that we don’t even understand real-world systems, so we might as well start with that and reap all the advantages listed above. Once that’s figured out, we should expect it to pave the way to the final theory: just as general relativity has to reproduce Newtonian gravity in the limit of low speed and low energy, more advanced characterizations of agency should reproduce more basic characterizations under the appropriate conditions. The subagents characterization, for example, reproduces the utility characterization in cases where the agenty system has no internal state. It all adds up to normality—new theories must be consistent with the old, at least to the extent that the old theories work.

Finally, a note on relative advantage. As a strategy, characterizing real-world agenty systems leans heavily on domain knowledge in areas like biology, machine learning, economics, and neuroscience/psychology, along with the math involved in any agency research. That’s a pretty large pareto skill frontier, and I’d bet that it’s pretty underexplored. That means there’s a lot of opportunity for new, large contributions to the theory, if you have the domain knowledge or are willing to put in the effort to acquire it.

## Characterizing Real-World Agents as a Research Meta-Strategy

## Background

Intuitively, the real world seems to contain agenty systems (e.g. humans), non-agenty systems (e.g. rocks), and ambiguous cases which display some agent-like behavior sometimes (bacteria, neural nets, financial markets, thermostats, etc). There’s a vague idea that agenty systems pursue consistent goals in a wide variety of environments, and that various characteristics are necessary for this flexible goal-oriented behavior.

But once we get into the nitty-gritty, it turns out we don’t really have a full mathematical formalization of these intuitions. We lack a characterization of agents.

To date, the closest we’ve come to characterizing agents in general are the coherence theorems underlying Bayesian inference and utility maximization. A wide variety of theorems with a wide variety of different assumptions all point towards agents which perform Bayesian inference and choose their actions to maximize expected utility. In this framework, an agent is characterized by two pieces:

A probabilistic world-model

A utility function

The Bayesian utility characterization of agency neatly captures many of our intuitions of agency: the importance of accurate beliefs about the environment, the difference between things which do and don’t consistently pursue a goal (or approximately pursue a goal, or sometimes pursue a goal…), the importance of updating on new information, etc.

Sadly, for purposes of AGI alignment, the standard Bayesian utility characterization is incomplete at best. Some example issues include:

The need for a cartesian boundary—a clear separation between “agent” and “environment”, with well-defined input/output channels between the two

Logical omniscience—the assumption that agents can fully compute all of the implications of the information available to them, and track every possible state of the world

Path independence and complete preferences—the assumption that an agent doesn’t have a general tendency to stay in the state it’s in

One way to view agent foundations research is that it seeks a characterization of agents which resolves problems like the first two above. We want the same sort of benefits offered by the Bayesian utility characterization, but in a wider and more realistic range of agenty systems.

## Characterizing Real-World Agents

We want to characterize agency. We have a bunch of real-world systems which display agency to varying degrees. One obvious strategy is to go study and characterize those real-world agenty systems.

Concretely, what would this look like?

Well, let’s set aside the shortcomings of the standard Bayesian utility characterization for a moment, and imagine applying it to a real-world system—a financial market, for instance. We have various coherence theorems saying that agenty systems must implement Bayesian utility maximization, or else allow arbitrage. We have a strong prior that financial markets don’t allow arbitrage (except perhaps very small arbitrage on very short timescales). So, financial markets should have a Bayesian utility function, right? Obvious next step: pick an actual market and try to figure out its world-model and utility function.

I tried this, and it didn’t work. Turns out markets don’t have a utility function, in general (in this context, it’s called a “representative agent”).

Ok, but markets are still inexploitable and still seem agenty, so where did it go wrong? Can we generalize Bayesian utility to characterize systems which are agenty like markets? This was the line of inquiry which led to “Why Subagents?”. The upshot: for systems with internal state (including markets), the standard utility maximization characterization generalizes to a multi-agent committee characterization.

This is an example of a general strategy:

Start with some characterization of agency—don’t worry if it’s not perfect yet

Apply it to a real-world agenty system—specifically, try to back out the characterizing properties, e.g. the probabilistic world-model and utility function in the case of a Bayesian utility characterization

If successful, great! We’ve gained a useful theoretical tool for an interesting real-world system.

If unsuccessful, first check whether the failure corresponds to a situation where the system actually doesn’t act very agenty—if so, then that actually supports our characterization of agency, and again tells us something interesting about a real-world system.

Otherwise, we’ve found a real-world case where our characterization of agency fails. Look at the system’s actual internal behavior to see where it differs from the assumptions of our characterization, and then generalize the characterization to handle this kind of system.

Note that the last step, generalizing the characterization, still needs to maintain the structure of a characterization of agency. For example, prospect theory does a fine job predicting the choices of humans, but it isn’t a general characterization of effective goal-seeking behavior. There’s no reason to expect prospect-theory-like behavior to be universal for effective goal-seeking systems. The coherence theorems of Bayesian utility, on the other hand, provide fairly general conditions under which Bayesian induction and expected utility maximization are an optimal goal-seeking strategy—and therefore “universal”, at least within the conditions assumed. Although the Bayesian utility framework is incomplete at best, that’s still the

kindof thing we’re looking for: a characterization which should apply to all effective goal-seeking systems.Some examples of (hypothetical) projects which follow this general strategy:

Look up the kinetic equations governing chemotaxis in e-coli. Either extract an approximate probabilistic world-model and utility function from the equations, find a suboptimality in the bacteria’s behavior, or identify a loophole and expand the characterization of agency.

Pick a financial market. Using whatever data you can obtain, either extract (not necessarily unique) utility functions and world models of the component agents, find an arbitrage opportunity, or identify a new loophole and expand the characterization of agency.

Start with the weights from a neural network trained on a task in the openai gym. Either extract a probabilistic world model and utility function from the weights, find a strategy which dominates the NN’s strategy, or identify a loophole and expand the characterization of agency

… and so forth.

## Why Would We Want to Do This?

Characterization of real-world agenty systems has a lot of advantages as a general research strategy.

First and foremost: when working on mathematical theory, it’s easy to get lost in abstraction and lose contact with the real world. One can end up pushing symbols around ad nauseum, without any idea which way is forward. The easiest counter to this failure mode is to stay grounded in real-world applications. Just as a rationalist lets reality guide beliefs, a theorist lets the problems, properties and intuitions of the real world guide the theory.

Second, when attempting to characterize real-world agenty systems, one is very likely to make

somekind of forward progress. If the characterization works, then we’ve learned something useful about an interesting real-world system. If it fails, then we’ve identified a hole in our characterization of agency—and we have an example on hand to guide the construction of a new characterization.Third, characterization of real-world agenty systems is directly relevant to alignment: the alignment problem itself basically amounts to characterizing the wants and ontologies of humans. This isn’t the only problem relevant to FAI—tiling and stability and subagent alignment and the like are separate—but it is basically the whole “alignment with humans” part. Characterizing e.g. the wants and ontology of an e-coli seems like a natural stepping-stone.

One could object that real-world agenty systems lack some properties which are crucial to the design of aligned AGI—most notably reflection and planned self-modification. A theory developed by looking only at real-world agents will therefore likely be incomplete. On the other hand, you don’t figure out general relativity without figuring out Newtonian gravitation first. Our understanding of agency is currently so woefully poor that we don’t even understand real-world systems, so we might as well start with that and reap all the advantages listed above. Once that’s figured out, we should expect it to pave the way to the final theory: just as general relativity has to reproduce Newtonian gravity in the limit of low speed and low energy, more advanced characterizations of agency should reproduce more basic characterizations under the appropriate conditions. The subagents characterization, for example, reproduces the utility characterization in cases where the agenty system has no internal state. It all adds up to normality—new theories must be consistent with the old, at least to the extent that the old theories work.

Finally, a note on relative advantage. As a strategy, characterizing real-world agenty systems leans heavily on domain knowledge in areas like biology, machine learning, economics, and neuroscience/psychology, along with the math involved in any agency research. That’s a pretty large pareto skill frontier, and I’d bet that it’s pretty underexplored. That means there’s a lot of opportunity for new, large contributions to the theory, if you have the domain knowledge or are willing to put in the effort to acquire it.