Embedded Agency via Abstraction

Claim: problems of agents embedded in their environment mostly reduce to problems of abstraction. Solve abstraction, and solutions to embedded agency problems will probably just drop out naturally.

The goal of this post is to explain the intuition underlying that claim. The point is not to defend the claim socially or to prove it mathematically, but to illustrate why I personally believe that understanding abstraction is the key to understanding embedded agency. Along the way, we’ll also discuss exactly which problems of abstraction need to be solved for a theory of embedded agency.

What do we mean by “abstraction”?

Let’s start with a few examples:

  • We have a gas consisting of some huge number of particles. We throw away information about the particles themselves, instead keeping just a few summary statistics: average energy, number of particles, etc. We can then make highly precise predictions about things like pressure based only on the reduced information we’ve kept, without having to think about each individual particle. That reduced information is the “abstract layer”: the gas and its properties.

  • We have a bunch of transistors and wires on a chip. We arrange them to perform some logical operation, like maybe a NAND gate. Then, we throw away information about the underlying details, and just treat it as an abstract logical NAND gate. Using just the abstract layer, we can make predictions about what outputs will result from what inputs. Note that there’s some fuzziness: 0.01 V and 0.02 V are both treated as logical zero, and in rare cases there will be enough noise in the wires to get an incorrect output.

  • I tell my friend that I’m going to play tennis. I have ignored a huge amount of information about the details of the activity—where, when, what racket, what ball, with whom, all the distributions of every microscopic particle involved—yet my friend can still make some reliable predictions based on the abstract information I’ve provided.

  • When we abstract formulas like “1+1=2” or “2+2=4” into “n+n=2n”, we’re obviously throwing out information about the value of n, while still making whatever predictions we can given the information we kept. This is what abstraction is all about in math and programming: throw out as much information as you can, while still maintaining the core “prediction”.

  • I have a street map of New York City. The map throws out lots of info about the physical streets: street width, potholes, power lines and water mains, building facades, signs and stoplights, etc. But for many questions about distance or reachability on the physical city streets, I can translate the question into a query on the map. My query on the map will return reliable predictions about the physical streets, even though the map has thrown out lots of info.

The general pattern: there’s some ground-level “concrete” model, and an abstract model. The abstract model throws away or ignores information from the concrete model, but in such a way that we can still make reliable predictions about some aspects of the underlying system.
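The gas example can be made concrete with a small sketch. It is only an illustration under stated assumptions: a monatomic ideal gas, isotropic Gaussian velocities, arbitrary units, and hypothetical function names that do not come from any library. The concrete model is the full list of particle velocities; the abstract model keeps only the particle count and the mean kinetic energy, yet still predicts pressure.

```python
import random

def sample_gas(n_particles, sigma, seed=0):
    """Concrete model: one velocity vector (vx, vy, vz) per particle."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(0.0, sigma) for _ in range(3))
            for _ in range(n_particles)]

def abstract(velocities, mass):
    """Abstract model: keep only the particle count and mean kinetic energy."""
    n = len(velocities)
    total_ke = sum(0.5 * mass * (vx * vx + vy * vy + vz * vz)
                   for vx, vy, vz in velocities)
    return n, total_ke / n

def pressure_from_summary(n, mean_ke, volume):
    """Ideal-gas prediction from the summary alone: P = (2/3) * N * <KE> / V."""
    return (2.0 / 3.0) * n * mean_ke / volume

def pressure_from_particles(velocities, mass, volume):
    """Particle-level momentum flux along x: P = sum(m * vx^2) / V."""
    return sum(mass * vx * vx for vx, _, _ in velocities) / volume

velocities = sample_gas(20000, sigma=1.0, seed=1)
n, mean_ke = abstract(velocities, mass=1.0)
p_abstract = pressure_from_summary(n, mean_ke, volume=2.0)
p_concrete = pressure_from_particles(velocities, mass=1.0, volume=2.0)
```

The two pressures agree only approximately, and the gap shrinks as the number of particles grows. That is the sense in which the abstract prediction is reliable without being exact.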

Notice that, in most of these examples, the predictions of the abstract model need not be perfectly accurate. The mathematically exact abstractions used in pure math and CS are an unusual corner case: they don’t deal with the sort of fuzzy boundaries we see in the real world. “Tennis”, on the other hand, is a fuzzy abstraction of many real-world activities, and there are edge cases which are sort-of-tennis-but-maybe-not. Most of the interesting problems involve non-exact abstraction, so we’ll mostly talk about that, with the understanding that math/CS-style abstraction is just the case with zero fuzz.

In terms of existing theory, I only know of one field which explicitly quantifies abstraction without needing hard edges: statistical mechanics. The heart of the field is things like “I have a huge number of tiny particles in a box, and I want to treat them as one abstract object which I’ll call ‘gas’. What properties will the gas have?” Jaynes puts the tools of statistical mechanics on foundations which can, in principle, be used for quantifying abstraction more generally. (I don’t think Jaynes had all the puzzle pieces, but he had a lot more than anyone else I’ve read.) It’s rather difficult to find good sources for learning stat mech the Jaynes way; Walter Grandy has a few great books, but they’re not exactly intro-level.

Summary: abstraction is about ignoring or throwing away information, in such a way that we can still make reliable predictions about some aspects of the underlying system.

Embedded World-Models

The next few sections will walk through different ways of looking at the core problems of embedded agency, as presented in the embedded agency sequence. We’ll start with embedded world-models, since these introduce the key constraint for everything else.

The underlying challenge of embedded world-models is that the map is smaller than the territory it represents. The map simply won’t have enough space to perfectly represent the state of the whole territory, much less every possible territory, as required for Bayesian inference. A piece of paper with some lines on it doesn’t have space to represent the full microscopic configuration of every atom comprising the streets of New York City.

Obvious implication: the map has to throw out some information about the territory. (Note that this isn’t necessarily true in all cases: the territory could have some symmetry allowing for a perfect compressed representation. But this probably won’t apply to most real-world systems, e.g. the full microscopic configuration of every atom comprising the streets of New York City.)
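As a toy illustration of the symmetry caveat, here is a sketch in which the "territory" is just a string. When the string is an exact repetition of a shorter unit, a much smaller map (the unit plus a repeat count) represents it perfectly; when it is not, no such lossless shortcut exists and the map has to discard something. The function names are hypothetical, and treating a string as a territory is purely an assumption for illustration.

```python
def compress_periodic(territory):
    """Return (unit, repeats) if territory is an exact repetition of a
    shorter unit, else None (no perfect compressed representation found)."""
    n = len(territory)
    for size in range(1, n // 2 + 1):
        if n % size == 0 and territory[:size] * (n // size) == territory:
            return territory[:size], n // size
    return None

def decompress(unit, repeats):
    """Rebuild the full territory from the compressed map."""
    return unit * repeats

symmetric = compress_periodic("ABABABAB")   # exact repetition of "AB"
irregular = compress_periodic("ABCABD")     # no exact repetition
```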

So we need to throw out some information to make a map, but we still want to be able to reliably predict some aspects of the territory—otherwise there wouldn’t be any point in building a map to start with. In other words, we need abstraction.

Exactly what problems of abstraction do we need to solve?

The simplest problems are things like:

  • Given a map-making process, characterize the queries whose answers the map can reliably predict. Example: figure out what questions a street map can answer by watching a cartographer produce one.

  • Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa. Example: after understanding the relationship between streets and lines on paper, turn “how far is Times Square from the Met?” into “How far is the Times Square symbol from the Met symbol on the map, and what’s the scale?”

  • Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself. Example: recognize that the world contains lots of things with leaves, bark, branches, etc., and these “trees” are similar enough that a compressed map can reliably make predictions about specific trees; e.g. things with branches and bark are also likely to have leaves.

  • Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.

  • Given a map and a class of queries whose answers the map can reliably predict, characterize the class of territories which the map might represent.

  • Given multiple different maps supporting different queries, how can we use them together consistently? Example: a construction project may need to use both a water-main map and a streetmap to figure out where to dig.
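A few of the problems above can be sketched for the street-map case. This is a minimal toy, not a proposed theory: the segment data are made-up illustrative numbers (not real distances), and `make_map`, `distance_on_map`, and `can_answer` are hypothetical names. The map-making process keeps endpoints and lengths and drops everything else; a distance query about the territory is translated into a shortest-path query on the map; and the class of answerable queries is simply whatever the map-maker kept.

```python
import heapq

# Hypothetical "territory": street segments with details the map discards.
TERRITORY = [
    {"from": "Times Square", "to": "Columbus Circle",
     "length_km": 1.2, "width_m": 18, "potholes": 7},
    {"from": "Columbus Circle", "to": "The Met",
     "length_km": 2.6, "width_m": 15, "potholes": 2},
    {"from": "Times Square", "to": "Grand Central",
     "length_km": 0.9, "width_m": 20, "potholes": 4},
]

def make_map(territory):
    """Map-making process: keep endpoints and lengths, drop everything else."""
    street_map = {}
    for seg in territory:
        street_map.setdefault(seg["from"], []).append((seg["to"], seg["length_km"]))
        street_map.setdefault(seg["to"], []).append((seg["from"], seg["length_km"]))
    return street_map

def distance_on_map(street_map, start, goal):
    """Answer a distance query using only the map (Dijkstra's algorithm)."""
    best = {start: 0.0}
    frontier = [(0.0, start)]
    while frontier:
        dist, node = heapq.heappop(frontier)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, length in street_map.get(node, []):
            new_dist = dist + length
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(frontier, (new_dist, neighbor))
    return None  # goal is not reachable on the map

def can_answer(query_kind):
    """Query class this map supports: only what the map-maker kept."""
    return query_kind in {"distance", "reachability"}

street_map = make_map(TERRITORY)
met_distance = distance_on_map(street_map, "Times Square", "The Met")
```

Questions about potholes or street width fall outside the supported query class: the map-maker threw that information away, so no amount of cleverness on the map side can recover it.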

These kinds of questions directly address many of the issues from Abram & Scott’s embedded world-models post: grain-of-truth, high-level/multi-level models, ontological crises. But we still need to discuss the biggest barrier to a theory of embedded world-models: diagonalization, i.e. a territory which sees the map’s predictions and then falsifies them.

If the map is embedded in the territory, then things in the territory can look at what the map predicts, then make the prediction false. For instance, some troll in the department of transportation could regularly check Google’s traffic map for NYC, then quickly close off roads to make the map as inaccurate as possible. This sort of thing could even happen naturally, without trolls: if lots of people follow Google’s low-traffic route recommendations, then the recommended routes will quickly fill up with traffic.

These examples suggest that, when making a map of a territory which contains the map, there is a natural role for randomization: Google’s traffic-mapping team can achieve maximum accuracy by randomizing their own predictions. Rather than recommending the same minimum-traffic route for everyone, they can randomize between a few routes and end up at a Nash equilibrium in their prediction game.
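To make the randomization idea concrete, here is a toy congestion model. Everything in it is an assumption chosen for illustration: two routes, travel time linear in the fraction of drivers on each route, and every driver following the recommendation they receive. Under those assumptions, recommending one route to everyone falsifies the "this route is fastest" prediction, while a suitable random split equalizes travel times so that no recommendation is wrong.

```python
def travel_times(p, free_a, free_b, slope_a, slope_b):
    """Travel time on each route when a fraction p of drivers is sent to A."""
    return free_a + slope_a * p, free_b + slope_b * (1.0 - p)

def equilibrium_split(free_a, free_b, slope_a, slope_b):
    """Fraction p sent to A that equalizes both travel times, clipped to [0, 1].

    Solves free_a + slope_a * p == free_b + slope_b * (1 - p) for p.
    """
    p = (free_b + slope_b - free_a) / (slope_a + slope_b)
    return min(1.0, max(0.0, p))

# Illustrative parameters (minutes); route A is faster when empty.
params = dict(free_a=10.0, free_b=15.0, slope_a=20.0, slope_b=10.0)

# Deterministic recommendation: send everyone to A, and A becomes the slow route.
all_a = travel_times(1.0, **params)

# Randomized recommendation: split so that neither route is faster.
p_star = equilibrium_split(**params)
mixed = travel_times(p_star, **params)
```

Equal travel times are only one possible choice of "rules of the game"; the sketch assumes them rather than deriving them.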

We’re speculating about a map making predictions based on a game-theoretic mixed strategy, but at this point we haven’t even defined the rules of the game. What is the map’s “utility function” in this game? The answer to that sort of question should come from thinking about the simpler questions from earlier. We want a theory where the “rules of the game” for self-referential maps follow naturally from the theory for non-self-referential maps. This is one major reason why I see abstraction as the key to embedded agency, rather than embedded agency as the key to abstraction: I expect a solid theory of non-self-referential abstractions to naturally define the rules/objectives of self-referential abstraction. Also, I expect the non-self-referential theory to characterize embedded map-making processes, which the self-referential theory will likely need to recognize in the territory.

Embedded Decision Theory

The main problem for embedded decision theory—as opposed to decision theory in general—is how to define counterfactuals. We want to ask questions like “what would happen if I dropped this apple on that table”, even if we can look at our own internal program and see that we will not, in fact, drop the apple. If we want our agent to maximize some expected utility function E[u(x)], then the “x” needs to represent a counterfactual scenario in which the agent takes some action—and we need to be able to reason about that scenario even if the agent ends up taking some other action.

Of course, we said in the previous section that the agent is using a map which is smaller than the territory—in “E[u(x)]”, that map defines the expectation operator E[-]. (Of course, we could imagine architectures which don’t explicitly use an expectation operator or utility function, but the main point carries over: the agent’s decisions will be based on a map smaller than the territory.) Decision theory requires that we run counterfactual queries on that map, so it needs to be a causal model.

In particular, we need a causal model which allows counterfactual queries over the agent’s own “outputs”, i.e. the results of any optimization it runs. In other words, the agent needs to be able to recognize itself (or copies of itself) in the environment. The map needs to represent, if not a hard boundary between agent and environment, at least the pieces which will be changed by the agent’s computation and/or actions.

What constraints does this pose on a theory of abstraction suitable for embedded agency?

The main constraints are:

  • The map and territory should both be causal (possibly with symmetry)

  • Counterfactual queries on the map should naturally correspond to counterfactuals on the territory

  • The agent needs some idea of which counterfactuals on the map correspond to its own computations/actions in the territory; i.e. it needs to recognize itself

These are the minimum requirements for the agent to plan out its actions based on the map, implement the plan in the territory, and have such plans work.
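A minimal sketch of a counterfactual query on a causal map, using the apple example. The model, the utility function, and all names here are hypothetical and deliberately tiny: the `action` argument plays the role of an intervention on the node the agent recognizes as its own output, overriding whatever its actual policy would produce.

```python
def run_model(apple_held_over_table, action=None):
    """Tiny structural causal model of the apple scenario.

    Passing `action` overrides the agent's own policy node, i.e. it is a
    counterfactual intervention on the agent's output.
    """
    policy_action = "hold"  # what the agent's actual program will output
    chosen = policy_action if action is None else action
    apple_on_table = apple_held_over_table and chosen == "drop"
    return {"action": chosen, "apple_on_table": apple_on_table}

def utility(outcome):
    """Toy utility (an assumption): the agent prefers to keep the apple."""
    return 0.0 if outcome["apple_on_table"] else 1.0

def best_action(actions, apple_held_over_table):
    """Score each action by a counterfactual query, then pick the best."""
    scores = {a: utility(run_model(apple_held_over_table, action=a))
              for a in actions}
    return max(scores, key=scores.get), scores

factual = run_model(True)                        # no intervention
counterfactual = run_model(True, action="drop")  # "what if I dropped it?"
choice, scores = best_action(["hold", "drop"], True)
```

The agent never drops the apple, yet it can still evaluate what would happen if it did. That is all the sketch is meant to show; it says nothing about how the agent locates its own output node in a real territory.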

Note that there are still many degrees of freedom here. For instance, how does the agent handle copies of itself embedded in the environment? Some answers to that question might be “better” than others, in terms of producing more utility or something, but I see that as a decision theory question which is not a necessary prerequisite for a theory of embedded agency. On the other hand, a theory of embedded agency would probably help build decision theories which reason about copies of the agent. This is a major reason why I see a theory of abstraction as a prerequisite to new decision theories, but not new decision theories as a prerequisite to abstraction: we need abstraction on causal models just to talk about embedded decision theory, but problems like agent-copies can be built later on top of a theory of abstraction, especially one which already handles self-referential maps.

Self-Reasoning & Improvement

Problems of self-reasoning, improvement, tiling, and so forth are similar to the problems of self-referential abstraction, but on hard mode. We’re no longer just thinking about a map of a territory which contains the map; we’re thinking about a map of a territory which contains the whole map-making process, and we want to e.g. modify the map-making process to produce more reliable maps. But if our goals are represented on the old, less-reliable map, can we safely translate those goals into the new map? For that matter, do the goals on the old map even make sense in the territory?

So… hard mode. What do we need from our theory of abstraction?

A lot of this boils down to the “simple” questions from earlier: make sure queries on the old map translate intelligibly into queries on the territory, and are compatible with queries on other maps, etc. But there are some significant new elements here: reflecting specifically on the map-making process, especially when we don’t have an outside-view way to know that we’re thinking about the territory “correctly” to begin with.

These things feel to me like “level 2” questions. Level 1: build a theory of abstraction between causal models. Handle cases where the map models a copy of itself, e.g. when an agent labels its own computations/actions in the map. Part of that theory should talk about map-making processes: for what queries/territories will a given map-maker produce a map which makes successful predictions? What map-making processes produce successful self-referential maps? Once level 1 is nailed down, we should have the tools to talk about level 2: running counterfactuals in which we change the map-making process.

Of course, not all questions of self-reasoning/improvement are about abstraction. We could also ask questions about e.g. how to make an agent which modifies its own code to run faster, without changing input/output (though of course input/output are slippery notions in an embedded world…). We could ask questions about how to make an agent modify its own decision theory. Etc. These problems don’t inherently involve abstraction. My intuition, however, is that the problems which don’t involve self-referential abstraction usually seem easier. That’s not to say people shouldn’t work on them (there’s certainly value there, and they seem more amenable to incremental progress), but the critical path to a workable theory of embedded agency seems to go through self-referential maps and map-makers.


Subsystems

Agents made of parts have subsystems. Insofar as those subsystems are also agenty and have goals of their own, we want them to be aligned with the top-level agent. What new requirements does this pose for a theory of abstraction?

First and foremost, if we want to talk about agent subsystems, then our map can’t just black-box the whole agent. We can’t circumvent the lack of an agent-environment boundary by simply drawing our own agent-environment boundary, and ignoring everything on the “agent” side. That doesn’t necessarily mean that we can’t do any self-referential black boxing. For instance, if we want to represent a map which contains a copy of itself, then a natural method is to use a data structure which contains a pointer to itself. That sort of strategy has not necessarily been ruled out, but we can’t just blindly apply it to the whole agent.
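The self-pointer trick is easy to show directly. In this sketch (hypothetical names, with a plain dictionary standing in for a map), the map lists itself among the objects in the territory by reference, so it "contains a copy of itself" without any infinite nesting.

```python
def make_self_referential_map():
    """Build a map that includes itself among the objects it describes."""
    world_map = {"streets": ["Broadway", "5th Avenue"], "objects": {}}
    # Store a reference to the map itself rather than an embedded copy,
    # which would require infinitely many nested copies.
    world_map["objects"]["this_map"] = world_map
    return world_map

def follow_self_pointer(world_map, steps):
    """Follow the self-pointer `steps` times; it always leads back to the map."""
    current = world_map
    for _ in range(steps):
        current = current["objects"]["this_map"]
    return current

world_map = make_self_referential_map()
same_object = follow_self_pointer(world_map, 1000) is world_map
```

The pointer black-boxes exactly one thing, the map itself. It does nothing to expose the agent's other parts, which is why the same move cannot simply be applied to the whole agent.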

In particular, if we’re working with causal models (possibly with symmetry), then the details of the map-making process and the reflecting-on-map-making process and whatnot all need to be causal as well. We can’t call on oracles or non-constructive existence theorems or other such magic. Loosely speaking, our theory of abstraction needs to be computable.

In addition, we don’t just want to model the agent as having parts, we want to model some of the parts as agenty—or at least consider that possibility. In particular, that means we need to talk about other maps and other map-makers embedded in the environment. We want to be able to recognize map-making processes embedded in the territory. And again, this all needs to be computable, so we need algorithms to recognize map-making processes embedded in the territory.

We’re talking about these capabilities in the context of aligning subagents, but this is really a key requirement for alignment more broadly. Ultimately, we want to point at something in the territory and say “See that agenty thing over there? That’s a human; there’s a bunch of them out in the world. Figure out their values, and help satisfy those values.” Recognizing agents embedded in the territory is a key piece of this, and recognizing embedded map-making processes seems to me like the hardest part of that problem—again, it’s on the critical path.


Summary

Time for a recap.

The idea of abstraction is to throw out information, while still maintaining the ability to provide reliable predictions on at least some queries.

In order to address the core problems of embedded world-models, a theory of abstraction would need to first handle some “simple” questions:

  • Characterize which queries work on which maps of which territories.

  • Characterize which query classes admit significantly-compressed maps on which territories.

  • Characterize map-making processes which produce reliable maps.

  • Translate queries between map-representation and territory-representation, and between different map-representations.

We hope that a theory which addresses these problems on non-self-referential maps will suggest natural objectives/rules for self-referential maps.

Embedded decision theory adds a few more constraints, in order to define counterfactuals for optimization:

  • Our theory of abstraction should work with causal models for both the territory and the map.

  • We need ways of mapping between counterfactuals on the map and counterfactuals on the territory.

  • Agents need some way to recognize their own computations/outputs in the territory, and represent them in the map.

A theory of embedded agency seems necessary for talking about embedded decision theory in a well-defined way.

Self-reasoning kicks self-referential map-making one rung up the meta-ladder, and starts to talk about maps of map-making processes and related issues. These aren’t the only problems of self-reasoning, but it does feel like self-referential abstraction captures the “hard part”—it’s on the critical path to a full theory.

Finally, subsystems push us to make the entire theory of abstraction causal/computable. They also require algorithms for recognizing agents (and thus map-makers) embedded in the territory. That’s a problem we probably want to solve for safety purposes anyway. Again, abstraction isn’t the only part of the problem, but it seems to capture enough of the hard part to be on the critical path.