Embedded Agency via Abstraction

Claim: problems of agents embedded in their environment mostly reduce to problems of abstraction. Solve abstraction, and solutions to embedded agency problems will probably just drop out naturally.

The goal of this post is to explain the intuition underlying that claim. The point is not to defend the claim socially or to prove it mathematically, but to illustrate why I personally believe that understanding abstraction is the key to understanding embedded agency. Along the way, we’ll also discuss exactly which problems of abstraction need to be solved for a theory of embedded agency.

What do we mean by “abstraction”?

Let’s start with a few examples:

  • We have a gas consisting of some huge number of particles. We throw away information about the particles themselves, instead keeping just a few summary statistics: average energy, number of particles, etc. We can then make highly precise predictions about things like e.g. pressure just based on the reduced information we’ve kept, without having to think about each individual particle. That reduced information is the “abstract layer”—the gas and its properties.

  • We have a bunch of transistors and wires on a chip. We arrange them to perform some logical operation, like maybe a NAND gate. Then, we throw away information about the underlying details, and just treat it as an abstract logical NAND gate. Using just the abstract layer, we can make predictions about what outputs will result from what inputs. Note that there’s some fuzziness: 0.01 V and 0.02 V are both treated as logical zero, and in rare cases there will be enough noise in the wires to get an incorrect output.

  • I tell my friend that I’m going to play tennis. I have ignored a huge amount of information about the details of the activity—where, when, what racket, what ball, with whom, all the distributions of every microscopic particle involved—yet my friend can still make some reliable predictions based on the abstract information I’ve provided.

  • When we abstract formulas like “1+1=2” or “2+2=4” into “n+n=2n”, we’re obviously throwing out information about the value of n, while still making whatever predictions we can given the information we kept. This is what abstraction is all about in math and programming: throw out as much information as you can, while still maintaining the core “prediction”.

  • I have a street map of New York City. The map throws out lots of info about the physical streets: street width, potholes, power lines and water mains, building facades, signs and stoplights, etc. But for many questions about distance or reachability on the physical city streets, I can translate the question into a query on the map. My query on the map will return reliable predictions about the physical streets, even though the map has thrown out lots of info.
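To make the first example concrete, here’s a toy sketch in Python. The specific representation and the kinetic-theory-style pressure formula (in made-up one-dimensional units) are illustrative assumptions, not part of the examples above; the point is just that the prediction is computed from the summary statistics alone.

```python
import random

def make_gas(n_particles, temperature, seed=0):
    # Concrete model: one velocity per particle (detail we will throw away).
    rng = random.Random(seed)
    return [rng.gauss(0.0, temperature ** 0.5) for _ in range(n_particles)]

def summarize(gas, volume):
    # The "abstract layer": keep only a few summary statistics.
    return {
        "n": len(gas),
        "mean_sq_speed": sum(v * v for v in gas) / len(gas),
        "volume": volume,
    }

def pressure(summary):
    # Prediction made from the reduced information alone,
    # without ever looking at individual particles again.
    return summary["n"] * summary["mean_sq_speed"] / summary["volume"]
```

Once the gas is summarized, every downstream prediction consults only the handful of numbers in the summary, never the per-particle state.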

The general pattern: there’s some ground-level “concrete” model, and an abstract model. The abstract model throws away or ignores information from the concrete model, but in such a way that we can still make reliable predictions about some aspects of the underlying system.

Notice that, in most of these examples, the predictions of the abstract model need not be perfectly accurate. The mathematically exact abstractions used in pure math and CS are an unusual corner case: they don’t deal with the sort of fuzzy boundaries we see in the real world. “Tennis”, on the other hand, is a fuzzy abstraction of many real-world activities, and there are edge cases which are sort-of-tennis-but-maybe-not. Most of the interesting problems involve non-exact abstraction, so we’ll mostly talk about that, with the understanding that math/CS-style abstraction is just the case with zero fuzz.

In terms of existing theory, I only know of one field which explicitly quantifies abstraction without needing hard edges: statistical mechanics. The heart of the field is things like “I have a huge number of tiny particles in a box, and I want to treat them as one abstract object which I’ll call ‘gas’. What properties will the gas have?” Jaynes puts the tools of statistical mechanics on foundations which can, in principle, be used for quantifying abstraction more generally. (I don’t think Jaynes had all the puzzle pieces, but he had a lot more than anyone else I’ve read.) It’s rather difficult to find good sources for learning stat mech the Jaynes way; Walter Grandy has a few great books, but they’re not exactly intro-level.

Summary: abstraction is about ignoring or throwing away information, in such a way that we can still make reliable predictions about some aspects of the underlying system.
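One way to operationalize this summary: an abstraction is a lossy function from concrete states to abstract states, together with a class of queries whose answers survive the compression. The following check (my own framing, not terminology from the post) tests exactly that property:

```python
def is_valid_abstraction(states, abstract_fn, queries, tol=0.0):
    """Check the defining property of an abstraction: for every concrete
    state and every supported query, answering via the abstract
    representation agrees (within tol) with answering via the full state.

    `queries` is a list of (q_full, q_abs) pairs: the same question asked
    of the concrete state and of the abstract state, respectively."""
    return all(
        abs(q_abs(abstract_fn(s)) - q_full(s)) <= tol
        for s in states
        for q_full, q_abs in queries
    )

# Toy usage: abstract a list of numbers down to its total.
states = [[1, 2, 3], [5, 5]]
# The "total" query survives the compression...
assert is_valid_abstraction(states, sum, [(sum, lambda a: a)])
# ...but the "first element" query relies on discarded information.
assert not is_valid_abstraction(states, sum, [(lambda s: s[0], lambda a: a)])
```

The `tol` parameter is the hook for the fuzzy, non-exact case: math/CS-style abstraction is just `tol = 0` with no failing states.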

Embedded World-Models

The next few sections will walk through different ways of looking at the core problems of embedded agency, as presented in the embedded agency sequence. We’ll start with embedded world-models, since these introduce the key constraint for everything else.

The underlying challenge of embedded world-models is that the map is smaller than the territory it represents. The map simply won’t have enough space to perfectly represent the state of the whole territory—much less every possible territory, as required for Bayesian inference. A piece of paper with some lines on it doesn’t have space to represent the full microscopic configuration of every atom comprising the streets of New York City.

Obvious implication: the map has to throw out some information about the territory. (Note that this isn’t necessarily true in all cases: the territory could have some symmetry allowing for a perfect compressed representation. But this probably won’t apply to most real-world systems, e.g. the full microscopic configuration of every atom comprising the streets of New York City.)
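As a toy illustration of the symmetry caveat (my own example): a territory that happens to be periodic admits a perfect compressed representation, losing nothing for any query.

```python
def compress_periodic(xs):
    # Find the shortest period that exactly generates the sequence.
    # Worst case, the "period" is the whole sequence (no compression).
    for p in range(1, len(xs) + 1):
        if all(xs[i] == xs[i % p] for i in range(len(xs))):
            return xs[:p], len(xs)

def decompress(period, n):
    # Perfect reconstruction: the symmetry means no information was lost.
    return [period[i % len(period)] for i in range(n)]
```

A length-12 sequence with period 3 round-trips losslessly through a map a quarter of its size; a generic real-world territory has no such luck.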

So we need to throw out some information to make a map, but we still want to be able to reliably predict some aspects of the territory—otherwise there wouldn’t be any point in building a map to start with. In other words, we need abstraction.

Exactly what problems of abstraction do we need to solve?

The simplest problems are things like:

  • Given a map-making process, characterize the queries whose answers the map can reliably predict. Example: figure out what questions a streetmap can answer by watching a cartographer produce a streetmap.

  • Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa. Example: after understanding the relationship between streets and lines on paper, turn “how far is Times Square from the Met?” into “how far is the Times Square symbol from the Met symbol on the map, and what’s the scale?”

  • Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself. Example: recognize that the world contains lots of things with leaves, bark, branches, etc, and these “trees” are similar enough that a compressed map can reliably make predictions about specific trees—e.g. things with branches and bark are also likely to have leaves.

  • Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.

  • Given a map and a class of queries whose answers the map can reliably predict, characterize the class of territories which the map might represent.

  • Given multiple different maps supporting different queries, how can we use them together consistently? Example: a construction project may need to use both a water-main map and a streetmap to figure out where to dig.
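For a flavor of the map-construction problem, here’s a sketch: treat the street grid as a weighted graph (the “territory”), and build a map that keeps only distances between a few landmarks, throwing away every other node and edge. The graph and landmarks are invented for illustration.

```python
import heapq

def dijkstra(graph, src):
    # Exact shortest-path distances in the full "territory" graph.
    # graph: {node: [(neighbor, edge_weight), ...]}
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def build_map(graph, landmarks):
    # "Map-making": keep only landmark-to-landmark distances.
    # The resulting map is much smaller than the territory, yet answers
    # the whole query class "how far apart are these two landmarks?".
    return {a: {b: dijkstra(graph, a)[b] for b in landmarks} for a in landmarks}

# A four-node toy street grid, with A and D as the landmarks we care about.
streets = {
    "A": [("B", 1)],
    "B": [("A", 1), ("C", 2)],
    "C": [("B", 2), ("D", 1)],
    "D": [("C", 1)],
}
landmark_map = build_map(streets, ["A", "D"])
```

Queries inside the supported class are answered from the compressed map alone; queries about the discarded intermediate nodes simply aren’t in the map’s repertoire.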

These kinds of questions directly address many of the issues from Abram & Scott’s embedded world-models post: grain-of-truth, high-level/multi-level models, ontological crises. But we still need to discuss the biggest barrier to a theory of embedded world-models: diagonalization, i.e. a territory which sees the map’s predictions and then falsifies them.

If the map is embedded in the territory, then things in the territory can look at what the map predicts, then make the prediction false. For instance, some troll in the department of transportation could regularly check Google’s traffic map for NYC, then quickly close off roads to make the map as inaccurate as possible. This sort of thing could even happen naturally, without trolls: if lots of people follow Google’s low-traffic route recommendations, then the recommended routes will quickly fill up with traffic.

These examples suggest that, when making a map of a territory which contains the map, there is a natural role for randomization: Google’s traffic-mapping team can achieve maximum accuracy by randomizing their own predictions. Rather than recommending the same minimum-traffic route for everyone, they can randomize between a few routes and end up at a Nash equilibrium in their prediction game.
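The randomization idea can be shown in a toy congestion game (an invented example, nothing to do with Google’s actual system): two routes with linear travel times, where the map recommends route 1 to a random fraction p of drivers. At the Nash equilibrium no driver gains by deviating from their recommendation, which pins down p.

```python
def equilibrium_split(free1, slope1, free2, slope2):
    # Travel times: t1 = free1 + slope1 * p,  t2 = free2 + slope2 * (1 - p),
    # where p is the fraction of drivers the map sends down route 1.
    # At equilibrium the two travel times are equal (otherwise someone
    # would switch), so solve free1 + slope1*p == free2 + slope2*(1-p).
    p = (free2 + slope2 - free1) / (slope1 + slope2)
    return min(max(p, 0.0), 1.0)
```

With symmetric routes the map should randomize 50/50; a route that is faster when empty absorbs a larger share before the times equalize. Recommending a single “best” route to everyone, by contrast, guarantees the prediction falsifies itself.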

We’re speculating about a map making predictions based on a game-theoretic mixed strategy, but at this point we haven’t even defined the rules of the game. What is the map’s “utility function” in this game? The answer to that sort of question should come from thinking about the simpler questions from earlier. We want a theory where the “rules of the game” for self-referential maps follow naturally from the theory for non-self-referential maps. This is one major reason why I see abstraction as the key to embedded agency, rather than embedded agency as the key to abstraction: I expect a solid theory of non-self-referential abstraction to naturally define the rules/objectives of self-referential abstraction. Also, I expect the non-self-referential theory to characterize embedded map-making processes, which the self-referential theory will likely need to recognize in the territory.

Embedded Decision Theory

The main problem for embedded decision theory—as opposed to decision theory in general—is how to define counterfactuals. We want to ask questions like “what would happen if I dropped this apple on that table”, even if we can look at our own internal program and see that we will not, in fact, drop the apple. If we want our agent to maximize some expected utility function E[u(x)], then the “x” needs to represent a counterfactual scenario in which the agent takes some action—and we need to be able to reason about that scenario even if the agent ends up taking some other action.

Of course, we said in the previous section that the agent is using a map which is smaller than the territory—in “E[u(x)]”, that map defines the expectation operator E[-]. (We could imagine architectures which don’t explicitly use an expectation operator or utility function, but the main point carries over: the agent’s decisions will be based on a map smaller than the territory.) Decision theory requires that we run counterfactual queries on that map, so it needs to be a causal model.

In particular, we need a causal model which allows counterfactual queries over the agent’s own “outputs”, i.e. the results of any optimization it runs. In other words, the agent needs to be able to recognize itself—or copies of itself—in the environment. The map needs to represent, if not a hard boundary between agent and environment, at least the pieces which will be changed by the agent’s computation and/or actions.
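A minimal sketch of the kind of query involved, using a Pearl-style intervention on a three-variable causal model (the variables are invented for illustration). The agent’s output is just another node; a counterfactual overrides its usual mechanism while leaving upstream variables untouched:

```python
def run_model(do_action=None):
    # Mechanisms of a tiny structural causal model.
    weather = "rain"                                              # exogenous input
    # The agent's "output" node: normally computed from its inputs...
    action = "stay_in" if weather == "rain" else "play_tennis"
    # ...unless a counterfactual query severs that mechanism (do()).
    if do_action is not None:
        action = do_action
    exercise = 1 if action == "play_tennis" else 0                # downstream effect
    return {"weather": weather, "action": action, "exercise": exercise}
```

Factually, the agent stays in. The counterfactual `run_model(do_action="play_tennis")` still answers “what would happen if I played tennis” even though inspecting the program shows the agent won’t: the intervention changes the action and its downstream effects, but not the weather upstream.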

What constraints does this pose on a theory of abstraction suitable for embedded agency?

The main constraints are:

  • The map and territory should both be causal (possibly with symmetry)

  • Counterfactual queries on the map should naturally correspond to counterfactuals on the territory

  • The agent needs some idea of which counterfactuals on the map correspond to its own computations/actions in the territory—i.e. it needs to recognize itself

These are the minimum requirements for the agent to plan out its actions based on the map, implement the plan in the territory, and have such plans work.

Note that there are still a lot of degrees of freedom here. For instance, how does the agent handle copies of itself embedded in the environment? Some answers to that question might be “better” than others, in terms of producing more utility or something, but I see that as a decision theory question which is not a necessary prerequisite for a theory of embedded agency. On the other hand, a theory of embedded agency would probably help build decision theories which reason about copies of the agent. This is a major reason why I see a theory of abstraction as a prerequisite to new decision theories, but not new decision theories as a prerequisite to abstraction: we need abstraction on causal models just to talk about embedded decision theory, but problems like agent-copies can be built later on top of a theory of abstraction—especially a theory of abstraction which already handles self-referential maps.

Self-Reasoning & Improvement

Problems of self-reasoning, improvement, tiling, and so forth are similar to the problems of self-referential abstraction, but on hard mode. We’re no longer just thinking about a map of a territory which contains the map; we’re thinking about a map of a territory which contains the whole map-making process, and we want to e.g. modify the map-making process to produce more reliable maps. But if our goals are represented on the old, less-reliable map, can we safely translate those goals into the new map? For that matter, do the goals on the old map even make sense in the territory?

So… hard mode. What do we need from our theory of abstraction?

A lot of this boils down to the “simple” questions from earlier: make sure queries on the old map translate intelligibly into queries on the territory, and are compatible with queries on other maps, etc. But there are some significant new elements here: reflecting specifically on the map-making process, especially when we don’t have an outside-view way to know that we’re thinking about the territory “correctly” to begin with.

These things feel to me like “level 2” questions. Level 1: build a theory of abstraction between causal models. Handle cases where the map models a copy of itself, e.g. when an agent labels its own computations/actions in the map. Part of that theory should talk about map-making processes: for what queries/territories will a given map-maker produce a map which makes successful predictions? What map-making processes produce successful self-referential maps? Once level 1 is nailed down, we should have the tools to talk about level 2: running counterfactuals in which we change the map-making process.

Of course, not all questions of self-reasoning/improvement are about abstraction. We could also ask questions about e.g. how to make an agent which modifies its own code to run faster, without changing input/output (though of course input/output are slippery notions in an embedded world…). We could ask questions about how to make an agent modify its own decision theory. Etc. These problems don’t inherently involve abstraction. My intuition, however, is that the problems which don’t involve self-referential abstraction usually seem easier. That’s not to say people shouldn’t work on them—there’s certainly value there, and they seem more amenable to incremental progress—but the critical path to a workable theory of embedded agency seems to go through self-referential maps and map-makers.


Subsystem Alignment

Agents made of parts have subsystems. Insofar as those subsystems are also agenty and have goals of their own, we want them to be aligned with the top-level agent. What new requirements does this pose for a theory of abstraction?

First and foremost, if we want to talk about agent subsystems, then our map can’t just black-box the whole agent. We can’t circumvent the lack of an agent-environment boundary by simply drawing our own agent-environment boundary, and ignoring everything on the “agent” side. That doesn’t necessarily mean that we can’t do any self-referential black-boxing. For instance, if we want to represent a map which contains a copy of itself, then a natural method is to use a data structure which contains a pointer to itself. That sort of strategy has not necessarily been ruled out, but we can’t just blindly apply it to the whole agent.
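The pointer idea fits in a few lines (a Python sketch of the strategy just mentioned, nothing more): the map stores a reference to itself rather than an infinite tower of nested copies.

```python
# A map whose territory contains the map itself: represent the embedded
# copy as a pointer (reference), not as a nested duplicate.
world_map = {"territory": "NYC streets", "maps_in_territory": []}
world_map["maps_in_territory"].append(world_map)

# The structure is finite, yet following the pointer any number of times
# always lands back on the same map object.
assert world_map["maps_in_territory"][0] is world_map
```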

In particular, if we’re working with causal models (possibly with symmetry), then the details of the map-making process and the reflecting-on-map-making process and whatnot all need to be causal as well. We can’t call on oracles or non-constructive existence theorems or other such magic. Loosely speaking, our theory of abstraction needs to be computable.

In addition, we don’t just want to model the agent as having parts, we want to model some of the parts as agenty—or at least consider that possibility. In particular, that means we need to talk about other maps and other map-makers embedded in the environment. We want to be able to recognize map-making processes embedded in the territory. And again, this all needs to be computable, so we need algorithms to recognize map-making processes embedded in the territory.

We’re talking about these capabilities in the context of aligning subagents, but this is really a key requirement for alignment more broadly. Ultimately, we want to point at something in the territory and say “See that agenty thing over there? That’s a human; there’s a bunch of them out in the world. Figure out their values, and help satisfy those values.” Recognizing agents embedded in the territory is a key piece of this, and recognizing embedded map-making processes seems to me like the hardest part of that problem—again, it’s on the critical path.


Recap

Time for a recap.

The idea of abstraction is to throw out information, while still maintaining the ability to provide reliable predictions on at least some queries.

In order to address the core problems of embedded world-models, a theory of abstraction would need to first handle some “simple” questions:

  • Characterize which queries work on which maps of which territories.

  • Characterize which query classes admit significantly-compressed maps on which territories.

  • Characterize map-making processes which produce reliable maps.

  • Translate queries between map-representation and territory-representation, and between different map-representations.

We hope that a theory which addresses these problems on non-self-referential maps will suggest natural objectives/rules for self-referential maps.

Embedded decision theory adds a few more constraints, in order to define counterfactuals for optimization:

  • Our theory of abstraction should work with causal models for both the territory and the map.

  • We need ways of mapping between counterfactuals on the map and counterfactuals on the territory.

  • Agents need some way to recognize their own computations/outputs in the territory, and represent them in the map.

A theory of abstraction over causal models seems necessary just to talk about embedded decision theory in a well-defined way.

Self-reasoning kicks self-referential map-making one rung up the meta-ladder, and starts to talk about maps of map-making processes and related issues. These aren’t the only problems of self-reasoning, but it does feel like self-referential abstraction captures the “hard part”—it’s on the critical path to a full theory.

Finally, subsystems push us to make the entire theory of abstraction causal/computable. They also require algorithms for recognizing agents—and thus map-makers—embedded in the territory. That’s a problem we probably want to solve for safety purposes anyway. Again, abstraction isn’t the only part of the problem, but it seems to capture enough of the hard part to be on the critical path.