Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

This newsletter is an extended summary of the recently released Cartesian frames sequence.

Cartesian Frames(Scott Garrabrant) (summarized by Rohin): The embedded agency sequence (AN #31) hammered in the fact that there is no clean, sharp dividing line between an agent and its environment. This sequence proposes an alternate formalism: Cartesian frames. Note this is a paradigm that helps us think about agency: you should not be expecting some novel result that, say, tells us how to look at a neural net and find agents within it.

The core idea is that rather than assuming the existence of a Cartesian dividing line, we consider how such a dividing line could be constructed. For example, when we think of a sports team as an agent, the environment consists of the playing field and the other team; but we could also consider a specific player as an agent, in which case the environment consists of the rest of the players (on both teams) and the playing field. Each of these are valid ways of carving up what actually happens into an “agent” and an “environment”, they are frames by which we can more easily understand what’s going on, hence the name “Cartesian frames”.

A Cartesian frame takes choice as fundamental: the agent is modeled as a set of options that it can freely choose between. This means that the formulation cannot be directly applied to deterministic physical laws. It instead models what agency looks like “from the inside”. If you are modeling a part of the world as capable of making choices, then a Cartesian frame is appropriate to use to understand the perspective of that choice-making entity.

Formally, a Cartesian frame consists of a set of agent options A, a set of environment options E, a set of possible worlds W, and an interaction function that, given an agent option and an environment option, specifies which world results. Intuitively, the agent can “choose” an agent option, the environment can “choose” an environment option, and together these produce some world. You might notice that we’re treating the agent and environment symmetrically; this is intentional, and means that we can define analogs of all of our agent notions for environments as well (though they may not have nice philosophical interpretations).

The full sequence uses a lot of category theory to define operations on these sorts of objects and show various properties of the objects and their operations. I will not be summarizing this here; instead, I will talk about their philosophical interpretations.

First, let’s look at an example of using a Cartesian frame on something that isn’t typically thought of as an agent: the atmosphere, within the broader climate system. The atmosphere can “choose” whether to trap sunlight or not. Meanwhile, in the environment, either the ice sheets could melt or they could not. If sunlight is trapped and the ice sheets melt, then the world is Hot. If exactly one of these is true, then the world is Neutral. Otherwise, the world is Cool.

(Yes, this seems very unnatural. That’s good! The atmosphere shouldn’t be modeled as an agent! I’m choosing this example because its unintuitive nature makes it more likely that you think about the underlying rule, rather than just the superficial example. I will return to more intuitive examples later.)

Controllables

A property of the world is something like “it is neutral or warmer”. An agent can ensure a property if it has some option such that no matter what environment option is chosen, the property is true of the resulting world. The atmosphere could ensure the warmth property above by “choosing” to trap sunlight. Similarly the agent can prevent a property if it can guarantee that the property will not hold, regardless of the environment option. For example, the atmosphere can prevent the property “it is hot”, by “choosing” not to trap sunlight. The agent can control a property if it can both ensure and prevent it. In our example, there is no property that the atmosphere can control.

Coarsening or refining worlds

We often want to describe reality at different levels of abstraction. Sometimes we would like to talk about the behavior of various companies; at other times we might want to look at an individual employee. We can do this by having a function that maps low-level (refined) worlds to high-level (coarsened) worlds. In our example above, consider the possible worlds {YY, YN, NY, NN}, where the first letter of a world corresponds to whether sunlight was trapped (Yes or No), and the second corresponds to whether the ice sheets melted. The worlds {Hot, Neutral, Cool} that we had originally are a coarsened version of this, where we map YY to Hot, YN and NY to Neutral, and NN to Cool.

Interfaces

A major upside of Cartesian frames is that given the set of possible worlds that can occur, we can choose how to divide it up into an “agent” and an “environment”. Most of the interesting aspects of Cartesian frames are in the relationships between different ways of doing this division, for the same set of possible worlds.

First, we have interfaces. Given two different Cartesian frames <A, E, W> and <B, F, W> with the same set of worlds, an interface allows us to interpret the agent A as being used in place of the agent B. Specifically, if A would choose an option a, the interface maps this to one of B’s options b. This is then combined with the environment option f (from F) to produce a world w.

A valid interface also needs to be able to map the environment option f to e, and then combine it with the agent option a to get the world. This alternate way of computing the world must always give the same answer.

Since A can be used in place of B, all of A’s options must have equivalents in B. However, B could have options that A doesn’t. So the existence of this interface implies that A is “weaker” in a sense than B. (There are a bunch of caveats here.)

(Relevant terms in the sequence: morphism)

Decomposing agents into teams of subagents

The first kind of subagent we will consider is a subagent that can control “part of” the agent’s options. Consider for example a coordination game, where there are N players who each individually can choose whether or not to press a Big Red Button. There are only two possible worlds: either the button is pressed, or it is not pressed. For now, let’s assume there are two players, Alice and Bob.

One possible Cartesian frame is the frame for the entire team. In this case, the team has perfect control over the state of the button—the agent options are either to press the button or not to press the button, and the environment does not have any options (or more accurately, it has a single “do nothing” option).

However, we can also decompose this into separate Alice and Bob subagents. What does a Cartesian frame for Alice look like? Well, Alice also has two options—press the button, or don’t. However, Alice does not have perfect control over the result: from her perspective, Bob is part of the environment. As a result, for Alice, the environment also has two options—press the button, or don’t. The button is pressed if Alice presses it or if the environment presses it. (The Cartesian frame for Bob is identical, since he is in the same position that Alice is in.)

Note however that this decomposition isn’t perfect: given the Cartesian frames for Alice and Bob, you cannot uniquely recover the original Cartesian frame for the team. This is because both Alice and Bob’s frames say that the environment has some ability to press the button—we know that this is just from Alice and Bob themselves, but given just the frames we can’t be sure that there isn’t a third person Charlie who also might press the button. So, when we combine Alice and Bob back into the frame for a two-person team, we don’t know whether or not the environment should have the ability to press the button. This makes the mathematical definition of this kind of subagent a bit trickier though it still works out.

Another important note is that this is relative to how coarsely you model the world. We used a fairly coarse model in this example: only whether or not the button was pressed. If we instead used a finer model that tracked which subset of people pressed the button, then we would be able to uniquely recover the team’s Cartesian frame from Alice and Bob’s individual frames.

(Relevant terms in the sequence: multiplicative subagents, sub-tensors, tensors)

Externalizing and internalizing

This decomposition isn’t just for teams of people: even a single “mind” can often be thought of as the interaction of various parts. For example, hierarchical decision-making can be thought of as the interaction between multiple agents at different levels of the hierarchy.

This decomposition can be done using externalization. Externalization allows you to take an existing Cartesian frame and some specific property of the world, and then construct a new Cartesian frame where that property of the world is controlled by the environment.

Concretely, let’s imagine a Cartesian frame for Alice that represents her decision on whether to cook a meal or eat out. If she chooses to cook a meal, then she must also decide which recipe to follow. If she chooses to eat out, she must decide which restaurant to eat out at.

We can externalize the high-level choice of whether Alice cooks a meal or eats out. This results in a Cartesian frame where the environment chooses whether Alice is cooking or eating out, and the agent must then choose a restaurant or recipe as appropriate. This is the Cartesian frame corresponding to the low-level policy that must pursue whatever subgoal is chosen by the high-level planning module (which is now part of the environment). The agent of this frame is a subagent of Alice.

The reverse operation is called internalization, where some property of the world is brought under the control of the agent. In the above example, if we take the Cartesian frame for the low-level policy, and then internalize the cooking / eating out choice, we get back the Cartesian frame for Alice as a unified whole.

Note that in general externalization and internalization are not inverses of each other. As a simple example, if you externalize something that is already “in the environment” (e.g. whether it is raining, in a frame for Alice), that does nothing, but when you then internalize it, that thing is now assumed to be under the agent’s control (e.g. now the “agent” in the frame can control whether or not it is raining). We will return to this point when we talk about observability.

Decomposing agents into disjunctions of subagents

Our subagents so far have been “team-based”: the original agent could be thought of as a supervisor that got to control all of the subagents together. (The team agent in the button-pressing game could be thought of as controlling both Alice and Bob’s actions; in the cooking / eating out example Alice could be thought of as controlling both the high-level subgoal selection as well as the low-level policy that executes on the subgoals.)

The sequence also introduces another decomposition into subagents, where the superagent can be thought of as a supervisor that gets to choose which of the subagents gets to control the overall behavior. Thus, the superagent can do anything that either of the subagents could do.

Let’s return to our cooking / eating out example. We previously saw that we could decompose Alice into a high-level subgoal-choosing subagent that chooses whether to cook or eat out, and a low-level subgoal-execution subagent that then chooses which recipe to make or which restaurant to go to. We can also decompose Alice as being the choice of two subagents: one that chooses which restaurant to go to, and one that chooses which recipe to make. The union of these subagents is an agent that first chooses whether to go to a restaurant or to make a recipe, and then uses the appropriate subagent to choose the restaurant or recipe: this is exactly a description of Alice.

(Relevant terms in the sequence: additive subagents, sub-sums, sums)

Committing and assuming

One way to think about the subagents of the previous example is that they are the result of Alice committing to a particular subset of choices. If Alice commits to eating out (but doesn’t specify at what restaurant), then the resulting frame is equivalent to the restaurant-choosing subagent.

Similarly to committing, we can also talk about assuming. Just as commitments restrict the set of options available to the agent, assumptions restrict the set of options available to the environment.

Just as we can union two agents together to get an agent that gets to choose between two subagents, we can also union two environments together to get an environment that gets to choose between two subenvironments. (In this case the agent is more constrained: it must be able to handle the environment regardless of which way the environment chooses.)

(Relevant terms in the sequence: product)

Observables

The most interesting (to me) part of this sequence was the various equivalent definitions of what it means for something to be observable. The overall story is similar to the one in Knowledge is Freedom: an agent is said to “observe” a property P if it is capable of making different decisions based on whether P holds or not.

Thus we get our first definition of observability: a property P of the world is observable if, for any two agent options a and b, the agent also has an option that is equivalent to “if P then a else b”.

Intuitively, this is meant to be similar to the notion of “inputs” to an agent. Intuitively, a neural net should be able to express arbitrary computations over its inputs, and so if we view the neural net as “choosing” what computation to do (by “choosing” what its parameters are), then the neural net can have its outputs (agent options) depend in arbitrary ways on the inputs. Thus, we say that the neural net “observes” its inputs, because what the neural net does can depend freely on the inputs.

Note that this is a very black-or-white criterion: we must be able to express every conditional policy on the property for it to be observable; if even one such policy is not expressible then the property is not observable.

One way to think about this is that an observable property needs to be completely under the control of the environment, that is, the environment option should completely determine whether the resulting world satisfies the property or not—nothing the agent does can matter (for this property). To see this, suppose that there was some environment option e that didn’t fully determine a property P, so that there are agent options a and b such that the world corresponding to (a, e) satisfies P but the one corresponding to (b, e) does not. Then our agent cannot implement the conditional policy “if P then b else a”, because it would lead to a self-referential contradiction (akin to “this sentence is false”) when the environment chooses e. Thus, P cannot be observable.

This is not equivalent to observability: it is possible for the environment to fully control P, while the agent is still unable to always condition on P. So we do need something extra. Nevertheless, this intuition suggests a few other ways of thinking about observability. The key idea is to identify a decomposition of the agent based on P that should only work if the environment has all the control, and then to identify a union step that puts the agent back together, that automatically adds in all of the policies that are conditional on P. I’ll describe these definitions here; the sequence proves that they are in fact equivalent to the original definition above.

First, recall that externalization and internalization are methods that allow us to “transfer” control of some property from the agent to the environment and vice versa. Thus, if all the control of P is in the environment, one would hope that internalization followed by externalization just transfers the control back and forth. In addition, when we externalize P, the externalization process will enforce that the agent can condition on P arbitrarily (this is how it is defined). This suggests the definition: P is observable if and only if internalizing P followed by externalizing P gives us back the original frame.

Second, if the environment has all of the control over P, then we should be able to decompose the agent into two parts: one that decides what to do when P is true, and one that decides what to do when P is false. We can achieve this using assumptions, that is, the first agent is the original agent under the assumption that P is true, and the second is under the assumption that P is false. Note that if the environment didn’t have perfect control over P, this would not work, as the environment options where P is not guaranteed to be true or false would simply be deleted, and could not be reconstructed from the two new agents.

We now need to specify how to put the agents back together, in a way that includes all the conditional policies on P. There are actually two variants in how we can do this:

In the first case, we combine the agents by unioning the environments, which lets the environment choose whether P is true or not. Given how this union is defined, the new agent is able to specify both what to do given the environment’s choice, as well as what it would have done in the counterfactual case where the environment had decided P differently. This allows it to implement all conditional policies on P. So, P is observable if and only if decomposing the frame using assumptions on P, and then unioning the environments of the resulting frames gives back the original frame.

In the second case, after getting agents via assumption on P, you extend each agent so that in the case where its assumption is false, it is as though it takes a noop action. Intuitively, the resulting agent is an agent that is hobbled so that it has no power in worlds where P comes out differently than was assumed. These agents are then combined into a team. Intuitively, the team selects an option of the form “the first agent tries to do X (which only succeeds when P is true) and the second agent tries to do Y (which only succeeds when P is false)”. Like the previous decomposition, this specifies both what to do in whatever actual environment results, as well as what would have been done in the counterfactual world where the value of P was reversed. Thus, this way of combining the agents once again adds in all conditional policies on P. So, P is observable if and only if decomposing the frame using assumptions on P, then hobbling the resulting frames in cases where their assumptions are false, and then putting the agents back in a team, is equivalent to the original frame.

Time

Cartesian frames do not have an intrinsic notion of time. However, we can still use them to model sequential processes, by having the agent options be policies rather than actions, and having the worlds be histories or trajectories rather than states.

To say useful things about time, we need to broaden our notion of observables. So far I’ve been talking about whether you can observe binary properties P that are either true or false. In fact, all of the definitions can be easily generalized to n-ary properties P that can take on one of N values. We’ll be using this notion of observability here.

Consider a game of chess where Alice plays as white and Bob as black. Intuitively, when Alice is choosing her second move, she can observe Bob’s first move. However, the property “Bob’s first move” would not be observable in Alice’s Cartesian frame, because Alice’s first move cannot depend on Bob’s first move (since Bob hasn’t made it yet), and so when deciding the first move we can’t implement policies that condition on what Bob’s first move is.

Really, we want some way to say “after Alice has made her first move, from the perspective of the rest of her decisions, Bob’s first move is observable”. But we know how to remove some control from the agent in order to get the perspective of “everything else”—that’s externalization! In particular, in Alice’s frame, if we externalize the property “Alice’s first move”, then the property “Bob’s first move” is observable in the new frame.

This suggests a way to define a sequence of frames that represent the passage of time: we define the Tth frame as “the original frame, but with the first T moves externalized”, or equivalently as “the T-1th frame, but with the Tth move externalized”. Each of these frames are subagents of the original frame, since we can think of the full agent (Alice) as the team of “the agent that plays the first T moves” and “the agent that plays the T+1th move and onwards”. As you might expect, as “time” progresses, the agent loses controllables and gains observables. For example, by move 3 Alice can no longer control her first two moves, but she can now observe Bob’s first two moves, relative to Alice at the beginning of the game.

Rohin’s opinion: I like this way of thinking about agency: we’ve been talking about “where to draw the line around the agent” for quite a while in AI safety, but there hasn’t been a nice formalization of this until now. In particular, it’s very nice that we can compare different ways of drawing the line around the agent, and make precise various concepts around this, such as “subagent”.

I’ve also previously liked the notion that “to observe P is to be able to change your decisions based on the value of P”, but I hadn’t really seen much discussion about it until now. This sequence makes some real progress on conceptual understanding of this perspective: in particular, the notion that observability requires “all the control to be in the environment” is not one I had until now. (Though I should note that this particular phrasing is mine, and I’m not sure the author would agree with the phrasing.)

One of my checks for the utility of foundational theory for a particular application is to see whether the key results can be explained without having to delve into esoteric mathematical notation. I think this sequence does very well on this metric—for the most part I didn’t even read the proofs, yet I was able to reconstruct conceptual arguments for many of the theorems that are convincing to me. (They aren’t and shouldn’t be as convincing as the proofs themselves.) However, not all of the concepts score so well on this—for example, the generic subagent definition was sufficiently unintuitive to me that I did not include it in this summary.

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

## [AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment

Link post

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter

resources here. In particular, you can look throughthis spreadsheetof all summaries that have ever been in the newsletter.Audio version

here(may not be up yet).Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

This newsletter is an extended summary of the recently released Cartesian frames sequence.

Cartesian Frames(Scott Garrabrant)(summarized by Rohin): Theembedded agency sequence(AN #31) hammered in the fact that there is no clean, sharp dividing line between an agent and its environment. This sequence proposes an alternate formalism: Cartesian frames. Note this is a paradigm that helps usthink about agency: you should not be expecting some novel result that, say, tells us how to look at a neural net and find agents within it.The core idea is that rather than

assumingthe existence of a Cartesian dividing line, we consider how such a dividing line could beconstructed. For example, when we think of a sports team as an agent, the environment consists of the playing field and the other team; but we could also consider a specific player as an agent, in which case the environment consists of the rest of the players (on both teams) and the playing field. Each of these are valid ways of carving up what actually happens into an “agent” and an “environment”, they areframesby which we can more easily understand what’s going on, hence the name “Cartesian frames”.A Cartesian frame takes

choiceas fundamental: the agent is modeled as a set of options that it can freely choose between. This means that the formulation cannot be directly applied to deterministic physical laws. It instead models what agency looks like“from the inside”.Ifyou are modeling a part of the world as capable of making choices,thena Cartesian frame is appropriate to use to understand the perspective of that choice-making entity.Formally, a Cartesian frame consists of a set of agent options A, a set of environment options E, a set of possible worlds W, and an interaction function that, given an agent option and an environment option, specifies which world results. Intuitively, the agent can “choose” an agent option, the environment can “choose” an environment option, and together these produce some world. You might notice that we’re treating the agent and environment symmetrically; this is intentional, and means that we can define analogs of all of our agent notions for environments as well (though they may not have nice philosophical interpretations).

The full sequence uses a lot of category theory to define operations on these sorts of objects and show various properties of the objects and their operations. I will not be summarizing this here; instead, I will talk about their philosophical interpretations.

First, let’s look at an example of using a Cartesian frame on something that isn’t typically thought of as an agent: the atmosphere, within the broader climate system. The atmosphere can “choose” whether to trap sunlight or not. Meanwhile, in the environment, either the ice sheets could melt or they could not. If sunlight is trapped and the ice sheets melt, then the world is Hot. If exactly one of these is true, then the world is Neutral. Otherwise, the world is Cool.

(Yes, this seems very unnatural. That’s good! The atmosphere shouldn’t be modeled as an agent! I’m choosing this example because its unintuitive nature makes it more likely that you think about the underlying rule, rather than just the superficial example. I will return to more intuitive examples later.)

ControllablesA

propertyof the world is something like “it is neutral or warmer”. An agent canensurea property if it has some option such that no matter what environment option is chosen, the property is true of the resulting world. The atmosphere could ensure the warmth property above by “choosing” to trap sunlight. Similarly the agent canpreventa property if it can guarantee that the property will not hold, regardless of the environment option. For example, the atmosphere can prevent the property “it is hot”, by “choosing” not to trap sunlight. The agent cancontrola property if it can both ensure and prevent it. In our example, there is no property that the atmosphere can control.Coarsening or refining worldsWe often want to describe reality at different levels of abstraction. Sometimes we would like to talk about the behavior of various companies; at other times we might want to look at an individual employee. We can do this by having a function that maps low-level (refined) worlds to high-level (coarsened) worlds. In our example above, consider the possible worlds {YY, YN, NY, NN}, where the first letter of a world corresponds to whether sunlight was trapped (Yes or No), and the second corresponds to whether the ice sheets melted. The worlds {Hot, Neutral, Cool} that we had originally are a coarsened version of this, where we map YY to Hot, YN and NY to Neutral, and NN to Cool.

InterfacesA major upside of Cartesian frames is that given the set of possible worlds that can occur, we can choose how to divide it up into an “agent” and an “environment”. Most of the interesting aspects of Cartesian frames are in the relationships between different ways of doing this division, for the same set of possible worlds.

First, we have interfaces. Given two different Cartesian frames <A, E, W> and <B, F, W> with the same set of worlds, an interface allows us to interpret the agent A as being used in place of the agent B. Specifically, if A would choose an option a, the interface maps this to one of B’s options b. This is then combined with the environment option f (from F) to produce a world w.

A valid interface also needs to be able to map the environment option f to e, and then combine it with the agent option a to get the world. This alternate way of computing the world must always give the same answer.

Since A can be used in place of B, all of A’s options must have equivalents in B. However, B could have options that A doesn’t. So the existence of this interface implies that A is “weaker” in a sense than B. (There are a bunch of caveats here.)

(Relevant terms in the sequence:

morphism)Decomposing agents into teams of subagentsThe first kind of subagent we will consider is a subagent that can control “part of” the agent’s options. Consider for example a coordination game, where there are N players who each individually can choose whether or not to press a Big Red Button. There are only two possible worlds: either the button is pressed, or it is not pressed. For now, let’s assume there are two players, Alice and Bob.

One possible Cartesian frame is the frame for the entire team. In this case, the team has perfect control over the state of the button—the agent options are either to press the button or not to press the button, and the environment does not have any options (or more accurately, it has a single “do nothing” option).

However, we can also decompose this into separate Alice and Bob

subagents. What does a Cartesian frame for Alice look like? Well, Alice also has two options—press the button, or don’t. However, Alice does not have perfect control over the result: from her perspective, Bob is part of the environment. As a result, for Alice, the environment also has two options—press the button, or don’t. The button is pressed if Alice presses itorif the environment presses it. (The Cartesian frame for Bob is identical, since he is in the same position that Alice is in.)Note however that this decomposition isn’t perfect: given the Cartesian frames for Alice and Bob, you cannot uniquely recover the original Cartesian frame for the team. This is because both Alice and Bob’s frames say that the environment has some ability to press the button—

weknow that this is just from Alice and Bob themselves, but given just the frames we can’t be sure that there isn’t a third person Charlie who also might press the button. So, when we combine Alice and Bob back into the frame for a two-person team, we don’t know whether or not the environment should have the ability to press the button. This makes the mathematical definition of this kind of subagent a bit trickier though it still works out.Another important note is that this is relative to how coarsely you model the world. We used a fairly coarse model in this example: only whether or not the button was pressed. If we instead used a finer model that tracked which subset of people pressed the button, then we

wouldbe able to uniquely recover the team’s Cartesian frame from Alice and Bob’s individual frames.(Relevant terms in the sequence:

multiplicative subagents, sub-tensors, tensors)Externalizing and internalizingThis decomposition isn’t just for teams of people: even a single “mind” can often be thought of as the interaction of various parts. For example, hierarchical decision-making can be thought of as the interaction between multiple agents at different levels of the hierarchy.

This decomposition can be done using

externalization. Externalization allows you to take an existing Cartesian frame and some specific property of the world, and then construct a new Cartesian frame where that property of the world is controlled by the environment.Concretely, let’s imagine a Cartesian frame for Alice that represents her decision on whether to cook a meal or eat out. If she chooses to cook a meal, then she must also decide which recipe to follow. If she chooses to eat out, she must decide which restaurant to eat out at.

We can externalize the high-level choice of whether Alice cooks a meal or eats out. This results in a Cartesian frame where the environment chooses whether Alice is cooking or eating out, and the agent must then choose a restaurant or recipe as appropriate. This is the Cartesian frame corresponding to the low-level policy that must pursue whatever subgoal is chosen by the high-level planning module (which is now part of the environment). The agent of this frame is a subagent of Alice.

The reverse operation is called internalization, where some property of the world is brought under the control of the agent. In the above example, if we take the Cartesian frame for the low-level policy, and then internalize the cooking / eating out choice, we get back the Cartesian frame for Alice as a unified whole.

Note that in general externalization and internalization are

notinverses of each other. As a simple example, if you externalize something that is already “in the environment” (e.g. whether it is raining, in a frame for Alice), that does nothing, but when you then internalize it, that thing is now assumed to be under the agent’s control (e.g. now the “agent” in the frame can control whether or not it is raining). We will return to this point when we talk about observability.Decomposing agents into disjunctions of subagentsOur subagents so far have been “team-based”: the original agent could be thought of as a supervisor that got to control all of the subagents together. (The team agent in the button-pressing game could be thought of as controlling both Alice and Bob’s actions; in the cooking / eating out example Alice could be thought of as controlling both the high-level subgoal selection as well as the low-level policy that executes on the subgoals.)

The sequence also introduces another decomposition into subagents, where the superagent can be thought of as a supervisor that gets to choose

whichof the subagents gets to control the overall behavior. Thus, the superagent can do anything that either of the subagents could do.Let’s return to our cooking / eating out example. We previously saw that we could decompose Alice into a high-level subgoal-choosing subagent that chooses whether to cook or eat out, and a low-level subgoal-execution subagent that then chooses which recipe to make or which restaurant to go to. We can also decompose Alice as being the choice of two subagents: one that chooses which restaurant to go to, and one that chooses which recipe to make. The union of these subagents is an agent that first chooses whether to go to a restaurant or to make a recipe, and then uses the appropriate subagent to choose the restaurant or recipe: this is exactly a description of Alice.

(Relevant terms in the sequence:

additive subagents, sub-sums, sums)Committing and assumingOne way to think about the subagents of the previous example is that they are the result of Alice

committingto a particular subset of choices. If Alice commits to eating out (but doesn’t specify at what restaurant), then the resulting frame is equivalent to the restaurant-choosing subagent.Similarly to committing, we can also talk about

assuming. Just as commitments restrict the set of options available to the agent, assumptions restrict the set of options available to the environment.Just as we can union two agents together to get an agent that gets to choose between two subagents, we can also union two environments together to get an environment that gets to choose between two subenvironments. (In this case the agent is more constrained: it must be able to handle the environment regardless of which way the environment chooses.)

(Relevant terms in the sequence:

product)ObservablesThe most interesting (to me) part of this sequence was the various equivalent definitions of what it means for something to be observable. The overall story is similar to the one in

Knowledge is Freedom: an agent is said to “observe” a property P if it is capable of making different decisions based on whether P holds or not.Thus we get our first definition of observability:

a property P of the world isobservableif, for any two agent options a and b, the agent also has an option that is equivalent to “if P then a else b”.Intuitively, this is meant to be similar to the notion of “inputs” to an agent. Intuitively, a neural net should be able to express arbitrary computations over its inputs, and so if we view the neural net as “choosing” what computation to do (by “choosing” what its parameters are), then the neural net can have its outputs (agent options) depend in arbitrary ways on the inputs. Thus, we say that the neural net “observes” its inputs, because what the neural net does can depend freely on the inputs.

Note that this is a very black-or-white criterion: we must be able to express

everyconditional policy on the property for it to be observable; if even one such policy is not expressible then the property is not observable.One way to think about this is that an observable property needs to be completely under the control of the environment, that is, the environment option should completely determine whether the resulting world satisfies the property or not—nothing the agent does can matter (for this property). To see this, suppose that there was some environment option e that didn’t fully determine a property P, so that there are agent options a and b such that the world corresponding to (a, e) satisfies P but the one corresponding to (b, e) does not. Then our agent cannot implement the conditional policy “if P then b else a”, because it would lead to a self-referential contradiction (akin to “this sentence is false”) when the environment chooses e. Thus, P cannot be observable.

This is not equivalent to observability: it is possible for the environment to fully control P, while the agent is still unable to always condition on P. So we do need something extra. Nevertheless, this intuition suggests a few other ways of thinking about observability. The key idea is to identify a decomposition of the agent based on P that should only work if the environment has all the control, and then to identify a union step that puts the agent back together, that automatically adds in all of the policies that are conditional on P. I’ll describe these definitions here; the sequence proves that they are in fact equivalent to the original definition above.

First, recall that externalization and internalization are methods that allow us to “transfer” control of some property from the agent to the environment and vice versa. Thus, if all the control of P is in the environment, one would hope that internalization followed by externalization just transfers the control back and forth. In addition, when we externalize P, the externalization process will enforce that the agent can condition on P arbitrarily (this is how it is defined). This suggests the definition:

P is observable if and only if internalizing P followed by externalizing P gives us back the original frame.Second, if the environment has all of the control over P, then we should be able to decompose the agent into two parts: one that decides what to do when P is true, and one that decides what to do when P is false. We can achieve this using

assumptions, that is, the first agent is the original agent under the assumption that P is true, and the second is under the assumption that P is false. Note that if the environment didn’t have perfect control over P, this would not work, as the environment options where P is not guaranteed to be true or false would simply be deleted, and could not be reconstructed from the two new agents.We now need to specify how to put the agents back together, in a way that includes all the conditional policies on P. There are actually two variants in how we can do this:

In the first case, we combine the agents by unioning the environments, which lets the environment choose whether P is true or not. Given how this union is defined, the new agent is able to specify both what to do given the environment’s choice,

as well aswhat it would have done in the counterfactual case where the environment had decided P differently. This allows it to implement all conditional policies on P. So,P is observable if and only if decomposing the frame using assumptions on P, and then unioning the environments of the resulting frames gives back the original frame.In the second case, after getting agents via assumption on P, you extend each agent so that in the case where its assumption is false, it is as though it takes a noop action. Intuitively, the resulting agent is an agent that is hobbled so that it has no power in worlds where P comes out differently than was assumed. These agents are then combined into a team. Intuitively, the team selects an option of the form “the first agent tries to do X (which only succeeds when P is true) and the second agent tries to do Y (which only succeeds when P is false)”. Like the previous decomposition, this specifies both what to do in whatever actual environment results, as well as what would have been done in the counterfactual world where the value of P was reversed. Thus, this way of combining the agents once again adds in all conditional policies on P. So,

P is observable if and only if decomposing the frame using assumptions on P, then hobbling the resulting frames in cases where their assumptions are false, and then putting the agents back in a team, is equivalent to the original frame.TimeCartesian frames do not have an intrinsic notion of time. However, we can still use them to model sequential processes, by having the agent options be

policiesrather than actions, and having the worlds be histories or trajectories rather than states.To say useful things about time, we need to broaden our notion of observables. So far I’ve been talking about whether you can observe binary properties P that are either true or false. In fact, all of the definitions can be easily generalized to n-ary properties P that can take on one of N values. We’ll be using this notion of observability here.

Consider a game of chess where Alice plays as white and Bob as black. Intuitively, when Alice is choosing her second move, she can observe Bob’s first move. However, the property “Bob’s first move” would not be observable in Alice’s Cartesian frame, because Alice’s

firstmove cannot depend on Bob’s first move (since Bob hasn’t made it yet), and so when deciding the first move we can’t implement policies that condition on what Bob’s first move is.Really, we want some way to say “after Alice has made her first move, from the perspective of the rest of her decisions, Bob’s first move is observable”. But we know how to remove some control from the agent in order to get the perspective of “everything else”—that’s externalization! In particular, in Alice’s frame, if we externalize the property “Alice’s first move”, then the property “Bob’s first move”

isobservable in the new frame.This suggests a way to define a sequence of frames that represent the passage of time: we define the Tth frame as “the original frame, but with the first T moves externalized”, or equivalently as “the T-1th frame, but with the Tth move externalized”. Each of these frames are subagents of the original frame, since we can think of the full agent (Alice) as the team of “the agent that plays the first T moves” and “the agent that plays the T+1th move and onwards”. As you might expect, as “time” progresses, the agent loses controllables and gains observables. For example, by move 3 Alice can no longer control her first two moves, but she can now observe Bob’s first two moves, relative to Alice at the beginning of the game.

Rohin’s opinion:I like this way of thinking about agency: we’ve been talking about “where to draw the line around the agent” for quite a while in AI safety, but there hasn’t been a nice formalization of this until now. In particular, it’s very nice that we can compare different ways of drawing the line around the agent, and make precise various concepts around this, such as “subagent”.I’ve also previously liked the notion that “to observe P is to be able to change your decisions based on the value of P”, but I hadn’t really seen much discussion about it until now. This sequence makes some real progress on conceptual understanding of this perspective: in particular, the notion that observability requires “all the control to be in the environment” is not one I had until now. (Though I should note that this particular phrasing is mine, and I’m not sure the author would agree with the phrasing.)

One of my checks for the utility of foundational theory for a particular application is to see whether the key results can be explained without having to delve into esoteric mathematical notation. I think this sequence does very well on this metric—for the most part I didn’t even read the proofs, yet I was able to reconstruct conceptual arguments for many of the theorems that are convincing to me. (They aren’t and shouldn’t be as convincing as the proofs themselves.) However, not all of the concepts score so well on this—for example, the generic subagent definition was sufficiently unintuitive to me that I did not include it in this summary.

FEEDBACKI’m always happy to hear feedback; you can send it to me,

Rohin Shah, byreplying to this email.PODCASTAn audio podcast version of the

Alignment Newsletteris available. This podcast is an audio version of the newsletter, recorded byRobert Miles.