Formalizing «Boundaries» with Markov blankets + Criticism of this approach
In this post, I distill how «boundaries» might be formalized in terms of Markov blankets. I also summarize criticism of the practical application of this approach from Abram Demski.
(Demski doesn’t think Markov blankets are useful/complete for formalizing «boundaries» because Markov blankets can’t handle agents that move through space.)
Note: I don’t actually think «boundaries»-as-explained-here is the best way forward! I’m writing this post to set up my next post on an alternative formulation (that I call “membranes”) which comes from a somewhat different perspective.
Here’s what I consider to be the main hypothesis for the agenda of directly applying «boundaries» to AI safety: most (if not all) instances of active harm from AI can be formally described as forceful violation of the ~objective (or ~intersubjective) causal separation between humans and their environment. (For example, someone being murdered would be a violation of their physical ‘boundary’, and someone being mind-controlled would be a violation of their informational ‘boundary’.)
And here’s what I consider to be the ultimate goal: To create safety by formally and ~objectively specifying «boundaries» and respect of «boundaries» as an outer alignment safety goal. I.e.: have AI systems respect the boundaries of humans.
The main premise I see here is that there exists some meaningful causal separation between humans and their environment that can be observed externally.
Work by other researchers: Davidad is optimistic about this idea and hopes to use it in his Open Agency Architecture (OAA) safety paradigm. Prior work on the topic has also been done in the «Boundaries» sequence (Andrew Critch) and in Cartesian Frames (Scott Garrabrant). For more, see the «Boundaries/Membranes» and AI safety compilation or the Boundaries / Membranes tag.
How could «boundaries» be formally specified? One way that has been proposed by past research is by using Markov blankets.
[The section below is largely a conceptual distillation of Andrew Critch’s Part 3a: Defining boundaries as directed Markov blankets.]
(I also want to flag that Markov blankets are an important concept in Active Inference.)
Explaining Markov blankets
By the end of this section, I intend for you to understand the following (Pearlian causal) diagram:
(Note: I will assume a basic familiarity with Markov chains in this post.)
First, I want you to imagine a simple Markov chain that represents the fact that a human influences itself over time:
Second, I want you to imagine a Markov chain that represents the fact that the environment influences itself over time:
Okay. Now, notice that in between the human and their environment there’s some kind of boundary: for example, their skin (a physical boundary) and their interpretation/cognition (an informational boundary). If this were not a human but instead a bacterium, then the boundary I mean would (mostly) be the bacterium’s cell membrane.
Third, imagine a Markov chain that represents that boundary influencing itself over time:
Okay, so we have these three Markov chains running in parallel:
But they also influence each other, so let’s build that into the model, too.
How should they be connected?
Well, how does the environment affect a human?
Ok, so I want you to notice that when an environment affects a human, it doesn’t influence them directly, but instead it influences their skin or their cognition (their boundary), and then their boundary influences them.
For example, I shine light in your eyes (part of your environment), it activates your eyes (part of your boundary), and your eyes send information to your brain (part of your insides).
Which is to say, this is what does not happen:
(This is called “infiltration”.) The environment does not directly influence the human.
Instead, the environment influences the boundary which influences them, which looks like this:
The environment influences your skin and your senses, and your skin and senses influence you.
Okay, now let’s do the other direction. How does a human influence their environment?
It’s not that a human controls the environment directly…
(This is called “exfiltration”; this does not happen.)
…but that the human takes actions (via their boundary), and then their actions affect the environment:
For example, it’s not that the environment “reads your mind” directly, but rather that you express yourself and then others read your words and actions.
Now, putting together both directions, human-influences-environment and environment-influences-human, we get this:
Also, I want you to notice which arrows are conspicuously missing from the diagram above:
Please compare this diagram to the one before it.
So that’s how we can model the approximate causal separation between an agent and the environment.
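The structure described above can be written down as a small edge list. This is only a minimal sketch: the labels H (human), B (boundary), and E (environment) and the assumption that every influence takes exactly one time step are illustrative choices, not something fixed by Critch’s post.

```python
from itertools import product

# Hypothetical labels: H = human (insides), B = boundary, E = environment.
LAYERS = ("H", "B", "E")

# Edges from time t to time t+1 in the idealized model.
# Assumption: every influence takes one time step, as in a Markov chain.
ALLOWED = {
    ("H", "H"), ("B", "B"), ("E", "E"),  # each chain influences itself
    ("E", "B"), ("B", "H"),              # environment -> boundary -> human
    ("H", "B"), ("B", "E"),              # human -> boundary -> environment
}


def missing_edges(allowed):
    """Return the (source, target) pairs that the idealized model forbids."""
    return sorted(set(product(LAYERS, repeat=2)) - allowed)


def classify(edge):
    """Name a forbidden edge using the vocabulary of this post."""
    names = {("E", "H"): "infiltration", ("H", "E"): "exfiltration"}
    return names.get(edge, "unnamed")


for edge in missing_edges(ALLOWED):
    print(edge, classify(edge))
```

Of the nine possible arrows between the three chains, exactly two are absent: the direct environment-to-human arrow and the direct human-to-environment arrow.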
Defining boundary violations
Finally, we can define boundary violations as exactly this:
Boundary violations are infiltration across human Markov blankets.
Leakage and leakage minimization
Of course, in reality, there’s actually leakage and the ‘real’ Markov blanket between any human and their environment does include the arrows I said were missing.
For example, viruses in the air might influence me in ways I can’t control or prevent. Similarly, my brain waves are emanating out into the fields around me.
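To make “leakage” concrete, here is a toy calculation under strong assumptions of my own choosing: every variable is a single fair-coin bit, and a hypothetical `leak` parameter is the probability that the human’s next state copies the environment directly instead of copying the boundary. In this toy model the human’s next state does not depend on their previous state, so the conditioning set is just the boundary bit. The sketch computes the conditional mutual information between the environment and the human’s next state, given the boundary.

```python
from itertools import product
from math import log2


def next_human_prob(h_next, e, b, leak):
    """P(H_{t+1} = h_next | E_t = e, B_t = b) in the toy model.

    With probability 1 - leak the human copies the boundary bit;
    with probability leak it copies the environment bit directly.
    """
    p_one = (1 - leak) * b + leak * e
    return p_one if h_next == 1 else 1 - p_one


def infiltration(leak):
    """Exact conditional mutual information I(H_{t+1}; E_t | B_t), in bits."""
    total = 0.0
    for e, b, h in product((0, 1), repeat=3):
        p_h_given_eb = next_human_prob(h, e, b, leak)
        if p_h_given_eb == 0:
            continue  # zero-probability outcomes contribute nothing
        # Marginalize out the environment bit (a fair coin) to get P(h | b).
        p_h_given_b = 0.5 * next_human_prob(h, 0, b, leak) \
            + 0.5 * next_human_prob(h, 1, b, leak)
        joint = 0.25 * p_h_given_eb  # P(e) * P(b) * P(h | e, b)
        total += joint * log2(p_h_given_eb / p_h_given_b)
    return total
```

With `leak = 0` the boundary screens off the environment completely and the quantity is zero; with `leak = 1` the human simply copies the environment and the quantity is one full bit. Values in between correspond to a blanket that leaks a little.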
However, humans are agents that are actively minimizing that leakage. For example:
You don’t want to be directly controlled by your environment. (You don’t want infiltration.)
Instead, you want to take in information and then be able to decide what to do with it. You want to have a say about how things affect you.
A bacterium wants things to go through its gates and ion channels, and not just pierce its membrane.
If I could cheaply improve my boundary’s immunity to viruses, I would.
Humans are embedded agents (of course). However, humans are also actively seeking to de-embed themselves from the environment and make themselves independent from the environment.
You don’t want the way that you’re influencing the world to be by people mind-reading you. (Exfiltration)
Instead, you want to be affecting the world intentionally, through your actions.
If you believed that someone might be able to predict you well, or get close to predicting you well, and you didn’t want that, you would probably take evasive maneuvers.
Even if this works, how would the AI system detect the Markov blankets?
A few months ago I asked a member of the Causal Incentives group (the authors of the links above) if causal discovery could be used empirically to discover agents in the real world and I remember a vibe of “yeah possibly”. (Though also this didn’t seem like their goal.)
Credits and math
This section was largely based on Andrew Critch’s «Boundaries», Part 3a: Defining boundaries as directed Markov blankets. That post has more technical details, and defines infiltration more rigorously in terms of mutual information.
[Critch also splits the boundary into two components, “active” (~actions) and “passive” (~perceptions). A more thorough version of this post would have split the “B” in the diagrams above into these components, too, but I didn’t think it was necessary to do here.]
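For orientation, here is roughly what such a definition looks like in the simplified three-variable notation of this post. This is a sketch and not Critch’s exact formula: it merges the active and passive components into a single boundary variable, and the symbols $H_t$ (human), $B_t$ (boundary), and $E_t$ (environment) are labels assumed here for illustration. $I(X; Y \mid Z)$ denotes conditional mutual information.

$$
\mathrm{Infiltration} := I\big(H_{t+1};\, E_t \,\big|\, H_t, B_t\big), \qquad
\mathrm{Exfiltration} := I\big(E_{t+1};\, H_t \,\big|\, B_t, E_t\big)
$$

Both quantities are zero exactly when the corresponding “missing arrow” carries no information beyond what already passes through the boundary, and they grow as leakage increases.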
Criticism of Markov blankets as «boundaries» from Abram Demski
I spoke to Abram Demski about the general idea of using Markov blankets to specify «boundaries» and «boundary» violations, and he shared his doubts with me. In the text below, I have attempted to summarize his position in a fictional conversation.
Epistemic status: I wrote a draft of this section, and then Demski made several comments which I then integrated.
Me: «Boundaries»! Markov blankets!
AD: Eh, I think Markov blankets probably aren’t the way to go to represent “boundaries”. Markov blankets can’t follow entities through space as well as you might imagine. Suppose you have a Bayesian network whose variables are the physical properties of small regions in space-time. A Markov blanket is a set of nodes (with specific conditional independence requirements, but that part isn’t important right now). Say your probabilistic model contains a grasshopper who might hop in two different directions, depending on stimuli. Since your model represents both possibilities, you can’t select a Markov blanket to contain all and only the grasshopper. You’re forced to include some empty air (if you include either of the hop-trajectories inside your Markov blanket) or to miss some grasshopper (if you include only space-time regions certain to contain the grasshopper, by only circling the grasshopper before it makes the uncertain jump).
Me: Ah, I see.
Me: Hm, are there any other frameworks that you think might work instead?
AD: I don’t see any currently. (Including Cartesian Frames and Finite Factored Sets. And Active Inference doesn’t work for this as it just makes the same Markov-blanket mistake, IMO.) That said, I don’t see any fundamental reason why this mathematical structure can’t exist. Frankly, I would like to find a mathematical definition of “trying not to violate another agent’s boundary”.
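The grasshopper argument from the conversation above can be made concrete with a tiny example. This is a minimal sketch under assumptions chosen purely for illustration: space-time is a handful of `(time, cell)` nodes, there are only two possible worlds, and the conditional independence requirement on Markov blankets is ignored (as in the conversation, where that part was set aside). The point is only that a blanket is a fixed set of nodes, while the grasshopper occupies different nodes in different worlds.

```python
from itertools import combinations

# Toy space-time grid: a node is a (time, cell) pair. All values are illustrative.
START = (0, 2)  # the grasshopper sits at cell 2 at time 0
WORLDS = {
    "hops_left": {START, (1, 1)},   # nodes occupied if it hops left
    "hops_right": {START, (1, 3)},  # nodes occupied if it hops right
}


def coverage_errors(region):
    """For a fixed node set, report per world what it misses and what empty air it includes."""
    return {
        name: {"missed": occupied - region, "empty_air": region - occupied}
        for name, occupied in WORLDS.items()
    }


def is_exact(region):
    """True if the region contains all and only the grasshopper in every world."""
    return all(
        not err["missed"] and not err["empty_air"]
        for err in coverage_errors(region).values()
    )


certain = set.intersection(*WORLDS.values())  # nodes occupied in every world
possible = set.union(*WORLDS.values())        # nodes occupied in at least one world

# Brute force over every subset of the possibly occupied nodes.
exact_regions = [
    set(nodes)
    for size in range(len(possible) + 1)
    for nodes in combinations(sorted(possible), size)
    if is_exact(set(nodes))
]
```

Here `certain` misses the landing node in both worlds, `possible` includes one cell of empty air in each world, and `exact_regions` comes out empty: no fixed set of nodes tracks the grasshopper in both worlds.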
Setting up an introduction to “membranes”
Even besides the algorithmic difficulties discussed above, I see some other big philosophical problems with «boundaries», too. Mainly: for an agent to use «boundaries», it seems like it must know where every other agent’s «boundaries» are, and I’m not sure that’s practical. (For example, how do I know that my next exhalation won’t release a virus that your immune system is uniquely vulnerable to and that will kill you?)
Instead, I conceive of an alternative approach where AI systems are aware of their own ~«boundaries», but quite imperfectly aware of other agents’ «boundaries». I call this approach “membranes”. In this approach, AI systems would be aware of what belongs to them (what is “inside” of their «boundary» or membrane), and what doesn’t belong to them (what is “outside” of their «boundary» or membrane and may belong to others). I will introduce this more thoroughly in a sequence coming soon.
Subscribe to the boundaries / membranes tag if you’d like to stay updated on this agenda.
or other agents/moral patients we want respected
It’s not clear exactly how to specify the environment a priori, but it should end up roughly being the complement of the human with respect to the rest of the universe.