In this post Jan Kulveit calls for creating a theory of “hierarchical agency”, i.e. a theory that talks about agents composed of agents, which might themselves be composed of agents etc.
The post takes the form of a dialogue between Kulveit and Claude (the AI). I don’t like this format. I think dialogues are a bad format in general: disorganized and not skimming-friendly. The only case where IMO dialogues are defensible is when it’s a real dialogue: real people with different world-views trying to bridge and/or argue out their differences.
Now, about the content. I agree with Kulveit that multi-agency is important. I’m not entirely sold on the importance of the “hierarchical agency” frame, but I agree that “when can a system of agents be regarded as a single agent?” seems like a question that should be answered. At the very least, a certain type of answer to this question might relieve us of the need to worry about “what if humans are multi-agents?” in the context of alignment (because, arguably, it would then be possible to just regard humans as uni-agents anyway).
After reading this post, I came up with the following (extremely simplistic) toy model for hierarchical agency.
Let D be the set of possible decisions an agent can make, W the set of “possible worlds”, O the set of “outcomes” and G:D×W→ΔO the (known) process by which outcomes are generated. Then, an agent that makes decision d∗∈D can be ascribed the belief ζ:D→ΔW and the utility function u:O→R when d∗ is the unique maximizer of E_{w∼ζ(d), o∼G(d,w)}[u(o)] over d. Some decisions cannot be ascribed “intent” at all: for example, if |W|=1 then d∈D can be ascribed intent iff G(d) is an exposed point of the convex hull of the image of G. (See also)
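To make the |W|=1 criterion concrete, here is a minimal sketch (my own illustration, not from the post) of how one might test it for finite D and O: a point G(d) is exposed iff some utility vector u makes it the strict maximizer of expected utility among the image points, which is a linear feasibility problem. The function name and the representation of G as a dict of probability vectors are made up for the example.

```python
import numpy as np
from scipy.optimize import linprog

def can_be_ascribed_intent(image, d):
    """image: dict mapping each decision to a distribution over outcomes
    (a probability vector); d: the decision to test. Returns True iff
    image[d] is an exposed point of the convex hull of the image of G,
    i.e. the |W| = 1 intent criterion. (Decisions whose distribution
    coincides with image[d] are not counted as competitors.)"""
    p_star = np.asarray(image[d], dtype=float)
    others = [np.asarray(p, dtype=float) for _, p in image.items()
              if not np.allclose(p, p_star)]
    if not others:  # the image is a single point: trivially exposed
        return True
    # Look for u with u·(p_star - p) >= 1 for every other image point p.
    # Strict separation is scale-invariant, so ">= 1" suffices.
    A_ub = np.array([-(p_star - p) for p in others])
    b_ub = -np.ones(len(others))
    res = linprog(c=np.zeros(len(p_star)), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * len(p_star))
    return res.success
```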
We can now consider a system of n∈N agents with decision sets D_1…D_n and a process G:∏_i D_i×W→ΔO. For each set A⊆{1…n}, we can define D_A:=∏_{i∈A}D_i and W_A:=∏_{i∉A}D_i×W, and then G_A:D_A×W_A→ΔO is defined in the obvious way. We can then ask which agent sets A have “collective intent” and which don’t.
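For concreteness, the reindexing into G_A could look as follows (again just a sketch with a made-up representation of G): the decisions of agents outside A are simply absorbed into the “world” of the sub-system.

```python
from itertools import product

def restrict_to(G, decision_sets, worlds, A):
    """Build G_A from G, given as a dict mapping (d_1, ..., d_n, w) to a
    distribution over outcomes. A is a set of agent indices; the decisions
    of agents outside A become part of the world W_A of the sub-system."""
    n = len(decision_sets)
    inside = sorted(A)
    outside = [i for i in range(n) if i not in A]
    G_A = {}
    for joint in product(*decision_sets):
        for w in worlds:
            d_A = tuple(joint[i] for i in inside)        # element of D_A
            w_A = (tuple(joint[i] for i in outside), w)  # element of W_A
            G_A[(d_A, w_A)] = G[joint + (w,)]
    return G_A
```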
To give a simple example, let D_1=D_2={H,T,coin}, |W|=1, O={H,T}² and G defined in the obvious way, where coin corresponds to flipping a fair coin to decide between H and T. Then, if both agents choose coin they have intent individually (we can think of them as playing matching pennies, with each agent believing the other one’s action depends on their own action), but not collectively. On the other hand, if they both play pure strategies then they have collective intent as well.
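Plugging this example into the sketch above (treating the pair as a single decision-maker with decision set D_1×D_2, so that |W_A|=1 for A={1,2}): the joint decision (coin, coin) maps to the uniform distribution over {H,T}², which is the average of the four pure joint decisions’ distributions and hence not an exposed point, while any pure joint decision maps to a vertex of the simplex and is exposed. The code below reuses the hypothetical can_be_ascribed_intent function from the earlier sketch.

```python
import numpy as np

# Outcomes ordered as (H,H), (H,T), (T,H), (T,T); distributions as vectors.
def dist(d1, d2):
    marginal = {"H": [1, 0], "T": [0, 1], "coin": [0.5, 0.5]}
    return np.outer(marginal[d1], marginal[d2]).flatten()

decisions = ["H", "T", "coin"]
image = {(d1, d2): dist(d1, d2) for d1 in decisions for d2 in decisions}

print(can_be_ascribed_intent(image, ("coin", "coin")))  # False: no collective intent
print(can_be_ascribed_intent(image, ("H", "H")))        # True: collective intent
```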
Extending this into a theory that fully engages with all the relevant aspects of the problem would require incorporating infra-Bayesianism, Formal Computational Realism, possibly some form of the Algorithmic Descriptive Agency Measure etc., and, more generally, first developing the theory of uni-agents. But maybe starting from the “hierarchical agency” end can be useful as well.