Against Time in Agent Models

When programming distributed systems, we always have many computations running in parallel. Our servers handle multiple requests in parallel, perform read and write operations on the database in parallel, etc.

The prototypical headaches of distributed programming involve multiple processes running in parallel, each performing multiple read/write operations on the same database fields. Maybe some database field says “foo”, and process 1 overwrites it with “bar”. Process 2 reads the field—depending on the timing, it may see either “foo” or “bar”. Then process 2 does some computation and writes another field—for instance, maybe it sees “foo” and writes {“most_recent_value”: “foo”} to a cache. Meanwhile, process 1 overwrote “foo” with “bar”, so it also overwrites the cache with {“most_recent_value”: “bar”}. But these two processes are running in parallel, so these operations could happen in any order—including interleaving. For instance, the order could be:

  1. Process 2 reads “foo”

  2. Process 1 overwrites “foo” with “bar”

  3. Process 1 overwrites the cache with {“most_recent_value”: “bar”}

  4. Process 2 overwrites the cache with {“most_recent_value”: “foo”}

… and now the cached value no longer matches the value in the database; our cache is broken.
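To make the race concrete, here’s a minimal Python sketch. The db and cache dicts are toy stand-ins for a real database and cache, and the sleep call is there only to force the bad interleaving, which a real scheduler may or may not produce on any given run:

```python
import threading
import time

db = {"field": "foo"}   # toy stand-in for the database
cache = {}              # toy stand-in for the cache

def process_1():
    db["field"] = "bar"                  # overwrite "foo" with "bar"
    cache["most_recent_value"] = "bar"   # then update the cache

def process_2():
    value = db["field"]                  # may see "foo" or "bar"
    time.sleep(0.01)                     # widen the race window for the demo
    cache["most_recent_value"] = value   # write a possibly-stale value

t2 = threading.Thread(target=process_2)
t1 = threading.Thread(target=process_1)
t2.start(); t1.start()
t1.join(); t2.join()

# With the bad interleaving: db says "bar", cache says "foo".
print(db["field"], cache["most_recent_value"])
```

Run this and it will typically print bar foo: the cached value no longer matches the database, exactly as in the step-by-step trace above.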

One of the main heuristics for thinking about this sort of problem in distributed programming is: there is no synchronous time. What does that mean?

Well, in programming we often picture a “state-update” model: the system has some state, and at each timestep the state is updated. The update rule is a well-defined function of the state; every update happens at a well-defined time. This is how each of the individual processes works in our example: each executes two steps in a well-defined order, and each step changes the state of the system.

Single process: a well-defined sequence of steps, each updating the state.
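In code, the state-update picture is just a loop. A minimal sketch (the names run and update are illustrative, not from any particular library):

```python
# State-update model: one global state, updated by a well-defined rule
# at each discrete timestep.
def run(initial_state, update, n_steps):
    state = initial_state
    for t in range(n_steps):    # every update happens at a well-defined time t
        state = update(state, t)
    return state
```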

But with multiple processes in parallel, this state-update model no longer works. In our example, we can diagram our two processes like this:

Each process has its own internal “time”: the database read/write happens first, and the cache overwrite happens second. But between processes, there is no guaranteed time-ordering. For instance, the first step of process 1 could happen before all of process 2, in between the steps of process 2, or after all of process 2.

Two processes: many possible time-orderings of the operations.
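It’s easy to enumerate those possible orderings directly. A quick sketch (the step labels are illustrative):

```python
from itertools import permutations

# Four steps, two per process. Each process's internal order must be
# respected, but there is no ordering constraint between the processes.
steps = [("P1", 1), ("P1", 2), ("P2", 1), ("P2", 2)]

def respects_internal_order(order):
    return (order.index(("P1", 1)) < order.index(("P1", 2))
            and order.index(("P2", 1)) < order.index(("P2", 2)))

valid = [order for order in permutations(steps) if respects_internal_order(order)]
print(len(valid))  # 6 possible time-orderings, including the interleaved ones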

We cannot accurately represent this system as executing along one single time-dimension. Proof:

  • Step 1 of process 1 is not guaranteed to happen either before or after step 1 of process 2; at best we could represent them as happening “at the same time”.

  • Step 2 of process 1 is also not guaranteed to happen either before or after step 1 of process 2; at best we could represent them as happening “at the same time”.

  • … but step 2 of process 1 is unambiguously after step 1 of process 1 in time, so the two steps can’t happen at the same time.

To represent this sort of system accurately, it must be possible for one step to be unambiguously after another, even though both are neither before nor after some third step.

The “most general” data structure to represent such a relationship is not a one-dimensional “timeline” (i.e. a total order), but rather a directed acyclic graph (i.e. a partial order). That’s how time works in distributed systems: it’s a partial order, not a total order. A DAG, not a timeline. That DAG goes by many different names—including computation DAG, computation circuit, or causal model.
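In this picture, “a happens before b” just means there is a directed path from a to b in the DAG. A minimal sketch of our two-process example (event labels are illustrative):

```python
# Time as a partial order: events are nodes in a DAG, and "a happens
# before b" means b is reachable from a along directed edges.
edges = {
    ("P1", 1): [("P1", 2)],  # each process's steps are ordered internally...
    ("P1", 2): [],
    ("P2", 1): [("P2", 2)],  # ...but no edges connect the two processes
    ("P2", 2): [],
}

def happens_before(a, b):
    return b in edges[a] or any(happens_before(c, b) for c in edges[a])

print(happens_before(("P1", 1), ("P1", 2)))  # True: unambiguously ordered
print(happens_before(("P1", 1), ("P2", 1)))  # False: neither before...
print(happens_before(("P2", 1), ("P1", 1)))  # False: ...nor after
```

This is exactly the structure the proof above demands: two events each incomparable with a third, yet unambiguously ordered with respect to each other.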

Beyond Distributed Programming

The same basic idea carries over to distributed systems more generally—i.e. any system physically spread out in space, with lots of different stuff going on in parallel. In a distributed system, “time” is a partial order, not a total order.

In the context of embedded agents: we want to model agenty systems which are “made of parts”, i.e. the agent is itself a system physically spread out in space with lots of different stuff going on in parallel. Likewise, the environment is made of parts. Both are distributed systems.

This is in contrast to state-update models of agency. In a state-update model, the environment has some state, the agent has some state, and at each timestep their states update. The update rule is a well-defined function of state; every update happens at a well-defined time.

Instead of the state-update picture, I usually picture an agent and its environment as a computation DAG (aka circuit aka causal model), where each node is a self-contained local computation. We carve off some chunk of this DAG to call “the agent”.

The obvious “Cartesian boundary”—i.e. the interface between agent and environment—is just a Markov blanket in the DAG (i.e. a cut which breaks the graph into two parts). That turns out to be not-quite-right, but it’s a good conceptual starting point.
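As a sketch of the carving itself (the node names here are purely illustrative, and a real Markov blanket involves more than an arbitrary graph cut, per the caveat above):

```python
# Carve a computation DAG into "agent" and "environment", then read off
# the interface: the edges that cross the cut.
edges = [("sensor", "belief"), ("belief", "action"),
         ("action", "world"), ("world", "sensor")]
agent = {"belief", "action"}  # the chunk of the DAG we call "the agent"

# An edge crosses the boundary iff exactly one endpoint is inside the agent.
interface = [(u, v) for (u, v) in edges if (u in agent) != (v in agent)]
print(interface)  # [('sensor', 'belief'), ('action', 'world')]
```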

Main takeaway: computation DAGs let us talk about agents without imposing a synchronous notion of “time” or “state updates”, so we can play well with distributed systems.