Embedded Agency: Not Just an AI Problem

Requisite Background: Embedded Agency Sequence


Fast forward a few years, and imagine that we have a complete physical model of an e-coli bacteria. We know every function of every gene, kinetics of every reaction, physics of every membrane and motor. Computational models of the entire bacteria are able to accurately predict responses to every experiment we run.

Biologists say things like “the bacteria takes in information from its environment, processes that information, and makes decisions which approximately maximize fitness within its ancestral environment.” We have strong outside-view reasons to expect that the information processing in question probably approximates Bayesian reasoning (for some model of the environment), and the decision-making process approximately maximizes some expected utility function (which itself approximates fitness within the ancestral environment).

So presumably, given a complete specification of the bacteria’s physics, we ought to be able to back out its embedded world-model and utility function. How exactly do we do that, mathematically? What equations do we even need to solve?

As a computational biology professor I used to work with said, “Isn’t that, like, the entire problem of biology?”


Economists say things like “financial market prices provide the best publicly-available estimates for the probabilities of future events.” Prediction markets are an easy case, but let’s go beyond that: we have massive amounts of price data and transaction data from a wide range of financial markets—futures, stocks, options, bonds, forex… We also have some background general economic data, e.g. Fed open-market operations and IOER rate, tax code, regulatory code, and the like. How can we back out the markets’ implicit model of the economy as a whole? What equations do we need to solve to figure out, not just what markets expect, but markets’ implicit beliefs about how the world works?

Then the other half: aside from what markets expect, what do markets want? Can we map out the (approximate, local) utility functions of the component market participants, given only market data?


Imagine we have a complete model of the human connectome. We’ve mapped every connection in one human brain, we know the dynamics of every cell type. We can simulate it all accurately enough to predict experimental outcomes.

Psychologists (among others) expect that human brains approximate Bayesian reasoning and utility maximization, at least within some bounds. Given a complete model of the brain, presumably we could back out the human’s beliefs, their ontology, and what they want. How do we do that? What equations would we need to solve?


Pull up the specifications for a trained generative adversarial network (GAN). We have all the parameters, we know all the governing equations of the network.

We expect the network to approximate Bayesian reasoning (for some model). Indeed, GAN training is specifically set up to mimic the environment of decision-theoretic agents. If anything is going to precisely approximate mathematical ideal agency, this is it. So, given the specification, how can we back out the network’s implied probabilistic model? How can we decode its internal ontology—and under what conditions do we expect it to develop nontrivial ontological structure at all?