Beliefs at different timescales
Why is a chess game the opposite of an ideal gas? On short timescales an ideal gas is described by elastic collisions. And a single move in chess can be modeled by a policy network.
The difference is in long timescales: If we simulated elastic collisions for a long time, we’d end up with a complicated distribution over the microstates of the gas. But we can’t run simulations for a long time, so we have to make do with the Boltzmann distribution, which is a lot less accurate.
Similarly, if we rolled out our policy network to get a distribution over chess game outcomes (win/loss/draw), we’d get the distribution of outcomes under self-play. But if we’re observing a game between two players who are stronger than us, we have access to a more accurate model based on their Elo ratings.
Can we formalize this? Suppose we’re observing a chess game. Our beliefs about the next move are conditional probabilities of the form $P_1(x_{k+1} \mid x_0 \cdots x_k)$, and our beliefs about the next $n$ moves are conditional probabilities of the form $P_n(x_{k+1} \cdots x_{k+n} \mid x_0 \cdots x_k)$. We can transform beliefs of one type into the other using the operators
$$(\Pi_n P_1)(x_{k+1} \cdots x_{k+n} \mid x_0 \cdots x_k) := \prod_{i=0}^{n-1} P_1(x_{k+i+1} \mid x_0 \cdots x_{k+i})$$

$$(\Sigma_n P_n)(x_{k+1} \mid x_0 \cdots x_k) := \sum_{x_{k+2}} \cdots \sum_{x_{k+n}} P_n(x_{k+1} \cdots x_{k+n} \mid x_0 \cdots x_k)$$
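These two operators are easy to sketch concretely. The toy implementation below is my own illustration, assuming a hypothetical two-symbol move set and a made-up "sticky" one-step model: `rollout` plays the role of $\Pi_n$, building an $n$-step distribution as a product of one-step terms, and `marginalize` plays the role of $\Sigma_n$, summing out everything after the next move:

```python
import itertools
from typing import Callable, Dict, Tuple

# Hypothetical two-symbol "move" alphabet; a stand-in for actual chess moves.
MOVES = (0, 1)

History = Tuple[int, ...]

def p1(move: int, history: History) -> float:
    """Made-up one-step model P1: repeat the last move with probability 0.8."""
    if not history:
        return 0.5
    return 0.8 if move == history[-1] else 0.2

def rollout(model: Callable[[int, History], float],
            history: History, n: int) -> Dict[History, float]:
    """(Pi_n P1): distribution over the next n moves, as a product of one-step terms."""
    dist: Dict[History, float] = {}
    for future in itertools.product(MOVES, repeat=n):
        prob, h = 1.0, history
        for move in future:
            prob *= model(move, h)  # one factor P1(x_{k+i+1} | x_0 ... x_{k+i})
            h = h + (move,)
        dist[future] = prob
    return dist

def marginalize(pn: Dict[History, float]) -> Dict[int, float]:
    """(Sigma_n Pn): distribution over just the next move, summing out the later ones."""
    marginal = {m: 0.0 for m in MOVES}
    for future, prob in pn.items():
        marginal[future[0]] += prob
    return marginal
```

For instance, `marginalize(rollout(p1, (0,), 3))[0]` gives back `0.8`, matching `p1(0, (0,))`: when the $n$-step model is itself the rollout, marginalizing it recovers the one-step model exactly.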
If we’re logically omniscient, we’ll have $\Pi_n P_1 = P_n$ and $\Sigma_n P_n = P_1$. But in general we will not. A chess game is short enough that $\Pi_n$ is easy to compute, but $\Sigma_n$ is too hard because it has exponentially many terms. So we can have a long-term model $P_n$ that is more accurate than the rollout $\Pi_n P_1$, and a short-term model $P_1$ that is less accurate than $\Sigma_n P_n$. This is a sign that we’re dealing with an intelligence: We can predict outcomes better than actions.
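The cost asymmetry can be checked by brute force. The sketch below again assumes a hypothetical binary move set and a made-up one-step model: computing $\Sigma_n (\Pi_n P_1)$ does recover $P_1$ exactly, but only by summing one term per possible future, and the number of terms grows exponentially in $n$:

```python
import itertools

MOVES = (0, 1)  # hypothetical binary move set, for illustration only

def p1(move, history):
    """Made-up one-step model: repeat the last move with probability 0.8."""
    return 0.5 if not history else (0.8 if move == history[-1] else 0.2)

def sigma_of_rollout(history, n):
    """Compute Sigma_n (Pi_n P1) by brute force, summing over every length-n future."""
    marginal = {m: 0.0 for m in MOVES}
    terms = 0
    for future in itertools.product(MOVES, repeat=n):  # len(MOVES)**n futures
        prob, h = 1.0, history
        for move in future:
            prob *= p1(move, h)
            h = h + (move,)
        marginal[future[0]] += prob
        terms += 1
    return marginal, terms
```

Here `sigma_of_rollout((1,), 4)` returns the exact one-step probabilities `{1: 0.8, 0: 0.2}` after summing `2**4 = 16` terms; the same sum at chess-game lengths is hopeless, which is why the long-term model $P_n$ gets to be more accurate than anything we can derive from $P_1$.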
If instead of a chess game we’re predicting an ideal gas, the relevant timescales are so long that we can’t compute $\Pi_n$ or $\Sigma_n$. Our long-term thermodynamic model $P_n$ is less accurate than a simulation $\Pi_n P_1$ would be. This is often a feature of reductionism: Complicated things can be reduced to simple things that can be modeled more accurately, although more slowly.
In general, we can have several models at different timescales, with $\Pi$ and $\Sigma$ operators connecting all the levels. For example, we might have a short-term model describing the physics of fundamental particles; a medium-term model describing a person’s motor actions; and a long-term model describing what that person accomplishes over the course of a year. The medium-term model will be less accurate than a rollout of the short-term model, and the long-term model may be more accurate than a rollout of the medium-term model if the person is smarter than us.