In this article, John reproves, without mentioning it, an information-theoretic version of the second law of thermodynamics, and then tries to apply it to talk about the loss of bits of optimization over a distance. I will first explain the classical result and then explain his attempted generalization. I say “attempted” since (a) I think he misses at least one assumption and (b) it is a bit unclear how to make sense of the result in the physical world.
In the classical setting of the result, one has a Markov chain M0 → M1 → … → Mn of random variables with fixed conditional distributions P(Mi | Mi-1) but unspecified initial distribution P(M0). Let’s call the two initial distributions P(M0 | opt) and P(M0 | ref), like John does in the article. One can then look at the KL divergence between the two distributions at each stage and prove that it never increases along the chain. Why is this a second law of thermodynamics? The reason is that you can take the initial reference distribution to be uniform, put further assumptions on the situation so that all intermediate reference distributions are uniform as well, and use the decreasing KL divergence to conclude that the entropy of the optimized distribution is non-decreasing along the Markov chain. All of this is explained in chapter 4.4 of the classic information theory book by Cover and Thomas, so from that point of view, the article isn’t new. One nice interpretation in John’s context: if the goal of the optimizer is just to produce order (= low entropy), then one should expect that order to dissipate over time: rooms don’t remain tidy.
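To see the classical statement concretely, here is a minimal numerical sketch (the transition matrix and both initial distributions are made-up numbers, not from the article): pushing two distributions through the same Markov kernel and checking that their KL divergence never increases.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits."""
    return float(np.sum(p * np.log2(p / q)))

# Shared transition kernel P(M_{i+1} | M_i); rows sum to 1.
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])

opt = np.array([0.95, 0.05])  # "optimized" initial distribution
ref = np.array([0.40, 0.60])  # "reference" initial distribution

divs = []
for _ in range(6):
    divs.append(kl(opt, ref))
    opt, ref = opt @ T, ref @ T  # push both forward one step

# Data-processing inequality: the divergence never increases along the chain.
assert all(a >= b - 1e-12 for a, b in zip(divs, divs[1:]))
```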
Now, John would like to apply this to Bayesian networks as generalizations of Markov chains, but isn’t very careful in stating the right assumptions. I don’t know what the right assumptions are either, but here is a caveat: in the picture drawn in the article, some of the later intermediate layers causally influence earlier intermediate layers via arrows going to the left. This means that the ordering of the layers is not strictly reflected in the partial ordering of the Bayes net. As an extreme example of where this goes wrong, consider a Markov chain M0 → M2 → M1, i.e., the layer indexed 2 sends an arrow “back” to the layer indexed 1. John’s formulas state that P(M2 | M1, ref) = P(M2 | M1, opt), but is this true? Let’s use Bayes’ rule to figure it out:
P(M2 | M1, ref) = P(M1 | M2, ref) * P(M2 | ref) / (sum_M2’ P(M1 | M2’, ref) * P(M2’ | ref))
= P(M1 | M2, opt) * P(M2 | ref) / (sum_M2’ P(M1 | M2’, opt) * P(M2’ | ref))
!= P(M1 | M2, opt) * P(M2 | opt) / (sum_M2’ P(M1 | M2’, opt) * P(M2’ | opt))
= P(M2 | M1, opt)
The second step uses that the conditional distribution along the arrow M2 → M1 is fixed, so P(M1 | M2, ref) = P(M1 | M2, opt); the third step fails to be an equality in general, since P(M2 | ref) and P(M2 | opt) need not be the same. (Admittedly, I didn’t write down a full example where the inequality is strict, but by playing around with numbers, you should be able to find one.)
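In fact, a concrete counterexample is easy to generate numerically. Here is a minimal sketch (all probability tables are made up by me): binary variables with arrows M0 → M2 → M1, where the two worlds share the conditionals P(M2 | M0) and P(M1 | M2) but differ in the initial distribution on M0.

```python
import numpy as np

# Shared conditionals along the arrows M0 -> M2 and M2 -> M1.
# All variables are binary; the numbers are made up for illustration.
P_M2_given_M0 = np.array([[0.9, 0.1],   # row: value of M0, col: value of M2
                          [0.2, 0.8]])
P_M1_given_M2 = np.array([[0.7, 0.3],   # row: value of M2, col: value of M1
                          [0.4, 0.6]])

def posterior_M2_given_M1(p_M0):
    """P(M2 | M1) induced by an initial distribution p_M0 on M0."""
    p_M2 = p_M0 @ P_M2_given_M0              # marginal of M2
    joint = p_M2[:, None] * P_M1_given_M2    # joint over (M2, M1)
    return joint / joint.sum(axis=0)         # normalize each M1-column

post_ref = posterior_M2_given_M1(np.array([0.5, 0.5]))
post_opt = posterior_M2_given_M1(np.array([0.99, 0.01]))

# The two posteriors differ, so P(M2 | M1, ref) != P(M2 | M1, opt).
assert not np.allclose(post_ref, post_opt)
```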
So, the visualization in the article seems to be wrong, insofar as it is supposed to apply to the result. I think the following assumption is likely sufficient, and probably one could relax this further:
all variables in Mi+1 come after all variables in Mi in the partial ordering implied by the graph.
So much for the assumptions. Let’s assume we have settled them and the theorem is true. The article then makes the following interpretation:
M0 are the actions of an agent.
KL divergence equals bits of optimization.
The result then states that you can’t cause more bits of optimization at a distance than in your vicinity.
My Opinion
There are some things I’m confused about here:
Why measure bits of optimization as a KL divergence between distributions? What is the reference distribution here? Arguably, if one just “shifts a normal distribution far away”, then this doesn’t seem like optimization to me since the number of configurations doesn’t really get smaller in any way; the KL divergence might still be large.
Maybe John just wants to be general in the same way in which the second law of thermodynamics can initially be formulated in terms of KL divergence. Under that interpretation, we might always want to assume the reference distribution to be uniform. The bits of optimization would then be the difference between the entropies of the uniform and the optimized distribution, which is larger for narrower optimized distributions. This seems closer to what I consider to be “optimization”.
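Under a uniform reference on N configurations, this identity is exact: D(P || U) = log2(N) − H(P). A quick numerical check (the distribution P is made up):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

def entropy(p):
    return float(-np.sum(p * np.log2(p)))

N = 8
uniform = np.full(N, 1 / N)
p = np.array([0.5, 0.25, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01])

# Bits of optimization relative to a uniform reference
# equal the entropy shortfall log2(N) - H(p).
assert np.isclose(kl(p, uniform), np.log2(N) - entropy(p))
```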
Is it valid to think of the universe as consisting of just one “large causal graph”?
At the lowest level, this doesn’t seem correct since there doesn’t seem to be causality in fundamental physics (at least, physicists often say this… I’m not one of them)
If we “bundle together” variables, then the assumption of a causal graph seems more plausible, but then it is also less clear to me whether this causal graph can be modeled as independent of changes in an optimizer’s actions. What if, for example, we assign a variable to a bacterium, but whether it splits into two depends on our actions? Or what if I perform actions that change the inner workings of another agent, so that the conditional distributions along the arrows going into that agent differ from those in the counterfactual world? These seem like fundamentally difficult questions, and they give me the impression that this article only scratches the surface.
A final confusion is that KL divergence is not symmetric, meaning that if you switch between optimized and reference distributions back and forth, the bits of optimization will differ. This problem disappears if we just agree to always use the uniform distribution as the reference, but insofar as John also wants to apply it in the general case, I’m unsure how to interpret this asymmetry.
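The asymmetry is easy to see with two made-up distributions:

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

# The two directions of KL divergence disagree in general,
# so "bits of optimization" depends on which role each distribution plays.
assert not np.isclose(kl(p, q), kl(q, p))
```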
Y’know, I figured someone must have already shown this and used it before, but I wasn’t sure where. Didn’t expect it’d already be on my bookshelf. Thanks for making that connection!
This means that the ordering of the layers is not strictly reflected in the partial ordering in the Bayes net. As an extreme example of where this goes wrong, consider the setting of a Markov chain M0 → M2 → M1 in which M2 has an arrow “back” to M1.
I think this part should not matter, so long as we’re not using a do() operator anywhere? For chains specifically, we can reverse all the arrows and still get an equivalent factorization. Not sure off the top of my head how that plays with mixed-direction arrows, but at the very least we should be able to factor the sequence of nested Markov blankets as an undirected chain (since each blanket mediates interactions between “earlier” and “later” blankets), and from there show that it factors as a directed chain.
Not sure off the top of my head how that plays with mixed-direction arrows,
If you have arrows M0 ← M1 → M2, then it is also equivalent to an ordinary chain.
If you have arrows M0 → M1 ← M2, then it becomes inequivalent, due to collider bias. Basically, if you condition on M1, then you introduce dependencies between M0 and M2 (and also everything upstream of M0 and M2).
(I actually suspect collider bias ends up mattering for certain types of abstractions, and I don’t think it has been investigated how in detail.)