A caution here...
Shannon’s information theory is indeed a great intellectual achievement, and is enormously useful in the fields of data compression and error-correcting codes, but there is a tendency for people to try to apply it beyond the areas where it is useful.
Some of these applications are ok but not essential. For instance, some people like to view maximum likelihood estimation of parameters as minimizing relative entropy. If that helps you visualize it, that’s fine. But it doesn’t really add anything beyond directly visualizing the maximization of the likelihood function. Mutual information can sometimes be a helpful thing to think about. But the deep theorems of Shannon on things like channel capacity don’t really play a role here.
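To spell out that equivalence (a standard identity, with notation I’m introducing here: \hat p is the empirical distribution of data x_1, ..., x_n, and q_\theta is the model):

```latex
D_{\mathrm{KL}}(\hat p \,\|\, q_\theta)
  \;=\; \sum_x \hat p(x) \log \frac{\hat p(x)}{q_\theta(x)}
  \;=\; -H(\hat p) \;-\; \frac{1}{n} \sum_{i=1}^{n} \log q_\theta(x_i)
```

Since H(\hat p) does not depend on \theta, minimizing the relative entropy over \theta is exactly maximizing the log-likelihood; nothing deeper than the definitions is involved.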
Other applications seem to me to be self-deception, in which the glamour of Shannon’s achievement conceals that there’s really no justification for some supposed application of it.
Some of Jaynes’ work is in this category. One example is his (early? he may have later abandoned it...) view that “ignorance” should be expressed by a probability distribution that maximizes entropy, subject to constraints on the observed expectations of certain functions. This is “not even wrong”. Jaynes viewed the probability distributions as being subjective (i.e., possibly differing between people). But he viewed the observed expectations as being objective. This is incoherent. It’s also almost never relevant in practice.
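For reference, the recipe itself has a clean closed form (the standard Lagrange-multiplier result, not anything specific to Jaynes’ writings): maximizing entropy subject to constraints E[f_k(X)] = c_k yields an exponential-family distribution,

```latex
p(x) \;=\; \frac{1}{Z(\lambda)} \exp\!\Big( \sum_k \lambda_k f_k(x) \Big),
\qquad
Z(\lambda) \;=\; \sum_x \exp\!\Big( \sum_k \lambda_k f_k(x) \Big)
```

with the \lambda_k chosen so that the constraints hold. The objection here is not to this math, which is fine, but to the claim that this is how ignorance should be represented.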
The idea seems to have come about by thinking of statistical physics, in which, although in theory measurements of quantities such as temperature are random, in practice the number of molecules involved is so enormous that the temperature is in effect a well-defined number, representing an expectation with respect to the distribution over states of the system.

It is assumed that this somehow generalizes to thought experiments such as “suppose that you know that the expected value from rolling a loaded die is 3.27; what should you use as the distribution over possible dice rolls...”. But how could I possibly know that the expected value is 3.27 when I don’t know the distribution? And if I did (e.g., I recorded the results of many rolls, giving me a good idea of the distribution, but then lost all my records except the average), why would I use the maximum entropy distribution? There’s just no actual justification for this. The Bayesian procedure would be to define a prior distribution over distributions, condition on the expected value being 3.27, and then take the average distribution over the resulting posterior distribution of distributions. There’s no reason to think the result of this is the maximum entropy distribution.
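To make the die example concrete, here is a minimal numerical sketch. The maxent side is exact. For the Bayesian side I assume a uniform Dirichlet prior over distributions (my choice, purely for illustration) and condition approximately, ABC-style, on the mean being near 3.27, since conditioning on an exact expected value is a measure-zero event:

```python
import numpy as np

faces = np.arange(1, 7)

def maxent_die(target):
    """Max-entropy pmf on {1,...,6} with mean `target`: p(k) proportional to
    exp(lam*k). The mean is increasing in lam, so bisection on lam works."""
    def mean(lam):
        w = np.exp(lam * faces)
        return (faces * w).sum() / w.sum()
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean(mid) < target else (lo, mid)
    w = np.exp(lo * faces)
    return w / w.sum()

def bayes_die(target, tol=0.01, n=200_000, seed=0):
    """Posterior-mean pmf under a uniform Dirichlet prior on the simplex,
    conditioning approximately (ABC-style) on the mean being near `target`."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.ones(6), size=n)   # draws from the prior
    means = p @ faces
    kept = p[np.abs(means - target) < tol]  # approximate conditioning
    return kept.mean(axis=0)                # average over the posterior

print("maxent:", np.round(maxent_die(3.27), 4))
print("bayes: ", np.round(bayes_die(3.27), 4))
```

The two outputs differ, which illustrates the point: nothing in the Bayesian calculation singles out the maximum entropy distribution (it would take a rather special prior to make them agree).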
Thank you for this context & perspective. I found this quite helpful.