Information theory and the symmetry of updating beliefs

Contents:

1. The beautiful symmetry of Bayesian updating
2. Odds and log odds: a short comparison
3. Further discussion of information

Rationality is all about handling this thing called “information”. Fortunately, we live in an era after the rigorous formulation of Information Theory by C.E. Shannon in 1948, a basic understanding of which can actually help you think about your beliefs, in a way similar but complementary to probability theory. Indeed, it has flourished as an area of research exactly because it helps people in many areas of science to describe the world. We should take advantage of this!

The information theory of events, which I’m about to explain, is about as difficult as high school probability. It is certainly easier than the information theory of multiple random variables (which right now is explained on Wikipedia), even though the equations look very similar. If you already know it, this can be a linkable source of explanations to save you writing time :)

So! To get started, what better way to motivate information theory than to answer a question about Bayesianism?

The beautiful symmetry of Bayesian updating

The factor by which observing A increases the probability of B is the same as the factor by which observing B increases the probability of A. This factor is P(A and B)/(P(A)·P(B)), which I'll denote by pev(A,B) for reasons to come. It can vary from 0 to +infinity, and allows us to write Bayes' Theorem succinctly in both directions:

P(A|B)=P(A)·pev(A,B), and P(B|A)=P(B)·pev(A,B)
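To see the two formulas working in tandem, here is a minimal numeric sketch in Python (the probabilities 0.2, 0.5, and 0.15 are made-up numbers purely for illustration):

    # Hypothetical probabilities, chosen only to illustrate the symmetry.
    p_a, p_b, p_ab = 0.2, 0.5, 0.15

    pev = p_ab / (p_a * p_b)     # mutual probabilistic evidence; here 1.5

    p_a_given_b = p_ab / p_b     # 0.30 = P(A)*pev
    p_b_given_a = p_ab / p_a     # 0.75 = P(B)*pev

    assert abs(p_a_given_b - p_a * pev) < 1e-12
    assert abs(p_b_given_a - p_b * pev) < 1e-12

Observing B multiplies the probability of A by 1.5, and observing A multiplies the probability of B by that same factor.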

What does this symmetry mean, and how should it affect the way we think?

A great way to think of pev(A,B) is as a multiplicative measure of mutual evidence, which I'll call mutual probabilistic evidence to be specific. If pev=1, A and B are independent; if pev>1, they make each other more likely; and if pev<1, they make each other less likely.

But two ways to think are better than one, so I will offer a second explanation, in terms of information, which I often find quite helpful in analyzing my own beliefs:

Probabilistic evidence is related to “mutual information”

Let's examine a simple example, and work our way up to illustrating what I mean:

Say I flip four fair coins with faces "0" and "1" to generate a 4-bit binary string, X. If I tell you that "X=1???", meaning that the first coin reads "1", this reduces the number of possibilities by ½. We'd like to say here that you've gained "1 bit of information". Suppose instead that I say "X begins with 01 or 10". This has quantitatively the same effect, in that it reduces the number of possibilities by ½, so it should also be called "1 bit of information". You might call the first statement an "explicit bit" in that it explicitly specifies a 1 or 0 in the sequence, but this is merely a qualitative distinction. For once, we're interested in quantity, not quality.
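If you like checking such things by brute force, here is a quick Python sketch (my own illustration, not part of the argument) that enumerates all 16 strings and confirms that each statement cuts the possibilities in half:

    from itertools import product

    strings = [''.join(bits) for bits in product('01', repeat=4)]    # all 16 strings

    first_is_1 = [x for x in strings if x[0] == '1']                 # "X=1???"
    starts_01_or_10 = [x for x in strings if x[:2] in ('01', '10')]  # "X begins with 01 or 10"

    print(len(strings), len(first_is_1), len(starts_01_or_10))       # 16 8 8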

Now, let A be the event "X=111?" (the event that the first three bits come up "1", and the last bit can be anything), which has probability P(A) = 2⁻³. If A is true but you don't know it, you need to observe exactly 3 independent bits (e.g. the first 3 coins) to confirm it. Intuitively, this is how uncertain A is, because it tells us how far away we are from confirming A. On the other hand, if I tell you A is true, you now only need 1 more independent bit to fully specify X, so we can say A has "provided 3 bits" of "information". Intuitively, this is how informative A is. These vague ideas nudge us toward the following definition:

The information value of an event

We denote and define the information value of an event A (elsewhere called "surprisal" or "self-information", though I won't use those names in this post) by the formula

inf(A) := log½(P(A)) = -log2(P(A))

which in our example is -log2(2⁻³) = 3 bits, just as we'd like. As was suggested, this quantity has two different intuitive meanings, which by the miracle of logic correspond to the same number inf(A), measured in bits:

1) Uncertainty: How many independent bits are required to confirm that A is true.

2) Informativity: How many independent bits are gained if we are told that A is true.

Caution: information value is not “data”, but rather it is a number that can tell you how uncertain or how informative the data is. Be on the lookout for when “information” means “data”, and when it means “information value.”
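In code, the definition is a one-liner (the helper below is just an illustrative name, not standard library code):

    from math import log2

    def inf(p):
        """Information value, in bits, of an event with probability p."""
        return -log2(p)

    print(inf(2 ** -3))   # 3.0 bits: the uncertainty of A = "X=111?"
    print(inf(2 ** -2))   # 2.0 bits
    print(inf(1 / 2))     # 1.0 bit: one fair coin flip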

Mutual information = informational evidence

Next, let B be the event "X=??11", so P(B) = 2⁻², and recall that A is the event "X=111?". Both A and B tell us that the third position reads "1", which is independent of the other explicit bits they specify. In this sense, there is 1 bit of "redundancy" in observing both A and B. Notice that A provides 3 bits, B provides 2 bits, but "A and B" together specify that "X=1111", which is only 4 bits, and 3+2-4=1. Thus, we can calculate "redundancy" as

inf(A) + inf(B) - inf(A and B),

which is why this expression is called the mutual information of A and B. But wait… taking log2 of the probabilistic evidence pev(A,B)=P(A and B)/(P(A)·P(B)) yields exactly the same expression! So I'll also call it informational evidence, and write

iev(A,B) := log2 pev(A,B) = inf(A) + inf(B) - inf(A and B)
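Both routes to iev(A,B) give the same number for our running example (a sketch, reusing the inf helper from above):

    from math import log2

    def inf(p):
        return -log2(p)

    p_a, p_b, p_ab = 2 ** -3, 2 ** -2, 2 ** -4   # A="X=111?", B="X=??11", "A and B"="X=1111"

    via_inf = inf(p_a) + inf(p_b) - inf(p_ab)    # 3 + 2 - 4 = 1 bit of redundancy
    via_pev = log2(p_ab / (p_a * p_b))           # log2 of the probabilistic evidence

    print(via_inf, via_pev)                      # 1.0 1.0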

While we're at it, let's just take -log2 of the rest of Bayes' theorem and see what we get. We can define conditional information value by letting

inf(A|B) := -log2 P(A|B) = inf(A and B) - inf(B),

and now Bayes’ theorem attains the following form:

inf(A|B) = inf(A) - iev(A,B) ← information theoretic Bayes’ Theorem

In Bayesian updating, A hasn't happened yet, so here let's use our "uncertainty" interpretation of information value. As you can see from the equation, if iev(A,B) is positive, the uncertainty of A decreases upon observing B, meaning A becomes more likely. If it is negative, the uncertainty of A increases, so A becomes less likely. It ranges from -infinity to +infinity, depending on whether A and B completely contradict or completely confirm each other. In summary:

Bayesian updating = subtracting mutual evidence from uncertainty.

This is my other favorite way to think about updating. The fact that evidence can also be thought of as a kind of redundancy or mutual information gives a concrete interpretation for the symmetry of belief updating. As well, since "N bits of information" is so easy to conceptualize as a precise quantity, it gives a quantitative intuitive meaning to "how much A and B support each other". In fact, noticing this is what got me interested in information theory in the first place, and I hope it has piqued your interest, too!
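If you want to see the equation in action, here is a quick numeric check with our coin example (again just a sketch):

    from math import log2

    def inf(p):
        return -log2(p)

    p_a, p_b, p_ab = 2 ** -3, 2 ** -2, 2 ** -4   # A="X=111?", B="X=??11"

    inf_a_given_b = inf(p_ab / p_b)              # inf(A|B) = 2 bits
    iev_ab = inf(p_a) + inf(p_b) - inf(p_ab)     # 1 bit of mutual evidence

    print(inf_a_given_b, inf(p_a) - iev_ab)      # both 2.0: inf(A|B) = inf(A) - iev(A,B)

Observing B spends 1 bit of the 3 bits of uncertainty in A, leaving 2 bits still to be confirmed.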

What kept me interested is the simple fact that informational evidence behaves so nicely:

(Symmetry) iev(A,B) = iev(B,A)

More examples and discussion to boost your familiarity with information value are provided in Section 3, but for now, let's break for a comparison with two other methods of describing Bayesian updating.


Odds and log odds: a short comparison.

(I think these deserve a special mention, because they have already been discussed on LessWrong.com.)

Bayes' theorem can also be expressed fairly neatly using odds with likelihood ratios, and log odds with log likelihood ratios. One shortcoming of using odds when updating is that the likelihood ratio K(B|A) = P(B|A)/P(B|¬A), sometimes called the Bayes factor, is not symmetric in A and B, so it does not make the symmetry of updating obvious. Likewise, log likelihood ratios are not symmetric either.

But odds and log odds have their advantages. For example, if B1 and B2 are conditionally independent given A and conditionally independent given ¬A, then K(B1 and B2|A) = K(B1|A)·K(B2|A), and similarly for any number of B's. These conditions are met naturally when B1 and B2 are causal consequences of A which do not causally influence each other. By contrast, in causal systems, it is usually not the case that pev(A, B1 and B2) = pev(A,B1)·pev(A,B2). (Reading Pearl's "Causality: Models, Reasoning, and Inference" clarified this for me once and for all, by making precise what a "causal system" is.)
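Here is a small sketch of the contrast. All the numbers are made-up assumptions, chosen only to set up a case where B1 and B2 are conditionally independent given A and given ¬A:

    # Hypothetical prior and likelihoods.
    p_a = 0.3
    p_b1_a, p_b1_na = 0.8, 0.2      # P(B1|A), P(B1|not A)
    p_b2_a, p_b2_na = 0.7, 0.1      # P(B2|A), P(B2|not A)

    # Conditional independence given A and given not-A, by construction:
    p_b12_a, p_b12_na = p_b1_a * p_b2_a, p_b1_na * p_b2_na

    # Bayes factors multiply exactly:
    K1, K2, K12 = p_b1_a / p_b1_na, p_b2_a / p_b2_na, p_b12_a / p_b12_na
    print(K12, K1 * K2)             # 28.0 28.0

    # Probabilistic evidence pev(A,B) = P(B|A)/P(B) does not multiply:
    def pev(p_b_a, p_b_na):
        p_b = p_a * p_b_a + (1 - p_a) * p_b_na      # P(B) by total probability
        return p_b_a / p_b

    print(pev(p_b12_a, p_b12_na), pev(p_b1_a, p_b1_na) * pev(p_b2_a, p_b2_na))   # ~3.08 vs ~5.26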


Further discussion of information

In our excitement to get to an information theoretic Bayes' theorem, we glossed over a lot of opportunities to stop and reflect, so let's do some more of that here.

Information vs “data” or “knowledge”

C.E. Shannon originally used the full phrase “information value”, but nowadays it is often shortened to “information”. As mentioned, information is not a synonym for “data” or “knowledge” when used in this way.

It may help to analogize this with how "mass" is not "matter". If I place 2 grams of matter on the left side of a balance scale, and 3 grams on the right, it will tip to the right, because 3g-2g=1g>0g. Where is this 1 gram of matter? Which "1 gram of matter" is the matter that tips the scales? The question is meaningless, because the 1g doesn't refer to any matter in particular, just a difference in total amounts. But you can ask "how much mass does this matter have?", and likewise "how much information does this data have?".

Why “the redundant information” doesn’t make sense

When iev(A,B) is positive, we spoke of the mutual information of A and B as "redundancy". But what is this redundant information? What does it say? Again, this is the "information value is data" fallacy producing ill-posed questions. It's somewhat like asking which gram of matter should be removed from the scale above in order to balance it. To illustrate more precisely, suppose again that A says "X=111?" and B says "X=??11". If R is the event "X=??1?", it is tempting to call R "the mutual information" of A and B. Indeed, if we first observe R, then A and B become independent, so there is no more redundancy. But this R is not unique. Any list of 8 outcomes that includes the A outcomes and the B outcomes would also work this way. For example, we could take R to say "X is one of 0011, 0111, 1011, 1110, 1111, 1000, 0100, 0001".
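If you'd like to verify that claim by brute force, here is a sketch (my own check, not part of the argument above) that tests both choices of R:

    from itertools import product

    strings = {''.join(bits) for bits in product('01', repeat=4)}
    A = {x for x in strings if x.startswith('111')}      # "X=111?"
    B = {x for x in strings if x.endswith('11')}         # "X=??11"

    def independent_given(R):
        """Are A and B independent once we condition on R?"""
        pa, pb, pab = len(A & R) / len(R), len(B & R) / len(R), len(A & B & R) / len(R)
        return abs(pab - pa * pb) < 1e-12

    R1 = {x for x in strings if x[2] == '1'}              # "X=??1?"
    R2 = {'0011', '0111', '1011', '1110', '1111', '1000', '0100', '0001'}

    print(independent_given(R1), independent_given(R2))   # True True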

To infinity and… well, just to infinity.

We saw that information value inf(A) ranges from 0 to +infinity, and can be interpreted either as informativity or uncertainty, depending on whether the event has happened or not. Let’s think a little about the extremes of this scale:

That 0 information value corresponds to a 100%-likely event means:

1) 0 informativity: you don’t gain any information from observing an event that you already knew was certain (ignoring 0%-likely discrepancies), and

2) 0 uncertainty: you don't require any information to verify an event that is certain to occur.

That +infinity information value corresponds to a 0%-likely event means:

1) infinite uncertainty: no finite amount of information can convince you of a 0%-likely event (though perhaps an infinite series of tests could bring you arbitrarily close), and

2) infinite informativity: if you observe a 0%-likely event, you might win a Nobel prize (it means someone messed up by having a prior belief of 0% somewhere when they shouldn’t have).

For the values in between, more likely = less uncertain = less informative, and less likely = more uncertain = more informative.

What other cool stuff can happen?

To get more comfortable with how information values work, let’s return to our random 4-bit string X, generated by flipping four coins:

Encoding. Let C be the event "X contains exactly one 1", i.e. X=1000, 0100, 0010, or 0001. This happens with probability 4/16 = 1/4 = 2⁻², so inf(C) = 2 bits. If C is true, it provides 2 bits of information about X, and using an additional 2 bits we could encode the position of the "1" by writing "first"=00, "second"=01, "third"=10, and "fourth"=11. Thus we end up using 4 bits in total to specify or "encode" X, as we'd expect. In general, there are theorems characterizing information entirely in terms of encoding/decoding, which is part of what makes it so useful in applications.
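A tiny sketch of that bookkeeping (the particular position code is just one illustrative choice):

    from math import log2

    p_c = 4 / 16                          # C: "X contains exactly one 1"
    print(-log2(p_c))                     # 2.0 bits: inf(C)

    # Given C, two more bits pick out which position holds the "1":
    position_code = {'1000': '00', '0100': '01', '0010': '10', '0001': '11'}
    # 2 bits (for C) + 2 bits (for the position) = 4 bits, enough to pin down X exactly.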

Negative evidence. Let D be the event "X starts with 1", which one sees directly as specifying inf(D) = 1 bit of information. It is easy to see that P(D)=1/2 and P(D|C)=1/4, so we know C makes D less likely (and vice versa, by update symmetry!), but let's practice thinking in terms of information. Together, "C and D" just means X=1000, so inf(C and D) = 4 bits: it completely determines X. On the other hand, we saw that inf(C)=2 bits, and inf(D)=1 bit, so iev(C,D) = 2+1-4 = -1, confirming that either of them would present negative evidence for the other.
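Checked by enumeration (again just a sketch):

    from itertools import product
    from math import log2

    strings = {''.join(bits) for bits in product('01', repeat=4)}
    C = {x for x in strings if x.count('1') == 1}     # exactly one "1"
    D = {x for x in strings if x[0] == '1'}           # starts with "1"

    def inf(event):
        return -log2(len(event) / 16)

    print(inf(C), inf(D), inf(C & D))                 # 2.0 1.0 4.0
    print(inf(C) + inf(D) - inf(C & D))               # -1.0 bits: mutual evidence is negative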

Non-integer information values. Being defined as a logarithm, in real life information values are usually not integers, just like probabilities are not usually simple fractions. This is not actually a problem, but reflects a flexibility of the definition. For example, consider the event ¬B: "X does not end with 11", which has probability 3/4, hence inf(¬B) = -log2(3/4) ≈ 0.415 bits. If we also knew "X does not start with 11", that would give us another 0.415 bits of information (since it's independent!). All of our formulae work just fine with non-integer information values, so we can add these to conclude we have 0.830 bits. This being less than 1 means we still haven't constrained the number of possibilities as much as knowing a single bit for certain (i.e. down to 50%). Indeed, 9 out of the 16 possibilities neither start nor end with 11.
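One last sketch, confirming that the fractional bits add up as claimed:

    from itertools import product
    from math import log2

    strings = {''.join(bits) for bits in product('01', repeat=4)}
    not_end_11 = {x for x in strings if not x.endswith('11')}       # probability 12/16 = 3/4
    not_start_11 = {x for x in strings if not x.startswith('11')}   # also 3/4, and independent

    print(-log2(3 / 4))                       # ~0.415 bits from each observation
    print(-2 * log2(3 / 4))                   # ~0.830 bits from both together
    print(len(not_end_11 & not_start_11))     # 9 of the 16 possibilities remain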

Okay, but is this anything more than a cool trick with logarithms?

Yes! This definition of information has loads of real-world applications that legitimize it as a scientific quantity of interest:

* Communication (bandwidth = information per second),

* Data compression (information = how ‘incompressible’ the output of a data source is),

* Statistical mechanics and physics (entropy = average uncertainty = expected informativity of observing a system),

and of course, artificial intelligence.

Where can I read more?

Eliezer has written a number of posts which involve the information theory of handling multiple random variables at once. So, if you want to learn more about it, Wikipedia is currently a decent source. The general philosophy is to take expected values of the quantities defined here to obtain analogues for random variables, so you’re already half-way there.

For something more coherent and in-depth, A Mathematical Theory of Communication, Shannon (1948), which is credited with pioneering modern information theory, impressively remains a fantastic introduction to the subject. Way to go, Shannon!

There’s lots more good stuff to learn in that paper alone, but I’ll end this post here. What I think is most relevant to LessWrong readers is the awareness of a precise definition of information, and that it can help you think about beliefs and Bayesianism.