I don’t think entropy quite works that way. For notational convenience, let Q(p) denote the entropy of p. Then Q(p) > Q(q) does not mean that q is strictly more informative than p. In other words, it is not the case that there is some total ordering on distributions such that, for any p,q with Q(p) > Q(q), I can get from p to q with Q(p)−Q(q) bits of information. The closest statement you can make would be in terms of KL divergence, but it is important to note that both KL(p||q) and KL(q||p) are positive (and in general unequal), so KL is providing a distance, not an ordering.
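A quick numerical sketch of this point (toy distributions of my choosing, nothing from the original): the entropy gap Q(p)−Q(q) matches neither direction of KL, and the two KL directions disagree with each other.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as a list."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    """KL(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.4]
q = [0.9, 0.1]

print(entropy(p) - entropy(q))  # entropy gap Q(p) - Q(q)
print(kl(p, q))                 # a third, different number
print(kl(q, p))                 # asymmetric: differs from kl(p, q)
```

All three printed values are distinct positive numbers, which is the point: neither KL direction recovers the entropy gap, and KL itself is not symmetric.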
Also note that entropy does not in fact decrease with more information. It decreases in expectation, and even then only relative to the subjective belief distribution. But this isn’t even a particularly special property. Jensen’s inequality together with conservation of expected evidence implies that, instead of Q(p) = E[−log(p(x))], we could have taken any concave function Q over the space of probability distributions, which would include functions of the form Q(p) = E[f(p(x))] as long as 2f′(z) + z·f″(z) ≤ 0 for all z (this is just the condition that z ↦ z·f(z) is concave, since Q(p) = Σ_x p(x)·f(p(x))).
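To make the first sentence concrete, here is a toy Bayesian update (my own numbers) where a particular observation raises entropy, even though entropy falls in expectation: you are nearly sure a coin is a 99%-heads trick coin, and then it comes up tails.

```python
import math

def h(p):
    """Binary entropy in bits of a probability p."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

prior_trick = 0.95                       # P(coin is the 99%-heads trick coin)
p_heads_given = {"trick": 0.99, "fair": 0.5}

def posterior_trick(outcome):
    """P(trick | outcome) by Bayes' rule."""
    lt = p_heads_given["trick"] if outcome == "H" else 1 - p_heads_given["trick"]
    lf = p_heads_given["fair"] if outcome == "H" else 1 - p_heads_given["fair"]
    return prior_trick * lt / (prior_trick * lt + (1 - prior_trick) * lf)

# Subjective probability of heads under the prior:
p_heads = prior_trick * 0.99 + (1 - prior_trick) * 0.5

# Observing tails RAISES entropy (beliefs move toward 50/50)...
print(h(posterior_trick("T")), ">", h(prior_trick))

# ...but the prior-weighted expected posterior entropy is still below the prior entropy:
expected_h = p_heads * h(posterior_trick("H")) + (1 - p_heads) * h(posterior_trick("T"))
print(expected_h, "<", h(prior_trick))
```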
[Proof of the statement about Jensen: Let p2 be the distribution we get from p after updating. Then E[Q(p2) | p] ≤ Q(E[p2 | p]) = Q(p), where ≤ is Jensen applied to the concave function Q, and E[p2 | p] = p by conservation of expected evidence.]
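The claim can also be checked numerically for a non-entropy choice of f. A sketch with randomly generated toy numbers: f(z) = −z gives Q(p) = −Σ p(x)², which satisfies 2f′(z) + z·f″(z) = −2 ≤ 0, and both it and entropy decrease in expectation under a Bayesian update, while the expected posterior equals the prior.

```python
import math
import random

def Q(p, f):
    """Q(p) = E_{x~p}[f(p(x))] for a discrete distribution p given as a list."""
    return sum(px * f(px) for px in p if px > 0)

f_entropy = lambda z: -math.log2(z)  # 2f' + z*f'' = -1/z <= 0
f_gini    = lambda z: -z             # 2f' + z*f'' = -2   <= 0

# Random prior over 3 hypotheses, random likelihoods for a binary observation.
random.seed(0)
weights = [random.random() for _ in range(3)]
total = sum(weights)
prior = [w / total for w in weights]
like1 = [random.random() for _ in range(3)]  # P(obs = 1 | hypothesis)

def posterior(obs):
    lik = like1 if obs == 1 else [1 - x for x in like1]
    w = [pi * li for pi, li in zip(prior, lik)]
    s = sum(w)
    return [wi / s for wi in w]

p_obs1 = sum(pi * li for pi, li in zip(prior, like1))

# Q decreases in expectation for both concave choices of f:
for f in (f_entropy, f_gini):
    exp_Q = p_obs1 * Q(posterior(1), f) + (1 - p_obs1) * Q(posterior(0), f)
    print(exp_Q <= Q(prior, f))

# Conservation of expected evidence: E[posterior] = prior, componentwise.
exp_post = [p_obs1 * a + (1 - p_obs1) * b
            for a, b in zip(posterior(1), posterior(0))]
print(all(abs(e - p) < 1e-9 for e, p in zip(exp_post, prior)))
```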
EDIT: For the interested reader, this is also strongly related to Doob’s martingale convergence theorem, as your beliefs are a martingale and any concave function of them is a supermartingale.