The Radon-Nikodym theorem allows us to define the Kullback-Leibler divergence in a much more general setting. Even though this setting is more abstract, it makes clear how the KL-divergence serves as a notion of relative entropy: the abstract setting leaves essentially only one correct definition.
Suppose that f is the probability density function of a random variable supported on [0,1]. Then we define the continuous entropy of f as $\int_0^1 -f(x)\cdot\log(f(x))\,dx$.
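As a quick sanity check, the continuous entropy can be approximated numerically. The following is a minimal sketch using a midpoint Riemann sum, with two hypothetical example densities (the uniform density, whose entropy is 0, and f(x) = 2x, whose entropy is 1/2 − log 2).

```python
import math

def continuous_entropy(f, n=100_000):
    """Approximate the continuous entropy of a density f on [0, 1]
    by a midpoint Riemann sum of -f(x) * log(f(x))."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        fx = f(x)
        if fx > 0:  # the integrand 0 * log(0) is taken to be 0
            total -= fx * math.log(fx) * h
    return total

# The uniform density f = 1 on [0, 1] has continuous entropy 0.
print(continuous_entropy(lambda x: 1.0))

# The density f(x) = 2x has continuous entropy 1/2 - log 2 ≈ -0.193.
print(continuous_entropy(lambda x: 2.0 * x))
```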
Suppose that we are given two probability measures μ,ν on the Borel subsets of [0,1], where ν is the Lebesgue probability measure on [0,1] and μ has density f. Is there any way to define the continuous entropy of f abstractly, referring only to the measures μ,ν without referring to f itself? Yes there is. The Radon-Nikodym theorem allows us to recover f from the measure-theoretic properties of μ,ν, and it therefore allows us to recover the continuous entropy.
Suppose that X is a set and M is a σ-algebra on X. Let μ,ν be two measures on (X,M). We say that μ is absolutely continuous with respect to ν if ν(A)=0 ⇒ μ(A)=0, and we write μ≪ν in this case. Recall that a measure μ is σ-finite if there are $A_n\in M$ for each natural number n such that $\mu(A_n)<\infty$ for each n and $X=\bigcup_{n=0}^{\infty}A_n$.
Theorem: (Radon-Nikodym theorem) Suppose that μ≪ν and μ,ν are both σ-finite measures. Then there exists a nonnegative measurable function $f:X\to[0,\infty)$ such that $\mu(A)=\int_A f(x)\,d\nu(x)$ for each A∈M.
The function f is called the Radon-Nikodym derivative of μ with respect to ν and is denoted by $f=\frac{d\mu}{d\nu}$. The Radon-Nikodym derivative is unique up to a set of ν-measure zero.
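On a finite measurable space the theorem is easy to verify by hand: the Radon-Nikodym derivative is just a pointwise ratio. Here is a small sketch (with made-up example weights) checking that $\mu(A)=\int_A f\,d\nu$ holds for every measurable set A.

```python
from itertools import chain, combinations

# A finite measurable space: X with the power-set sigma-algebra, and two
# measures specified by their values on singletons (made-up example weights).
X = ["a", "b", "c"]
nu = {"a": 0.5, "b": 0.3, "c": 0.2}
mu = {"a": 0.1, "b": 0.6, "c": 0.3}

# When nu({x}) > 0 for every x, the Radon-Nikodym derivative dmu/dnu is
# simply the pointwise ratio of the singleton measures.
f = {x: mu[x] / nu[x] for x in X}

def measure(m, A):
    """The measure of the set A under m."""
    return sum(m[x] for x in A)

def integral(g, A, m):
    """The integral of g over A against the measure m -- a finite sum here."""
    return sum(g[x] * m[x] for x in A)

# Verify mu(A) = ∫_A f dnu for every measurable set A (every subset of X).
subsets = list(chain.from_iterable(combinations(X, r) for r in range(len(X) + 1)))
max_error = max(abs(measure(mu, A) - integral(f, A, nu)) for A in subsets)
print(max_error)  # at most floating-point rounding error
```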
We therefore define the Kullback-Leibler divergence as $$D_{KL}(\mu\|\nu)=\int_X \frac{d\mu}{d\nu}\cdot\log\left(\frac{d\mu}{d\nu}\right)d\nu=\int_X \log\left(\frac{d\mu}{d\nu}\right)d\mu,$$ which, when ν is the Lebesgue probability measure, coincides with the continuous entropy of $f=\frac{d\mu}{d\nu}$ up to a factor of −1.
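Concretely, when ν is the Lebesgue measure on [0,1] and μ is given by a density f, the divergence reduces to $\int_0^1 f(x)\log f(x)\,dx$ and can be approximated by a Riemann sum. A minimal sketch with hypothetical densities (for f(x) = 2x against the uniform density, the exact value is log 2 − 1/2):

```python
import math

def kl_divergence(f, g, n=100_000):
    """Approximate D_KL(mu || nu) for measures on [0, 1] with densities f
    and g, using the form ∫ log(dmu/dnu) dmu = ∫ f(x) log(f(x)/g(x)) dx."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        fx, gx = f(x), g(x)
        if fx > 0:  # terms with f(x) = 0 contribute nothing
            total += fx * math.log(fx / gx) * h
    return total

# With nu the uniform (Lebesgue) measure, D_KL(mu || nu) is the negative of
# the continuous entropy of f; for f(x) = 2x this is log 2 - 1/2 ≈ 0.193.
print(kl_divergence(lambda x: 2.0 * x, lambda x: 1.0))
```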
Now, the assumption that ν is isomorphic to the Lebesgue probability measure on [0,1] is a surprisingly general assumption that works for many probability measures ν.
A Boolean algebra B is said to be σ-complete if whenever $a_n\in B$ for each natural number n, the set $\{a_n : n\in\mathbb{N}\}$ has a least upper bound, which we denote by $\bigvee_{n=0}^{\infty}a_n$.
A measure algebra is a pair (B,μ) where B is a σ-complete Boolean algebra and $\mu:B\to[0,\infty)$ is a function with μ(x)=0 iff x=0 and $\mu(\bigvee_{n=0}^{\infty}a_n)=\sum_{n=0}^{\infty}\mu(a_n)$ whenever $a_m\wedge a_n=0$ for m≠n. If (B,μ) is a measure algebra, then define a metric d (and hence a topology and uniformity and whatnot) on B by setting $d(x,y)=\mu(x\oplus y)$, where x⊕y denotes the symmetric difference. Recall that a topological space is separable if it has a countable dense subset.
For example, if M is the collection of Lebesgue measurable subsets of [0,1] and I is the collection of measure-zero subsets of [0,1], then the quotient Boolean algebra M/I is σ-complete, and if μ is the measure on M/I induced by the Lebesgue measure, then (M/I,μ) is a separable atomless probability measure algebra. Up to isomorphism, it is the unique such measure algebra.
Theorem: (Caratheodory) Any two separable atomless probability measure algebras are isomorphic.
We have an even stronger uniqueness theorem for this measure algebra.
Theorem: Every atomless Borel probability measure on a complete separable metric space is isomorphic to the Lebesgue measure on the interval [0,1].
Suppose now that μ,ν are Borel probability measures on a complete separable metric space, possibly with atoms, and that μ≪ν. Then there is some probability measure η (for instance, the Lebesgue measure on [0,1]) for which ν×η is atomless and hence isomorphic to the Lebesgue probability measure. In this case, $\frac{d(\mu\times\eta)}{d(\nu\times\eta)}(x,y)=\frac{d\mu}{d\nu}(x)$ for almost all (x,y), so
$$D_{KL}(\mu\times\eta\|\nu\times\eta)=\int\log\left(\frac{d(\mu\times\eta)}{d(\nu\times\eta)}(x,y)\right)d(\mu\times\eta)(x,y)=\int\log\left(\frac{d\mu}{d\nu}(x)\right)d(\mu\times\eta)(x,y)=\int\log\left(\frac{d\mu}{d\nu}(x)\right)d\mu(x)=D_{KL}(\mu\|\nu).$$
In this case, we can pretend that ν×η is the Lebesgue measure on [0,1] and that $D_{KL}(\mu\times\eta\|\nu\times\eta)$ just measures the negative continuous entropy of the density of μ×η with respect to the Lebesgue probability measure on [0,1].
The KL-divergence also generalizes greatly. Given a function $h:[0,\infty)\to[0,\infty)$ (h is usually convex or concave), one can define another measure of similarity between probability measures, $D_h(\mu\|\nu)=\int h\left(\frac{d\mu}{d\nu}\right)d\nu$, known as the h-divergence. One can also define a measure on [0,∞) from the Radon-Nikodym derivative as the pushforward $\left(\frac{d\mu}{d\nu}\right)_*\nu$, where if g:X→Y and ν is a measure on X, then $g_*(\nu)$ is the measure on Y defined by $g_*(\nu)(R)=\nu(g^{-1}[R])$. This pushforward measure (the distribution of the random variable $\frac{d\mu}{d\nu}$ under ν) captures the similarity between the probability measures μ,ν, and from it one can always recover the h-divergence: $$\int h(x)\,d\left(\left(\tfrac{d\mu}{d\nu}\right)_*\nu\right)(x)=\int h\left(\frac{d\mu}{d\nu}(x)\right)d\nu(x)=D_h(\mu\|\nu).$$
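To make the last identity concrete, here is a small sketch on a three-point space (with made-up weights): the same finite sum ∫ h(dμ/dν) dν yields the KL, chi-squared, and total-variation divergences for different choices of h, and a Monte Carlo average of h(dμ/dν) over samples drawn from ν estimates the expectation with respect to the pushforward measure.

```python
import math
import random

# Two discrete probability measures on a three-point space (made-up weights).
mu = [0.1, 0.6, 0.3]
nu = [0.5, 0.3, 0.2]
rn = [m / n for m, n in zip(mu, nu)]  # the Radon-Nikodym derivative dmu/dnu

def h_divergence(h):
    """D_h(mu || nu) = ∫ h(dmu/dnu) dnu, a finite sum on this space."""
    return sum(h(r) * n for r, n in zip(rn, nu))

kl  = h_divergence(lambda t: t * math.log(t))     # h(t) = t log t: KL divergence
chi = h_divergence(lambda t: (t - 1.0) ** 2)      # h(t) = (t-1)^2: chi-squared
tv  = h_divergence(lambda t: abs(t - 1.0) / 2.0)  # h(t) = |t-1|/2: total variation
print(kl, chi, tv)

# E[h] under the pushforward (dmu/dnu)_* nu is the same quantity; estimate it
# by sampling points from nu and evaluating h(dmu/dnu) at each sample.
random.seed(0)
draws = random.choices(range(3), weights=nu, k=200_000)
chi_mc = sum((rn[x] - 1.0) ** 2 for x in draws) / len(draws)
print(chi_mc)  # a Monte Carlo estimate of the chi-squared divergence
```

The Monte Carlo step is exactly the change-of-variables identity above: averaging h over the distribution of dμ/dν under ν recovers the h-divergence.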