The Radon-Nikodym theorem allows us to define the Kullback-Leibler divergence in a much more general setting. Even though this setting is more abstract, it makes clear how the KL-divergence serves as a notion of relative entropy: the abstract setting leaves essentially only one correct definition.
Suppose that f is the probability density function of a random variable supported on [0,1]. Then we define the continuous entropy of f as $\int_0^1 -f(x)\cdot\log(f(x))\,dx$.
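As a quick sanity check, the continuous entropy can be approximated numerically. The following is a minimal sketch using a midpoint Riemann sum, with two hypothetical example densities (the uniform density, whose entropy is 0, and f(x) = 2x, whose entropy is 1/2 − log 2).

```python
import math

def continuous_entropy(f, n=100_000):
    """Approximate the continuous entropy of a density f on [0, 1]
    by a midpoint Riemann sum of -f(x) * log(f(x))."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        fx = f(x)
        if fx > 0:  # the integrand 0 * log(0) is taken to be 0
            total -= fx * math.log(fx) * h
    return total

# The uniform density f = 1 on [0, 1] has continuous entropy 0.
print(continuous_entropy(lambda x: 1.0))

# The density f(x) = 2x has continuous entropy 1/2 - log 2 ≈ -0.193.
print(continuous_entropy(lambda x: 2.0 * x))
```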
Suppose that we are given two probability measures μ,ν on the Borel subsets of [0,1], where ν is the Lebesgue probability measure on [0,1] and μ has density f. Is there any way to define the continuous entropy of f abstractly, referring only to the measures μ,ν without referring to f itself? Yes there is. The Radon-Nikodym theorem allows us to recover f from the measure-theoretic properties of μ,ν, and it therefore allows us to recover the continuous entropy.
Suppose that X is a set and M is a σ-algebra on X. Let μ,ν be two measures on (X,M). We say that μ is absolutely continuous with respect to ν if ν(A)=0 ⇒ μ(A)=0, and we write μ≪ν in this case. Recall that a measure μ is σ-finite if there are $A_n\in M$ for each natural number n such that $\mu(A_n)<\infty$ for each n and $X=\bigcup_{n=0}^{\infty}A_n$.
Theorem: (Radon-Nikodym theorem) Suppose that μ≪ν and μ,ν are both σ-finite measures. Then there exists a nonnegative measurable function $f:X\to[0,\infty)$ such that $\mu(A)=\int_A f(x)\,d\nu(x)$ for each A∈M.
The function f is called the Radon-Nikodym derivative of μ with respect to ν and is denoted by $f=\frac{d\mu}{d\nu}$. The Radon-Nikodym derivative is unique up to a set of ν-measure zero.
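On a finite measurable space the theorem is easy to verify by hand: the Radon-Nikodym derivative is just a pointwise ratio. Here is a small sketch (with made-up example weights) checking that $\mu(A)=\int_A f\,d\nu$ holds for every measurable set A.

```python
from itertools import chain, combinations

# A finite measurable space: X with the power-set sigma-algebra, and two
# measures specified by their values on singletons (made-up example weights).
X = ["a", "b", "c"]
nu = {"a": 0.5, "b": 0.3, "c": 0.2}
mu = {"a": 0.1, "b": 0.6, "c": 0.3}

# When nu({x}) > 0 for every x, the Radon-Nikodym derivative dmu/dnu is
# simply the pointwise ratio of the singleton measures.
f = {x: mu[x] / nu[x] for x in X}

def measure(m, A):
    """The measure of the set A under m."""
    return sum(m[x] for x in A)

def integral(g, A, m):
    """The integral of g over A against the measure m -- a finite sum here."""
    return sum(g[x] * m[x] for x in A)

# Verify mu(A) = ∫_A f dnu for every measurable set A (every subset of X).
subsets = list(chain.from_iterable(combinations(X, r) for r in range(len(X) + 1)))
max_error = max(abs(measure(mu, A) - integral(f, A, nu)) for A in subsets)
print(max_error)  # at most floating-point rounding error
```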
We therefore define the Kullback-Leibler divergence as $$D_{KL}(\mu\|\nu)=\int_X \frac{d\mu}{d\nu}\cdot\log\left(\frac{d\mu}{d\nu}\right)d\nu=\int_X \log\left(\frac{d\mu}{d\nu}\right)d\mu,$$ which, when ν is the Lebesgue probability measure, coincides with the continuous entropy of $f=\frac{d\mu}{d\nu}$ up to a factor of −1.
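Concretely, when ν is the Lebesgue measure on [0,1] and μ is given by a density f, the divergence reduces to $\int_0^1 f(x)\log f(x)\,dx$ and can be approximated by a Riemann sum. A minimal sketch with hypothetical densities (for f(x) = 2x against the uniform density, the exact value is log 2 − 1/2):

```python
import math

def kl_divergence(f, g, n=100_000):
    """Approximate D_KL(mu || nu) for measures on [0, 1] with densities f
    and g, using the form ∫ log(dmu/dnu) dmu = ∫ f(x) log(f(x)/g(x)) dx."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        fx, gx = f(x), g(x)
        if fx > 0:  # terms with f(x) = 0 contribute nothing
            total += fx * math.log(fx / gx) * h
    return total

# With nu the uniform (Lebesgue) measure, D_KL(mu || nu) is the negative of
# the continuous entropy of f; for f(x) = 2x this is log 2 - 1/2 ≈ 0.193.
print(kl_divergence(lambda x: 2.0 * x, lambda x: 1.0))
```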
Now, the assumption that ν is isomorphic to the Lebesgue probability measure on [0,1] is a surprisingly general assumption that works for many probability measures ν.
A Boolean algebra B is said to be σ-complete if whenever $a_n\in B$ for each natural number n, the set $\{a_n : n\in\mathbb{N}\}$ has a least upper bound, which we denote by $\bigvee_{n=0}^{\infty}a_n$.
A measure algebra is a pair (B,μ) where B is a σ-complete Boolean algebra and $\mu:B\to[0,\infty)$ is a function with μ(x)=0 iff x=0 and $\mu(\bigvee_{n=0}^{\infty}a_n)=\sum_{n=0}^{\infty}\mu(a_n)$ whenever $a_m\wedge a_n=0$ for m≠n. If (B,μ) is a measure algebra, then define a metric d (and hence a topology and uniformity and whatnot) on B by setting $d(x,y)=\mu(x\oplus y)$, where x⊕y denotes the symmetric difference. Recall that a topological space is separable if it has a countable dense subset.
For example, if M is the collection of Lebesgue measurable subsets of [0,1] and I is the collection of measure-zero subsets of [0,1], then the quotient Boolean algebra M/I is σ-complete, and if μ is the measure on M/I induced by the Lebesgue measure, then (M/I,μ) is a separable atomless probability measure algebra. Up to isomorphism, it is the unique such measure algebra.
Theorem: (Caratheodory) Any two separable atomless probability measure algebras are isomorphic.
We have an even stronger uniqueness theorem for this measure algebra.
Theorem: Every atomless Borel probability measure on a complete separable metric space is isomorphic to the Lebesgue measure on the interval [0,1].
Suppose now that μ,ν are Borel probability measures on a complete separable metric space, possibly with atoms, and that μ≪ν. Then there is some probability measure η (for instance, the Lebesgue measure on [0,1]) for which ν×η is atomless and hence isomorphic to the Lebesgue probability measure. In this case, $\frac{d(\mu\times\eta)}{d(\nu\times\eta)}(x,y)=\frac{d\mu}{d\nu}(x)$ for almost all (x,y), so
$$D_{KL}(\mu\times\eta\|\nu\times\eta)=\int\log\left(\frac{d(\mu\times\eta)}{d(\nu\times\eta)}(x,y)\right)d(\mu\times\eta)(x,y)=\int\log\left(\frac{d\mu}{d\nu}(x)\right)d(\mu\times\eta)(x,y)=\int\log\left(\frac{d\mu}{d\nu}(x)\right)d\mu(x)=D_{KL}(\mu\|\nu).$$
In this case, we can pretend that ν×η is the Lebesgue measure on [0,1] and that $D_{KL}(\mu\times\eta\|\nu\times\eta)$ just measures the negative continuous entropy of the density of μ×η with respect to the Lebesgue probability measure on [0,1].
The KL-divergence also generalizes greatly. Given a function $h:[0,\infty)\to[0,\infty)$ (h is usually convex or concave), one can define another measure of similarity between probability measures, $D_h(\mu\|\nu)=\int h\left(\frac{d\mu}{d\nu}\right)d\nu$, known as the h-divergence. One can also define a measure on [0,∞) from the Radon-Nikodym derivative as the pushforward $\left(\frac{d\mu}{d\nu}\right)_*\nu$, where if g:X→Y and ν is a measure on X, then $g_*(\nu)$ is the measure on Y defined by $g_*(\nu)(R)=\nu(g^{-1}[R])$. This pushforward measure (the distribution of the random variable $\frac{d\mu}{d\nu}$ under ν) captures the similarity between the probability measures μ,ν, and from it one can always recover the h-divergence: $$\int h(x)\,d\left(\left(\tfrac{d\mu}{d\nu}\right)_*\nu\right)(x)=\int h\left(\frac{d\mu}{d\nu}(x)\right)d\nu(x)=D_h(\mu\|\nu).$$
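To make the last identity concrete, here is a small sketch on a three-point space (with made-up weights): the same finite sum ∫ h(dμ/dν) dν yields the KL, chi-squared, and total-variation divergences for different choices of h, and a Monte Carlo average of h(dμ/dν) over samples drawn from ν estimates the expectation with respect to the pushforward measure.

```python
import math
import random

# Two discrete probability measures on a three-point space (made-up weights).
mu = [0.1, 0.6, 0.3]
nu = [0.5, 0.3, 0.2]
rn = [m / n for m, n in zip(mu, nu)]  # the Radon-Nikodym derivative dmu/dnu

def h_divergence(h):
    """D_h(mu || nu) = ∫ h(dmu/dnu) dnu, a finite sum on this space."""
    return sum(h(r) * n for r, n in zip(rn, nu))

kl  = h_divergence(lambda t: t * math.log(t))     # h(t) = t log t: KL divergence
chi = h_divergence(lambda t: (t - 1.0) ** 2)      # h(t) = (t-1)^2: chi-squared
tv  = h_divergence(lambda t: abs(t - 1.0) / 2.0)  # h(t) = |t-1|/2: total variation
print(kl, chi, tv)

# E[h] under the pushforward (dmu/dnu)_* nu is the same quantity; estimate it
# by sampling points from nu and evaluating h(dmu/dnu) at each sample.
random.seed(0)
draws = random.choices(range(3), weights=nu, k=200_000)
chi_mc = sum((rn[x] - 1.0) ** 2 for x in draws) / len(draws)
print(chi_mc)  # a Monte Carlo estimate of the chi-squared divergence
```

The Monte Carlo step is exactly the change-of-variables identity above: averaging h over the distribution of dμ/dν under ν recovers the h-divergence.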