KL-divergence and map territory distinction
Crosspost from my blog
The cross-entropy is defined as the expected surprise when drawing from $p(x)$, which we're modeling as $q(x)$. Our map is $q(x)$, while $p(x)$ is the territory.
$$H(p,q) = \sum_x p(x) \log \frac{1}{q(x)}$$
Now it should be intuitively clear that $H(p,q) \ge H(p,p)$, because an imperfect model $q(x)$ will (on average) surprise us more than the perfect model $p(x)$. This is Gibbs' inequality, and $H(p,p)$ is just the entropy $H(p)$.
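To make this concrete, here's a minimal numeric check in Python (the two distributions are arbitrary examples I chose, not anything from this post):

```python
import numpy as np

# Territory p(x) and an imperfect map q(x) over three outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

def cross_entropy(p, q):
    # Expected surprise: draw from p, measure surprise log(1/q(x)).
    return np.sum(p * np.log(1.0 / q))

print(cross_entropy(p, q))  # H(p, q) ≈ 0.986
print(cross_entropy(p, p))  # H(p, p) ≈ 0.802  (the entropy of p)
# H(p, q) >= H(p, p), as Gibbs' inequality guarantees.
```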
To measure the unnecessary surprise from approximating $p(x)$ by $q(x)$, we define
$$D_{\mathrm{KL}}(p \,\|\, q) = H(p,q) - H(p,p)$$
This is the KL-divergence! The average additional surprise we incur because our map only approximates the territory.
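Continuing the sketch above, the KL-divergence is just the difference of the two cross-entropies; `scipy.stats.entropy`, given two arguments, computes the same quantity directly:

```python
from scipy.stats import entropy

d_kl = cross_entropy(p, q) - cross_entropy(p, p)
print(d_kl)           # ≈ 0.184 nats of unnecessary surprise
print(entropy(p, q))  # same value via scipy's KL implementation
```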
Now it's time for an exercise. In the following figure, $q^*(x)$ is the Gaussian that minimizes either $D_{\mathrm{KL}}(p \| q)$ or $D_{\mathrm{KL}}(q \| p)$; can you tell which is which?
The left is minimizing $D_{\mathrm{KL}}(p \| q)$, while the right is minimizing $D_{\mathrm{KL}}(q \| p)$.
Reason as follows:
If $p$ is the territory, then the left $q^*$ is a better map (of $p$) than the right $q^*$.
If $p$ is the map, then the territory $q^*$ on the right leaves us less surprised than the territory on the left, because on the left $p$ would be very surprised by data in the middle, even though such data is likely according to the territory $q^*$.
On the left we fit the map to the territory; on the right we fit the territory to the map.
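This asymmetry is easy to reproduce numerically. Here is a minimal sketch (my own construction, with a bimodal mixture standing in for the figure's $p$): we fit a single Gaussian $q$ by grid-approximating each direction of the KL. Minimizing $D_{\mathrm{KL}}(p \| q)$ spreads $q^*$ to cover both modes, while minimizing $D_{\mathrm{KL}}(q \| p)$ collapses $q^*$ onto one mode (which one depends on the initialization).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

# Territory: a bimodal mixture of two Gaussians.
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

def kl(a, b):
    # Grid approximation of D_KL(a || b), guarding against log(0).
    eps = 1e-12
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

def fit(direction):
    def loss(params):
        mu, log_sigma = params  # log-parametrize sigma to keep it positive
        q = norm.pdf(x, mu, np.exp(log_sigma))
        return kl(p, q) if direction == "forward" else kl(q, p)
    return minimize(loss, x0=[0.5, 0.0], method="Nelder-Mead").x

mu_f, ls_f = fit("forward")  # minimizes D_KL(p || q): mode-covering
mu_r, ls_r = fit("reverse")  # minimizes D_KL(q || p): mode-seeking
print(f"forward: mu={mu_f:.2f}, sigma={np.exp(ls_f):.2f}")  # wide, centered
print(f"reverse: mu={mu_r:.2f}, sigma={np.exp(ls_r):.2f}")  # hugs one mode
```

The forward fit should land near $\mu \approx 0$ with a large $\sigma$ (moment matching), while the reverse fit should end up near one of the modes with $\sigma \approx 1$: fitting the map to the territory covers everything the territory considers likely, while fitting the territory to the map avoids placing mass where the map would be very surprised.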