KL-divergence and map territory distinction
Crosspost from my blog
The cross-entropy is defined as the expected surprise when drawing from $p(x)$, which we're modeling as $q(x)$. Our map is $q(x)$, while $p(x)$ is the territory.
$$H(p,q) = \sum_x p(x) \log \frac{1}{q(x)}$$
Now it should be intuitively clear that $H(p,q) \ge H(p,p)$, because an imperfect model $q(x)$ will (on average) surprise us more than the perfect model $p(x)$. This is Gibbs' inequality, and $H(p,p)$ is just the entropy $H(p)$.
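To make this concrete, here's a minimal numeric check in Python (the two distributions are arbitrary examples I chose, not anything from this post):

```python
import numpy as np

# Territory p(x) and an imperfect map q(x) over three outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

def cross_entropy(p, q):
    # Expected surprise: draw from p, measure surprise log(1/q(x)).
    return np.sum(p * np.log(1.0 / q))

print(cross_entropy(p, q))  # H(p, q) ≈ 0.986
print(cross_entropy(p, p))  # H(p, p) ≈ 0.802  (the entropy of p)
# H(p, q) >= H(p, p), as Gibbs' inequality guarantees.
```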
To measure the unnecessary surprise from approximating $p(x)$ by $q(x)$, we define
$$D_{\mathrm{KL}}(p \,\|\, q) = H(p,q) - H(p,p)$$
This is the KL-divergence! The average additional surprise we incur because our map only approximates the territory.
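Continuing the sketch above, the KL-divergence is just the difference of the two cross-entropies; `scipy.stats.entropy`, given two arguments, computes the same quantity directly:

```python
from scipy.stats import entropy

d_kl = cross_entropy(p, q) - cross_entropy(p, p)
print(d_kl)           # ≈ 0.184 nats of unnecessary surprise
print(entropy(p, q))  # same value via scipy's KL implementation
```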
Now it's time for an exercise. In the following figure, $q^*(x)$ is the Gaussian that minimizes either $D_{\mathrm{KL}}(p \| q)$ or $D_{\mathrm{KL}}(q \| p)$; can you tell which is which?
The left is minimizing $D_{\mathrm{KL}}(p \| q)$, while the right is minimizing $D_{\mathrm{KL}}(q \| p)$.
Reason as follows:
If $p$ is the territory, then the left $q^*$ is a better map (of $p$) than the right $q^*$.
If $p$ is the map, then the territory $q^*$ on the right leaves us less surprised than the territory on the left, because on the left $p$ would be very surprised by data in the middle, even though such data is likely according to the territory $q^*$.
On the left we fit the map to the territory; on the right we fit the territory to the map.
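This asymmetry is easy to reproduce numerically. Here is a minimal sketch (my own construction, with a bimodal mixture standing in for the figure's $p$): we fit a single Gaussian $q$ by grid-approximating each direction of the KL. Minimizing $D_{\mathrm{KL}}(p \| q)$ spreads $q^*$ to cover both modes, while minimizing $D_{\mathrm{KL}}(q \| p)$ collapses $q^*$ onto one mode (which one depends on the initialization).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

# Territory: a bimodal mixture of two Gaussians.
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)

def kl(a, b):
    # Grid approximation of D_KL(a || b), guarding against log(0).
    eps = 1e-12
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

def fit(direction):
    def loss(params):
        mu, log_sigma = params  # log-parametrize sigma to keep it positive
        q = norm.pdf(x, mu, np.exp(log_sigma))
        return kl(p, q) if direction == "forward" else kl(q, p)
    return minimize(loss, x0=[0.5, 0.0], method="Nelder-Mead").x

mu_f, ls_f = fit("forward")  # minimizes D_KL(p || q): mode-covering
mu_r, ls_r = fit("reverse")  # minimizes D_KL(q || p): mode-seeking
print(f"forward: mu={mu_f:.2f}, sigma={np.exp(ls_f):.2f}")  # wide, centered
print(f"reverse: mu={mu_r:.2f}, sigma={np.exp(ls_r):.2f}")  # hugs one mode
```

The forward fit should land near $\mu \approx 0$ with a large $\sigma$ (moment matching), while the reverse fit should end up near one of the modes with $\sigma \approx 1$: fitting the map to the territory covers everything the territory considers likely, while fitting the territory to the map avoids placing mass where the map would be very surprised.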