Hi! Interesting post!
Just to make sure I got it right: in Claim 1, if you subtract $\sum_n H(P_{\mu}(\cdot | x_n))$ from both sides of inequality (1.1), you get that the sum of the KL divergences $\sum_n KL(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n}))$ is at most the constant $C(\mu, M_1)$, right?
Then, dividing by $N$ on both sides, you get that the average of said KL divergences goes to 0 as $N$ goes to infinity, at least at rate $1/N$, is that correct?
Yes, subtracting $\sum_n H(P_{\mu}(\cdot | x_n))$ from inequality (1.1) does yield $\sum_{n=1}^N D_{KL}(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n})) \leq C(\mu, M_1)$. So, since the total KL divergence summed over the first $N$ data points is bounded by the same constant for every $N$, and KL divergences are never negative, $D_{KL}(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n}))$ must go to zero as $n \to \infty$, and fast enough for the sum to converge. (Strictly speaking, that rules out the terms staying above $c/n$ for all large $n$ for any $c > 0$, since $\sum_n c/n$ diverges; it doesn't force every individual term below $1/n$.)
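To make this concrete, here is a minimal numerical sketch, not the post's exact setup but the standard two-hypothesis analogue: a Bayes mixture over a true Bernoulli predictor and one alternative, where the prior weight of the true hypothesis stands in for $2^{-C(\mu, M_1)}$. The per-step KL divergences between the truth and the mixture's posterior predictive then sum, in expectation, to at most $-\log_2$ of that prior weight; all the specific numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypotheses, each a fixed Bernoulli predictor for binary labels.
# "mu" is the true one; its prior weight stands in for 2^{-C(mu, M_1)}.
# All numbers here are invented for illustration.
p_mu, p_other = 0.8, 0.3
prior = np.array([0.01, 0.99])  # small prior on the true hypothesis

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in bits."""
    return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

posterior = prior.copy()
total_kl = 0.0
N = 10_000
for n in range(N):
    # Posterior-predictive probability of y = 1 under the mixture.
    q = posterior[0] * p_mu + posterior[1] * p_other
    total_kl += bernoulli_kl(p_mu, q)
    # Sample a label from the true distribution, then do the Bayes update.
    y = rng.random() < p_mu
    lik = np.array([p_mu, p_other]) if y else np.array([1 - p_mu, 1 - p_other])
    posterior = posterior * lik
    posterior /= posterior.sum()

print(f"cumulative KL over {N} steps: {total_kl:.3f} bits")
print(f"-log2(prior weight of mu):   {-np.log2(prior[0]):.3f} bits")
```

On a typical run the cumulative KL lands well below the $-\log_2(0.01) \approx 6.64$-bit bound, and almost all of it is paid in the first few dozen steps, before the posterior concentrates on the true hypothesis.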
Though note that in real life, where $N$ is finite, $D_{KL}(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n}))$ can still go to zero very unevenly; it doesn't have to be monotonic.
For example, you might have $D_{KL}(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n})) = 0$ from $n = 10^3$ to $n = 10^6$, then suddenly see a small upward spike at $n = 10^6 + 1$. A way this might happen is if the first $10^6$ data points the inductor receives come from one data distribution, and the subsequent data points are drawn from a very different distribution. If there is a program $\mu'$ that is shorter than $\mu$ (so $C(\mu', M_1) < C(\mu, M_1)$) and that can predict the data labels for the first distribution but not the second, whereas $\mu$ can predict both, the inductor would favour $\mu'$ over $\mu$ and assign it higher probability until it starts seeing data from the second distribution. It might make up to $C(\mu', M_1)$ bits of prediction error early on, before its posterior becomes largely dominated by predictions that match $\mu'$, by around $n = 10^3$. After that, the KL divergence would go to zero for a while, because everything is being predicted accurately. Then, at $n = 10^6 + 1$, when we switch to the second data distribution, the KL divergence would go up again for a while, until the inductor has added another $\leq C(\mu, M_1) - C(\mu', M_1)$ bits of prediction error to the total. From then on the inductor would make predictions that match $\mu$, so the KL divergence would go back down to zero and this time stay there permanently; a toy simulation of this dynamic is sketched below.
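Here is a hedged sketch of that two-phase story in the same toy Python setup (with the switch moved to $n = 10^3$ so it runs quickly, and with only $\mu$ and $\mu'$ in the mixture, so the early burn-in against other programs collapses away; the program lengths and probabilities are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in bits."""
    eps = 1e-12  # guard against log(0)
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

SWITCH = 1_000  # the data distribution changes here (stand-in for 10^6)
N = 2_000

def p_true(n):       # the real environment: two phases
    return 0.9 if n < SWITCH else 0.1

def p_mu(n):         # "long" program mu: predicts both phases correctly
    return 0.9 if n < SWITCH else 0.1

def p_mu_prime(n):   # "short" program mu': only fits the first phase
    return 0.9

# Prior weights ~ 2^{-program length}; mu' is shorter, so it starts favoured.
posterior = np.array([2.0 ** -10, 2.0 ** -3])  # [mu, mu']
posterior /= posterior.sum()

for n in range(N):
    preds = np.array([p_mu(n), p_mu_prime(n)])
    q = posterior @ preds  # mixture's predictive probability of y = 1
    kl = bernoulli_kl(p_true(n), q)
    if n % 250 == 0 or SWITCH <= n <= SWITCH + 3:
        print(f"n={n:5d}  per-step KL = {kl:.5f} bits  P(mu) = {posterior[0]:.3f}")
    # Sample a label from the current true distribution and update.
    y = rng.random() < p_true(n)
    lik = preds if y else 1 - preds
    posterior = posterior * lik
    posterior /= posterior.sum()
```

Through the first phase both programs predict the labels equally well, so the per-step KL is exactly zero; at the switch it spikes while the posterior still favours the shorter $\mu'$, then decays back to zero after roughly $C(\mu, M_1) - C(\mu', M_1) = 7$ extra bits have been paid.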