Hi! Interesting post!
Just to make sure I got it right: in Claim 1, if you subtract $\sum_n H(P_{\mu}(\cdot | x_n))$ from both sides of inequality (1.1), you get that the sum of the KL divergences $\sum_n KL(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n}))$ is at most the constant $C(\mu, M_1)$, right?
Then, dividing by $N$ on both sides, you get that the average of said KL divergences goes to 0 as $N$ goes to infinity, at least at rate $1/N$, is that correct?
Yes, subtracting $\sum_n H(P_{\mu}(\cdot | x_n))$ from inequality (1.1) does yield $\sum_{n=1}^N D_{KL}(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n})) \leq C(\mu, M_1)$. So, since the total KL divergence summed over the first $N$ data points is bounded by the same constant for every $N$, and KL divergences are never negative, $D_{KL}(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n}))$ must go to zero as $n \to \infty$, and fast enough for the sum to converge. (Strictly speaking, that rules out the terms staying above $c/n$ for all large $n$ for any $c > 0$, since $\sum_n c/n$ diverges; it doesn't force every individual term below $1/n$.)
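To make this concrete, here is a minimal numerical sketch, not the post's exact setup but the standard two-hypothesis analogue: a Bayes mixture over a true Bernoulli predictor and one alternative, where the prior weight of the true hypothesis stands in for $2^{-C(\mu, M_1)}$. The per-step KL divergences between the truth and the mixture's posterior predictive then sum, in expectation, to at most $-\log_2$ of that prior weight; all the specific numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypotheses, each a fixed Bernoulli predictor for binary labels.
# "mu" is the true one; its prior weight stands in for 2^{-C(mu, M_1)}.
# All numbers here are invented for illustration.
p_mu, p_other = 0.8, 0.3
prior = np.array([0.01, 0.99])  # small prior on the true hypothesis

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in bits."""
    return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

posterior = prior.copy()
total_kl = 0.0
N = 10_000
for n in range(N):
    # Posterior-predictive probability of y = 1 under the mixture.
    q = posterior[0] * p_mu + posterior[1] * p_other
    total_kl += bernoulli_kl(p_mu, q)
    # Sample a label from the true distribution, then do the Bayes update.
    y = rng.random() < p_mu
    lik = np.array([p_mu, p_other]) if y else np.array([1 - p_mu, 1 - p_other])
    posterior = posterior * lik
    posterior /= posterior.sum()

print(f"cumulative KL over {N} steps: {total_kl:.3f} bits")
print(f"-log2(prior weight of mu):   {-np.log2(prior[0]):.3f} bits")
```

On a typical run the cumulative KL lands well below the $-\log_2(0.01) \approx 6.64$-bit bound, and almost all of it is paid in the first few dozen steps, before the posterior concentrates on the true hypothesis.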
Though note that in real life, where $N$ is finite, $D_{KL}(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n}))$ can still go to zero very unevenly; it doesn't have to be monotonic.
For example, you might have $D_{KL}(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n})) = 0$ from $n = 10^3$ to $n = 10^6$, then suddenly see a small upward spike at $n = 10^6 + 1$. A way this might happen is if the first $10^6$ data points the inductor receives come from one data distribution, and the subsequent data points are drawn from a very different distribution. If there is a program $\mu'$ that is shorter than $\mu$ (so $C(\mu', M_1) < C(\mu, M_1)$) and that can predict the data labels for the first distribution but not the second, whereas $\mu$ can predict both, the inductor would favour $\mu'$ over $\mu$ and assign it higher probability until it starts seeing data from the second distribution. It might make up to $C(\mu', M_1)$ bits of prediction error early on, before its posterior becomes largely dominated by predictions that match $\mu'$, by around $n = 10^3$. After that, the KL divergence would go to zero for a while, because everything is being predicted accurately. Then, at $n = 10^6 + 1$, when we switch to the second data distribution, the KL divergence would go up again for a while, until the inductor has added another $\leq C(\mu, M_1) - C(\mu', M_1)$ bits of prediction error to the total. From then on the inductor would make predictions that match $\mu$, so the KL divergence would go back down to zero and this time stay there permanently; a toy simulation of this dynamic is sketched below.
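Here is a hedged sketch of that two-phase story in the same toy Python setup (with the switch moved to $n = 10^3$ so it runs quickly, and with only $\mu$ and $\mu'$ in the mixture, so the early burn-in against other programs collapses away; the program lengths and probabilities are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in bits."""
    eps = 1e-12  # guard against log(0)
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

SWITCH = 1_000  # the data distribution changes here (stand-in for 10^6)
N = 2_000

def p_true(n):       # the real environment: two phases
    return 0.9 if n < SWITCH else 0.1

def p_mu(n):         # "long" program mu: predicts both phases correctly
    return 0.9 if n < SWITCH else 0.1

def p_mu_prime(n):   # "short" program mu': only fits the first phase
    return 0.9

# Prior weights ~ 2^{-program length}; mu' is shorter, so it starts favoured.
posterior = np.array([2.0 ** -10, 2.0 ** -3])  # [mu, mu']
posterior /= posterior.sum()

for n in range(N):
    preds = np.array([p_mu(n), p_mu_prime(n)])
    q = posterior @ preds  # mixture's predictive probability of y = 1
    kl = bernoulli_kl(p_true(n), q)
    if n % 250 == 0 or SWITCH <= n <= SWITCH + 3:
        print(f"n={n:5d}  per-step KL = {kl:.5f} bits  P(mu) = {posterior[0]:.3f}")
    # Sample a label from the current true distribution and update.
    y = rng.random() < p_true(n)
    lik = preds if y else 1 - preds
    posterior = posterior * lik
    posterior /= posterior.sum()
```

Through the first phase both programs predict the labels equally well, so the per-step KL is exactly zero; at the switch it spikes while the posterior still favours the shorter $\mu'$, then decays back to zero after roughly $C(\mu, M_1) - C(\mu', M_1) = 7$ extra bits have been paid.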