My intuition is that, for the iid bits with uncomputable probability $r$, there will exist an optimal (up to a constant) solution of the form:

1. compute some real number $\hat{r}$ by running the shortest program that computes it;
2. generate bits iid from $\mathrm{Ber}(\hat{r})$.
If this is the case, then the optimal algorithm will choose $\hat{r}$ to minimize $K(\hat{r}) + n\,\mathrm{KL}(\mathrm{Ber}(r)\,\|\,\mathrm{Ber}(\hat{r}))$, where $n$ is the number of bits observed. As $n$ goes to infinity, minimizing this objective causes $\hat{r}$ to converge to $r$.
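Here is a minimal numerical sketch of that trade-off. The caveat is that `complexity_proxy` is only a crude stand-in for $K(\hat{r})$ (it charges one bit per bit of precision of a dyadic $\hat{r}$, plus a small length prefix), not the real Kolmogorov complexity, and the search is restricted to dyadic candidates near $r$:

```python
import math

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) in bits."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log2(a / b)
    return term(p, q) + term(1 - p, 1 - q)

def complexity_proxy(precision_bits):
    """Crude stand-in for K(r_hat): bits to write a dyadic rational with
    `precision_bits` bits of precision, plus a length prefix. This is an
    assumption for illustration, not actual Kolmogorov complexity."""
    return precision_bits + 2 * math.log2(precision_bits + 1)

def best_dyadic_hypothesis(r, n, max_precision=40):
    """Minimize complexity_proxy(b) + n * KL(Ber(r) || Ber(r_hat)) over
    dyadic candidates r_hat = k / 2^b near r, for each precision b."""
    best_cost, best_r_hat = float("inf"), None
    for b in range(1, max_precision + 1):
        center = round(r * 2 ** b)
        for k in range(max(1, center - 2), min(2 ** b - 1, center + 2) + 1):
            r_hat = k / 2 ** b
            cost = complexity_proxy(b) + n * kl_bernoulli(r, r_hat)
            if cost < best_cost:
                best_cost, best_r_hat = cost, r_hat
    return best_r_hat, best_cost

if __name__ == "__main__":
    r = 0.2371  # stands in for the uncomputable bias
    for n in [10, 100, 10_000, 1_000_000]:
        r_hat, cost = best_dyadic_hypothesis(r, n)
        print(f"n={n:>9}: r_hat={r_hat:.10f}, objective={cost:.2f}")
```

As $n$ grows, the printed $\hat{r}$ moves closer to $r$, because the $n\,\mathrm{KL}$ term increasingly dominates the slowly growing description cost.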
Now I’m not sure how to prove that an optimal (up to a constant) hypothesis of the form above exists. Basically this is saying that $K(\text{the } n \text{ bits}) \ge c + \min_{\hat{r}}\big(K(\hat{r}) + n\,\mathrm{KL}(\mathrm{Ber}(r)\,\|\,\mathrm{Ber}(\hat{r}))\big)$ for some constant $c$, with high probability. Possibly the argument goes something like: for any non-conditionally-iid way of assigning probabilities to the bits, there is a better iid way (by averaging the probabilities). And if you are flipping the bits iid, then at some point you compute the probability of each flip, so you might as well just optimally compress some $\hat{r}$. But I couldn’t quite formalize this argument, because hypotheses can encode correct correlations through overfitting.
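For what it’s worth, the easy (upper-bound) direction is just the standard two-part code; here is a sketch, writing $x_{1:n}$ for the observed bits and $P_{\hat{r}}$ for the iid $\mathrm{Ber}(\hat{r})$ likelihood (notation mine). For any computable $\hat{r}$,

$$K(x_{1:n}) \le K(\hat{r}) + \big\lceil -\log_2 P_{\hat{r}}(x_{1:n}) \big\rceil + O(1),$$

and since the bits are actually drawn iid from $\mathrm{Ber}(r)$, concentration of the empirical log-loss gives, with high probability,

$$-\log_2 P_{\hat{r}}(x_{1:n}) = n\big(H_2(r) + \mathrm{KL}(\mathrm{Ber}(r)\,\|\,\mathrm{Ber}(\hat{r}))\big) + O(\sqrt{n}),$$

where $H_2$ is the binary entropy. The $n H_2(r)$ term is the same for every $\hat{r}$, so minimizing over $\hat{r}$ trades off exactly $K(\hat{r})$ against $n\,\mathrm{KL}$, which is the objective above; the part I can’t formalize is the matching lower bound against arbitrary, non-iid hypotheses.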