I am not sure my take on this is correct, so I’d be grateful if someone corrected me where I am wrong:
I think that if the goal were only to ‘predict’ this bit-sequence after already knowing the sequence itself, one could simply assign probability 1 to the known sequence.
In the OP, by contrast, we regard the bit-sequence as the output of some sequence-generator, of which only this part of the output is known. Because the data are limited, the cost of singling out a highly complex model from model-space has to be weighed against that model’s fit to the bit-sequence.
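
To make that trade-off concrete, here is a rough sketch of how I picture it, using a two-part description length (bits to single out the model, plus bits to encode the data given the model). This is only my illustration, not necessarily what the OP has in mind: the 16-bit sequence, the three candidate models, and especially the model costs in bits are all made-up assumptions.

```python
import math

def total_description_length(model_cost_bits, prob_of_data):
    """Two-part code length: bits to name the model plus bits to encode the data under it."""
    if prob_of_data == 0.0:
        return math.inf  # the model rules the data out entirely
    return model_cost_bits - math.log2(prob_of_data)

# Hypothetical toy data: 16 bits of a strictly alternating pattern.
bits = [0, 1] * 8
n = len(bits)

# Each candidate: (assumed cost in bits to single it out, probability it gives the sequence).
# The costs are illustrative guesses, not derived from any real encoding.
candidates = {
    "memorizer": (n, 1.0),       # hard-codes the sequence: perfect fit, but costs n bits
    "alternating": (2, 1.0),     # short rule that also fits perfectly
    "fair_coin": (1, 0.5 ** n),  # very short rule, but poor fit
}

scores = {
    name: total_description_length(cost, prob)
    for name, (cost, prob) in candidates.items()
}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score} bits")
print("shortest total description:", best)
```

Under these assumed costs, the memorizer's probability 1 buys nothing, because naming it takes as many bits as the sequence itself, while the simple rule that still fits comes out ahead.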