I agree with your point in the canonical solomonoff sequence prediction case. I think your point is what I mean in my note by “you have pathology of not specifying even the hypothesis in the seq prediction case (like it’ll be better to drop bits and take the likelihood loss)”. I think this pathology is maybe not present in “function solomonoff” (I state this in the note as well but don’t really explain it), though I’m very much uncertain.
to state the hopeful “function solomonoff” story in more detail:
By “function solomonoff”, I mean that we have a data set of string pairs $(x_i, y_i)$, and we think of one hypothesis as being a program that takes in an $x$ and outputs a probability distribution on strings, from which $y$ is sampled. Let’s say that we are in the classification case so $y_i \in \{0, 1\}$ always (we’re distinguishing pictures which show dogs vs ones which don’t, say).
The “canonical loss” (from which one derives a posterior distribution via exponentiation) here would be the length of the program specifying the distribution plus the negative log likelihood it assigns to $y_i$, summed over all $i$. What I’m suggesting is this loss but with a higher coefficient on the length of the program than on the likelihood terms.
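In symbols (the notation here is mine, not fixed above): writing $\ell(h)$ for the length of the program $h$ and $p_h(\cdot \mid x)$ for the distribution it outputs on input $x$, the suggested loss is

$$L_\alpha(h) \;=\; \alpha\,\ell(h) \;-\; \sum_i \log_2 p_h(y_i \mid x_i),$$

where the canonical loss is the $\alpha = 1$ case and the suggestion is to take $\alpha > 1$, with posterior weights proportional to $2^{-L_\alpha(h)}$.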
Suppose that the classification boundary is most simply given by what we will consider a “simple model” of complexity $k$ bits, together with “systematic human error” which changes the answer from the simple model on a fraction $\varepsilon$ of the inputs, with the set of those inputs taking $m$ bits to specify.
If we turned this into sequence prediction by interleaving like $x_1, y_1, x_2, y_2, \ldots$, then I’d agree that if we penalize hypothesis length more steeply than likelihood, then instead of getting a model which does not predict the errors, we would get a universal-like hypothesis, which in particular starts to predict the human errors after being conditioned on sufficiently many bits. So the idea of more steep penalization of hypothesis length doesn’t do what we want in the sequence prediction case. But I have some hope that the function case doesn’t have this pathology?
Some models of the given data in the function case:
the “good model”: The distribution is given by the simple model with a probability $\varepsilon$ of flipping the answer on top (independently on each input). This gets complexity loss like $k$ plus something small for specifying the flip model, and its expected neg log likelihood is the binary entropy $H(\varepsilon) = -\varepsilon \log_2 \varepsilon - (1 - \varepsilon) \log_2 (1 - \varepsilon)$ bits per data point.
the “model that learns the errors”: This should generically take $k + m$ bits to specify, and it gets $0$ expected neg log likelihood.
the “50/50 random distribution” model: This takes $O(1)$ bits to specify and has $1$ bit of expected neg log likelihood per data point.
some “universal hypothesis model”: I’m not actually sure what this would even be in the function setting? If you handled the likelihood part by giving a global string of random bits which gets conditioned on other input-output pairs, then I agree we could write something bad just like in the sequence prediction case. But if each input gets its own private randomness, then I don’t see how to write down a universal hypothesis that gets good loss here.
So at least given these models, it looks like the “good model” could be a vertex of the convex hull of the set of attainable (hypothesis complexity, expected neg log likelihood) pairs? If it’s on the convex hull, it’s picked out by some loss of the form described (even in the limit of many data points, though we will need to increase the hypothesis term coefficient relative to the sum-of-log-likelihoods term as the data set size increases, i.e. in the bayesian picture we will need to pick a stronger prior when the data set is larger in this example).
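To make the comparison concrete, here is a minimal numeric sketch; all the constants ($k$, $\varepsilon$, $m$, $n$, $\alpha$) are hypothetical stand-ins I’m picking, since the setup above leaves them symbolic:

```python
from math import log2

# Hypothetical toy constants (the setup leaves these symbolic):
# k = complexity of the simple model in bits
# eps = fraction of inputs where systematic human error flips the answer
# m = bits needed to specify the set of error inputs
# n = number of data points
# alpha = coefficient on program length (alpha > 1 is the steeper penalty)
k, eps, m, n, alpha = 1000, 0.05, 10**6, 10**5, 5.0

def binary_entropy(p):
    # Expected neg log likelihood (bits per data point) of a model that is
    # right with probability 1 - p and flips its answer with probability p.
    return -p * log2(p) - (1 - p) * log2(1 - p)

# (program length in bits, expected neg log likelihood in bits per data point)
models = {
    "good model": (k, binary_entropy(eps)),  # simple model + eps-flip noise
    "learns the errors": (k + m, 0.0),       # memorizes the error set too
    "50/50 random": (1, 1.0),                # ignores the input entirely
}

# Loss of the form described: alpha * complexity + summed neg log likelihoods.
loss = {name: alpha * c + n * nll for name, (c, nll) in models.items()}
best = min(loss, key=loss.get)
print(best)
```

With these numbers the “good model” wins; note also that for fixed `alpha`, once `n * binary_entropy(eps)` exceeds `alpha * m` the error-learning model takes over, which matches the remark above that the hypothesis-term coefficient has to grow with the data set size.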
that said:
Maybe I’m just failing to construct the right “universal hypothesis” for this example?
It seems plausible that some other pathology is present that prevents nice behavior.
I haven’t spent that much time trying to come up with other pathological constructions or searching for a proof that something like the good model is optimal for some hyperparameter setting.
I can see some other examples where this functional setup still doesn’t work nicely. I might write more about that in a later comment. The example here is definitely somewhat cherry-picked for the idea to work, though I also don’t consider it completely contrived.
I think it’s very unlikely this steeper penalization is anywhere close to a full solution to the philosophical problem here. I only have some hope that it works in some specific toy cases.