Efficient Induction

(By which I mean, induction over efficient hypotheses.)

A standard prior is “uniform” over the class of computable functions (i.e., uniform over infinite strings which have a prefix that compiles). Why is this a natural choice of prior? Well, we’ve looked at the universe for a really long time, and it seems to be computable. The Church-Turing thesis says we have no reason to choose a bigger support for our prior.
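For concreteness, here is the usual way this prior gets written down (a standard sketch, glossing over details like prefix-freeness and minimality of programs): a program $p$ is guessed bit by bit with uniform probability, so the weight assigned to a string of observations $x$ is

$$M(x) \;=\; \sum_{p \,:\, U(p)\ \text{begins with}\ x} 2^{-|p|},$$

where $U$ is a fixed universal machine.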

Why stop there? There is a natural generalization of the Church-Turing thesis, naturally called the extended Church-Turing thesis, which asserts that the universe is computable in polynomial time. In fact, we have a strong suspicion that physics should be local, which means more or less precisely that updates can be performed using a linear-size logical circuit. Maybe we should restrict our prior further, looking only for small circuits.

(As we realized only slightly later, this extended hypothesis is almost certainly false, because in the real world probabilities are quantum. But if we replace “circuit” by “quantum circuit,” which is a much less arbitrary change than it seems at face value, then we are good. Are further changes forthcoming? I don’t know, but I suspect not.)

So we have two nice questions. First, what does a prior over efficiently computable hypotheses look like? Second, what sort of observation about the world could cause you to modify Solomonoff induction? For that matter, what sort of physical evidence ever convinced us that Solomonoff induction was a good idea in the first place, rather than a broader prior? I suspect that both of these questions have been tackled before, but Google has failed me, so I will repeat some observations.

Defining priors over “polynomially large” objects is a little more subtle than in usual Solomonoff induction. In some sense we need to penalize a hypothesis both for its description complexity and for its computational complexity. Here is a first try:

A hypothesis consists of an initial state of some length N, and a logical circuit of size M which takes N bits to K+N bits. The universe evolves by repeatedly applying the logical circuit to compute both a “next observation” (the first K bits) and the new state of the universe (the last N bits). The probability of a hypothesis drops off exponentially with its length.
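As a concrete sketch of this first scheme (the names below are my own, and the circuit is abstracted as a Python function rather than an explicit gate list):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Bits = Tuple[int, ...]

@dataclass
class Hypothesis:
    """An N-bit initial state plus a circuit taking N bits to K+N bits.

    The circuit is represented as a Python function here; in the scheme above
    it would be an explicit logical circuit of size M.
    """
    initial_state: Bits                    # N bits: the state of the universe
    circuit: Callable[[Bits], Bits]        # N bits -> K+N bits
    k: int                                 # K: observation bits produced per step
    description_length: int                # length of the hypothesis's encoding, in bits

    def prior(self) -> float:
        # Probability drops off exponentially with the hypothesis's length.
        return 2.0 ** (-self.description_length)

    def run(self, steps: int) -> List[Bits]:
        """Repeatedly apply the circuit, collecting the first K bits as observations."""
        state, observations = self.initial_state, []
        for _ in range(steps):
            out = self.circuit(state)
            observations.append(out[: self.k])   # "next observation"
            state = out[self.k:]                 # new N-bit state of the universe
        return observations

# Toy universe with N=3, K=1: observe the parity of the state, then rotate it.
toy = Hypothesis(
    initial_state=(1, 0, 1),
    circuit=lambda s: (sum(s) % 2,) + s[1:] + s[:1],
    k=1,
    description_length=20,   # made-up encoding length, for illustration only
)
print(toy.prior(), toy.run(4))
```

The thing to notice is that `description_length`, and hence the prior, has to pay for the entire N-bit initial state and the size-M circuit, which is exactly the problem discussed next.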

How reasonable is this? Why, not at all. The size of our hypothesis is the size of the universe, so it is going to take an awful lot of observations to surmount the improbability of living in a large universe. So what to do? Well, the reason this contradicts intuition is that we expect our physical theory (as well as the initial state of our system) to be uniform in some sense, so that it can hope to be described concisely even if the universe is large. Luckily for us, the notion of uniformity already exists for circuits, and in fact it appears to be the right notion. Instead of working with circuits directly, we specify a program which outputs that circuit (so if the laws of physics are uniform across space, the program just has to be told “tile this simple update rule at every point”). So now it goes like this:

A hypothesis consists of a program which outputs an initial state of some finite length N, and a program which outputs a logical circuit of size M which takes N bits to K+N bits. The observations are defined as before. The probability of a hypothesis drops off exponentially with its length.
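To make “tile this simple update rule at every point” concrete, here is a toy sketch (the choice of local rule and the circuit representation are my own, and only the circuit-emitting program is shown, not the observation bits): a constant-size program emits a circuit whose size grows with the universe.

```python
from typing import Dict, List, Tuple

# A fixed local rule: a cell's new value as a function of (left, self, right).
# Here, elementary cellular automaton rule 110, written out as a truth table.
RULE_110: Dict[Tuple[int, int, int], int] = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}

def build_circuit(n: int) -> List[Tuple[int, Tuple[int, int, int]]]:
    """Emit a size-O(n) circuit by tiling the local rule at every cell.

    Each entry reads "output wire i is RULE_110 applied to these three input
    wires". The generating program is constant size; only the circuit it
    outputs grows with the size n of the universe.
    """
    return [(i, ((i - 1) % n, i, (i + 1) % n)) for i in range(n)]

def apply_circuit(circuit, state: Tuple[int, ...]) -> Tuple[int, ...]:
    """Evaluate the emitted circuit on an n-bit state."""
    return tuple(RULE_110[(state[a], state[b], state[c])]
                 for _, (a, b, c) in circuit)

n = 16
circuit = build_circuit(n)   # the hypothesis pays only for this short program, plus n
state = tuple(1 if i == n // 2 else 0 for i in range(n))
for _ in range(4):
    state = apply_circuit(circuit, state)
    print(state)
```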

How reasonable is this? Well, it doesn’t exploit the conjectured computational efficiency of the universe at all. There are three measures of complexity here: the length of the hypothesis, the size of the universe, and the computational complexity of the update rule. We are only using the first. At least we now have these quantities in hand, so we can hope to incorporate them intelligently. One solution is to place an explicit bound on the complexity of the update rule in terms of the size of the universe. It is easy to see that this approach is doomed to fail. An alternative is to explicitly include terms depending on all three complexity measures in the prior probability of each hypothesis.
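One illustrative form of such a prior (my own sketch, not a canonical definition): if a hypothesis $h$ has encoding length $\ell(h)$, posits a universe of $N(h)$ bits, and uses an update circuit of size $M(h)$, take

$$P(h) \;\propto\; 2^{-\left(\ell(h) \,+\, \alpha \log_2 N(h) \,+\, \beta \log_2 M(h)\right)}$$

for some fixed constants $\alpha, \beta \ge 0$; setting $\alpha = 0$, $\beta = 1$ recovers the “penalize by log M” scheme below.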

There are some solutions I find aesthetically very attractive. For example, make the hypothesis a space-bounded Turing machine and also require it to specify the initial state of its read/write tape. More simply but less aesthetically, you could just penalize a hypothesis based on the logarithm of its running time (since this bounds the size of its output), or on log M. I think this scheme gives known physical theories very good description complexity. Overall, it strikes me as an interesting way of thinking about things.
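A quick numerical illustration of why this penalty is gentle (my own reading of the “penalize by log M” scheme; the numbers are made up): tiling a concise physical law over an enormous universe costs only about log2 of the universe’s size in extra improbability.

```python
import math

def log2_prior_weight(program_bits: int, circuit_size: int) -> float:
    """log2 of the unnormalized prior weight 2^-(program length + log2 M).

    program_bits: total length of the programs specifying the hypothesis.
    circuit_size: M, the size (or running time) of the emitted update circuit.
    """
    return -(program_bits + math.log2(circuit_size))

# A uniform law tiled over a 10^30-gate universe pays only ~100 extra bits
# beyond the length of its generating program.
print(log2_prior_weight(program_bits=5000, circuit_size=10**30))  # ~ -5099.7
print(log2_prior_weight(program_bits=5000, circuit_size=10**6))   # ~ -5019.9
```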

I don’t really know how to answer the second question (what sort of observation would cause us to restrict our prior in this way). I don’t really know where the description complexity prior comes from either; it feels obvious to me that the universe is computable, just like it feels obvious that the universe is efficiently computable. I don’t trust these feelings of obviousness, since they are coming from my brain. I guess the other justification is that we might as well stick to computable hypotheses, because we can’t use stronger hypotheses to generate predictions (our conscious thoughts, at least, appear to be computable). The same logic does have something to say about efficiency: we can’t use inefficient hypotheses to generate predictions, so we might as well toss them out. But this would lead us to just keep all sufficiently efficient hypotheses and throw out the rest, which doesn’t work very well (since the efficiency of a hypothesis depends on the size of the universe it posits, this basically amounts to putting a cap on the size of the universe). I don’t know of a similar justification which tells us to penalize hypotheses based on their complexity. The only heuristic is that it has worked well for screening out physical theories so far. That’s a pretty good thing to have going for you, at least.

In sum, thinking about priors over more restricted sets of hypotheses can be interesting. If anyone knows of a source which approaches this problem more carefully, I would be interested to learn.