Thought experiments on simplicity in logical probability
A common feature of many proposed logical priors is a preference for simple sentences over complex ones. This is sort of like an extension of Occam’s razor into math. Simple things are more likely to be true. So, as it is said, “why not?”
Well, the analogy has some wrinkles—unlike hypothetical rules for the world, logical sentences do not form a mutually exclusive set. Instead, for every sentence A there is a sentence not-A with pretty much the same complexity, and probability 1-P(A). So you can’t make the probability smaller for all complex sentences, because their negations are also complex sentences! If you don’t have any information that discriminates between them, A and not-A will both get probability 1⁄2 no matter how complex they get.
But if our agent knows something that breaks the symmetry between A and not-A, like that A belongs to a mutually exclusive and exhaustive set of sentences with differing complexities, then it can assign higher probabilities to simpler sentences in this set without breaking the rules of probability. Except, perhaps, the rule about not making up information.
The question: is the simpler answer really more likely to be true than the more complicated answer, or is this just a delusion? If it really is more likely, is that for some ontologically basic reason, or for a contingent and explainable reason?
There are two complications to draw your attention to. The first is in what we mean by complexity. It would be nice to use the Kolmogorov complexity of a sentence, which is the length of the shortest program that prints the sentence, but Kolmogorov complexity is uncomputable, so the kind of agent we want to build in the real world cannot rely on it. The only thing our real-world agent is assured of seeing is the length of the sentence as-is. We can also find something in between Kolmogorov complexity and length by doing a brief search for short programs that print the sentence. This is the sense of complexity usually meant in this article, and I’ll call it “apparent complexity.”
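As a rough illustration of apparent complexity, here is a minimal sketch that stands in for the "brief search for short programs" with an off-the-shelf compressor. Using zlib is purely an assumption made for the example, as is the function name apparent_complexity; a compressor only explores a narrow family of "programs," so this is an upper bound in spirit and not a faithful measure.

```python
import zlib

def apparent_complexity(sentence: str) -> int:
    """Bytes needed by the better of two cheap descriptions of the sentence.

    Assumption for illustration: zlib compression plays the role of the
    brief search for short programs. The raw length is always available
    as a fallback, so the result never exceeds it.
    """
    raw = sentence.encode("utf-8")
    compressed = zlib.compress(raw, 9)
    return min(len(raw), len(compressed))

# Two strings of equal length: one obviously regular, one messy.
simple = "0" * 50
messy = "31415926535897932384626433832795028841971693993751"

print(apparent_complexity(simple), apparent_complexity(messy))
```

Note that the messy string here is the first 50 digits of pi, so its true Kolmogorov complexity is low even though the compressor cannot see that. This is exactly the gap between apparent complexity and the real thing.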
The second complication is in what exactly a simplicity prior is supposed to look like. In the case of Solomonoff induction the shape is exponential—more complicated hypotheses are exponentially less likely. But why not a power law? Why not even a Poisson distribution? Does the difficulty of answering this question mean that thinking that simpler sentences are more likely is a delusion after all?
1: Suppose our agent knew from a trusted source that some extremely complicated sum could only be equal to A, or to B, or to C, which are three expressions of differing complexity. What are the probabilities?
Commentary: This is the most sparse form of the question. Not very helpful regarding the “why,” but handy to stake out the “what.” Do the probabilities follow a nice exponential curve? A power law? Or, since there are just the three known options, do they get equal consideration?
This is all based on intuition, of course. What does intuition say when various knobs of this situation are tweaked: if the sum is of unknown complexity, or of complexity about that of C? If there are a hundred options, or countably many? Intuitively speaking, does it seem like favoring simpler sentences is an ontologically basic part of your logical prior?
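To make the candidate shapes concrete, here is a small sketch that normalizes a few different weightings over three mutually exclusive options. The complexities (10, 20, 40), the power-law exponent, and the Poisson mean are all hypothetical numbers chosen for illustration; nothing here says which shape is right.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def exponential_prior(complexities):
    # Solomonoff-style: each extra unit of complexity halves the weight.
    return normalize([2.0 ** -c for c in complexities])

def power_law_prior(complexities, exponent=2.0):
    # Hypothetical exponent; any positive value favors simpler options.
    return normalize([c ** -exponent for c in complexities])

def poisson_prior(complexities, mean=15.0):
    # Poisson weights computed in log space to avoid huge factorials.
    logs = [c * math.log(mean) - mean - math.lgamma(c + 1) for c in complexities]
    return normalize([math.exp(x) for x in logs])

def uniform_prior(complexities):
    # Ignore complexity entirely: equal consideration for each option.
    return normalize([1.0 for _ in complexities])

# Hypothetical apparent complexities of A, B, and C.
complexities = [10, 20, 40]
for name, prior in [("exponential", exponential_prior),
                    ("power law", power_law_prior),
                    ("poisson", poisson_prior),
                    ("uniform", uniform_prior)]:
    print(name, [round(p, 6) for p in prior(complexities)])
```

The exponential and power-law shapes always rank A above B above C, but they disagree enormously about by how much. The Poisson shape need not even be monotonic, since it peaks near its mean.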
2: Consider subsequences of the digits of pi. If I give you a pair (n,m), you can tell me the m digits following the nth digit of pi. So if I start a sentence like “the subsequence of digits of pi (10^100, 10^2) = ”, do you expect to see simpler strings of digits on the right side? Is this a testable prediction about the properties of pi?
Commentary: We know that there is always a short-ish program to produce the sequences, which is just to compute the relevant digits of pi. This sets a hard upper bound on the Kolmogorov complexity of any such subsequence (a bound that grows only logarithmically as you increase m and n), and past a certain m this will genuinely start restricting complicated sequences, and thus favoring “all zeros.” Or does it?
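The counting behind that bound can be sketched as follows. The assumptions are loud ones: overhead_bits is a made-up size for the fixed pi-computing program, and n and m are charged their plain binary lengths (a self-delimiting encoding would cost somewhat more). There are fewer than 2^(b+1) programs of at most b bits, while there are 10^m digit strings of length m, so the restriction can only bite once 10^m exceeds 2^(b+1).

```python
import math

def description_bound_bits(n, m, overhead_bits):
    """Rough upper bound on the bits needed to describe the subsequence (n, m).

    Assumption: a fixed pi program of `overhead_bits` bits, plus n and m
    written in binary.
    """
    return overhead_bits + n.bit_length() + m.bit_length()

def smallest_restricted_m(n, overhead_bits):
    """Smallest m at which some m-digit strings must be too complex to appear.

    Fewer than 2**(b + 1) strings have a description of at most b bits,
    so we look for the first m with m * log2(10) > b + 1.
    """
    m = 1
    while m * math.log2(10) <= description_bound_bits(n, m, overhead_bits) + 1:
        m += 1
    return m

# Hypothetical 1000-bit pi program, starting position n = 10^100.
print(smallest_restricted_m(10 ** 100, overhead_bits=1000))
```

Under these made-up numbers the threshold lands well above m = 10^2, so the restriction would not yet apply to the example sentence, and even past the threshold it only excludes strings without telling us how the allowed ones are weighted.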
After all, this is weak tea compared to an exponential simplicity prior, for which the all-zero sequence would be hojillions of times more likely than a messy one. On the other hand, an exponential curve allows sequences with higher Kolmogorov complexity than the computation of the digits of pi.
Does the low-level view outlined in the first paragraph of this commentary demonstrate that the exponential prior is bunk? Or can you derive one from the other with appropriate simplifications (keeping in mind Kolmogorov complexity vs. apparent complexity)? Does pi really contain more long simple strings than expected, and if not what’s going on with our prior?
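The "testable prediction" part can at least be poked at for small n. The sketch below computes decimal digits of pi with Machin's formula in integer arithmetic and counts windows of m identical digits, comparing against the count expected if digits were independent and uniform. The window length, the number of digits, and the choice of "all the same digit" as the stand-in for "simple" are assumptions for illustration, and a few thousand digits says nothing about positions like 10^100.

```python
def arctan_inverse(x, unity):
    """Approximate unity * arctan(1/x) with the Taylor series, using integers."""
    term = unity // x
    total = term
    x_squared = x * x
    divisor = 3
    sign = -1
    while term:
        term //= x_squared
        total += sign * (term // divisor)
        divisor += 2
        sign = -sign
    return total

def pi_digit_string(count, guard=10):
    """Return the first `count` decimal digits of pi after the leading 3."""
    unity = 10 ** (count + guard)  # guard digits absorb truncation error
    # Machin: pi = 16 * arctan(1/5) - 4 * arctan(1/239)
    scaled = 4 * (4 * arctan_inverse(5, unity) - arctan_inverse(239, unity))
    return str(scaled // 10 ** guard)[1:]

def count_constant_windows(digits, m):
    """Count positions where the next m digits are all the same digit."""
    return sum(1 for i in range(len(digits) - m + 1)
               if len(set(digits[i:i + m])) == 1)

def expected_constant_windows(length, m):
    """Expected count if every digit were independent and uniform."""
    return (length - m + 1) * 10.0 ** (1 - m)

digits = pi_digit_string(2000)
print("observed:", count_constant_windows(digits, 3))
print("expected:", expected_constant_windows(len(digits), 3))
```

A systematic excess of such windows would be evidence for the simplicity prediction, and counts in line with the uniform expectation would be evidence against it, at least at this scale.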
3: Suppose I am writing an expression that I want to equal some number you know—that is, the sentence “my expression = your number” should be true. If I tell you the complexity of my expression, what can you infer about the likelihood of the above sentence?
Commentary: If we had access to the Kolmogorov complexity of your number, then we could completely rule out expressions that were too K-simple to work. With only an approximation, it seems like we can still say that simple expressions are less likely to work, up to a point. Then as my expression gets more and more complicated, there are more and more available wrong answers (and, outside of the system a bit, it becomes less and less likely that I know what I’m doing), and so probability goes down.
In the limit that my expression is much more complex than your number, does an elegant exponential distribution emerge from underlying considerations?
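One way to explore this in miniature is to enumerate every expression in a toy language and see what fraction of them hit a given target at each size. Everything about the toy language is an assumption made for the example: leaves are the numbers 1, 2, and 3, the operators are + and *, "complexity" is the number of leaves, and all expressions of a given size are treated as equally likely.

```python
from collections import Counter

LEAVES = (1, 2, 3)  # hypothetical atoms of the toy expression language

def value_counts_by_size(max_leaves):
    """Map size -> Counter of how many expression trees produce each value."""
    counts = {1: Counter(LEAVES)}
    for size in range(2, max_leaves + 1):
        current = Counter()
        # Split the leaves between a left and a right subexpression.
        for left_size in range(1, size):
            left = counts[left_size]
            right = counts[size - left_size]
            for left_value, left_count in left.items():
                for right_value, right_count in right.items():
                    ways = left_count * right_count
                    current[left_value + right_value] += ways
                    current[left_value * right_value] += ways
        counts[size] = current
    return counts

def hit_fraction(counts, size, target):
    """Fraction of size-`size` expressions whose value equals the target."""
    total = sum(counts[size].values())
    return counts[size][target] / total

counts = value_counts_by_size(7)
for size in range(1, 8):
    print(size, hit_fraction(counts, size, target=12))
```

With a target of 12, expressions of one or two leaves cannot reach it at all, which is the "too simple to work" regime. Whether the tail at larger sizes looks exponential, and how much that depends on the toy language, is the open question.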