You seem to completely misunderstand the Kolmogorov prior, confusing hypotheses with events.
The probability of the event “A and B and C and D” is the sum of Kolmogorov-derived probabilities of every hypothesis implying “A and B and C and D” that’s also consistent with the data.
The probability of the event “A or B or C or D” is the sum of Kolmogorov-derived probabilities of every hypothesis implying “A or B or C or D” that’s also consistent with the data.
Hypotheses in the first set won’t be more complex than hypotheses in the second set; the second set is simply far, far larger.
The second probability consists of terms like: (1/2)^length(“A”) + (1/2)^length(“B”) + (1/2)^length(“A and B”) + (1/2)^length(“A or B”) + (1/2)^length(“A and C”) + …
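Concretely, in a toy model where raw string length stands in for program length (a big oversimplification, but it shows the shape of the two sums):

```python
# Weight each description by (1/2)^length; string length is a crude
# stand-in for the length of a program in a real description language.
def weight(description):
    return 0.5 ** len(description)

# A few descriptions implying "A and B" (the real set is infinite):
conj_descriptions = ["A and B", "A and B and C"]
# Descriptions implying "A or B" include all of the above and many more:
disj_descriptions = conj_descriptions + ["A", "B", "A or B", "A and C"]

p_conj = sum(weight(d) for d in conj_descriptions)
p_disj = sum(weight(d) for d in disj_descriptions)
assert p_disj > p_conj  # the disjunction sums over a strict superset
```

The disjunction wins not because its descriptions are simpler, but because every description counted for the conjunction is counted for it too, plus many others.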
In the post, the word “hypothesis” is used to refer to events (which is the usual meaning in the context of Bayesian updating). They are not confused; you just didn’t use the intended interpretation.
In the post, the word “hypothesis” is used to refer to events
Then the title “complexity of a hypothesis” makes no sense. Events (“hypotheses” in the post) have no complexity. Descriptions (“hypotheses” in my comment) have complexity.
The connection is between the complexity of descriptions and the probability of events, and it has none of the problems described in the post.
Even after adjusting for terminology, I can’t understand the point of your comment. Maybe you could flesh it out more formally in your own preferred terms?
For example, imagine we’re receiving a sequence of integers. Consider the statements A = “all are even”, B = “all are perfect squares”, C = A and B, D = A or B. In your formalism, which of these are “hypotheses”? Which are “events”? Which “have complexity”, and which don’t?
I think you simply point out that I’m using incorrect terminology: I should’ve used “descriptions of events” in place of “hypotheses”, and “hypotheses” in place of “predictors” (or “algorithms”, or “worlds”). Okay, I can live with that.
NB: onlookers, please don’t downvote parent. Upvoted back to 0.
He essentially argued about definitions, but presented that as a disagreement about content. This is a standard error (whether by oversight or not), hence should be downvoted.
Your post still wouldn’t make any sense. Events/worlds have prior probability, but not complexity. Descriptions/algorithms/predictors have complexity, but not prior probability.
An infinite number of descriptions of different lengths predict the same event.
Does this imply you should assign credence 2^-m to any statement about the universe (“hypothesis”) that has length m? [...]
This confuses “A and B and C and D” as a description (which has complexity but no prior probability) with “A and B and C and D” as a subset of possible worlds (which has prior probability but no complexity).
Speaking about Kolmogorov complexity requires a lot of precision; we can select a toy world, but there’s just no way to use anything less than a Turing-complete description language.
Except that if the world can be described in m bits, 2^-m serves as a lower bound on its probability, as long as it’s consistent with the data.
The hypothesis “the correct theory of everything is the lexicographically least algorithm with K-complexity 3^^^^3” is quite short, but the universal prior for it is astronomically low.
Is it? If you try to encode informal statements formally, they get very big. And doesn’t “lexicographically least algorithm with K-complexity 3^^^^3” violate Rice’s theorem? If it weren’t for these problems, 2^-m would still be a proper lower bound on the prior probability of this.
Any statement about the world (“description”) does have a prior probability, which is equal to the sum of prior probabilities of worlds satisfying that statement. This is kind of obvious.
The prior probability of a statement is not connected to its “complexity” (whatever that means). This is also kind of obvious, but many people have managed to miss it, so I wrote the post.
A formal encoding of example 2 (as a predicate on algorithms expressed in ZFC, say) isn’t very long or difficult. The notion of K-complexity is easily definable, though not computable. Just like BB numbers—also easily definable, but not computable. What does Rice’s theorem have to do with it? Maybe I’m missing something?
Any statement about the world (“description”) does have a prior probability, which is equal to the sum of prior probabilities of worlds satisfying that statement. This is kind of obvious.
By complexity I mean program length.
Programs used in Kolmogorov priors have to describe everything about the world. So it’s about programs like “print(2,4,6);” and “while(true){print(42);}” and their lengths. The world “2 4 6” has the probability induced by all programs that generate “2 4 6”.
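As a toy sketch of that induced prior (the “programs” here are just labeled strings, and character counts stand in for bits; a real construction would enumerate prefix-free binary programs):

```python
# Hypothetical mini-language: map each world-generating "program" to the
# world (output) it generates; weight each program by (1/2)^length.
programs = {
    "print(2,4,6);": "2 4 6",
    "for(i=2;i<=6;i+=2)print(i);": "2 4 6",
    "print(42);": "42",
}

def prior_of_world(world):
    # Probability induced on a world: sum of the weights of all
    # (listed) programs whose output is that world.
    return sum(0.5 ** len(src) for src, out in programs.items() if out == world)
```

So the world “2 4 6” picks up weight from both programs that generate it, while a world no listed program generates gets 0 here (in the real construction every world is generated by infinitely many programs).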
You don’t normally have world-testing programs, just world-generating programs.
Then why did you mention things like length(“A and B”) in your first comment? You can’t take the conjunction of two programs, can you? I’m confused.
I shouldn’t be replying this late at night, as my clarity suffers a lot, and this subject demands a lot of clarity.
There are meaningful ways to describe the Kolmogorov complexity of world-testing programs and their combinations, and there is a meaningful equivalent of conditional probability. But you don’t use them in a prior directly. When you want to define a prior with world-testing programs, you need to do the moral equivalent of padding every program with enough entropy to fully describe the world. So the program “numbers are 2, 4, 6” converts without entropy padding, the program “numbers are even” needs a lot of extra entropy, and “numbers are whatever Jesus wants them to be” requires as much entropy as an empty program.
My excuse is that when you sum over every possible padding, as in (1/2)^length(“A and B” + padding bits 0000) + (1/2)^length(“A and B” + padding bits 0001) + …, what you get is simply (1/2)^length(“A and B”) times a small language-dependent constant of (1/2)^length(program to convert entropy bits and a world-testing program into a world-generating program).
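That collapse is easy to verify numerically: summing (1/2)^(L + k) over all 2^k paddings of a fixed length k gives back (1/2)^L (the language-dependent conversion constant is ignored here):

```python
from itertools import product

L = len("A and B")  # length of the world-testing description
k = 4               # entropy bits needed to pin down a unique world

# Sum (1/2)^(L + k) over all 2^k possible padding strings.
total = sum(0.5 ** (L + k) for _padding in product("01", repeat=k))

# 2^k terms of 2^-(L+k) collapse to 2^-L.
assert abs(total - 0.5 ** L) < 1e-12
```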
So you have:
Length of world-testing program
Entropy needed to get world-generating program from world-testing program
Probability that world-testing program matches the world
For complete descriptions (no extra entropy needed) of a world that don’t have any shorter descriptions, the complexity of the description is very closely related to Kolmogorov complexity: “sequence is 50 times 42” is shorter than “sequence is 9, 0, 2, 9, 3, 2, 7, 5, 8, 1, 4, 9, 8, 7, 1, 0, 6, 2, 3, 4, 4, 8, 5, 7, 4, 4, 7, 0, 1, 2, 8, 3, 9, 0, 9, 0, 3, 2, 8, 4, 6, 0, 4, 7, 8, 3, 4, 2, 3, 6, 3”, both are unambiguous and about maximally concise (I generated the second randomly), so the first is more likely than the second.
For incomplete descriptions it’s more complicated, but the mathematics should be straightforward enough.
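A minimal numeric check of that comparison (character counts again as a crude proxy for program length; the digit list is the one from the example above):

```python
# Two unambiguous, roughly maximally concise descriptions of a sequence;
# the shorter description gets the larger (1/2)^length weight.
short_desc = "sequence is 50 times 42"
digits = [9, 0, 2, 9, 3, 2, 7, 5, 8, 1, 4, 9, 8, 7, 1, 0, 6, 2, 3, 4,
          4, 8, 5, 7, 4, 4, 7, 0, 1, 2, 8, 3, 9, 0, 9, 0, 3, 2, 8, 4,
          6, 0, 4, 7, 8, 3, 4, 2, 3, 6, 3]
long_desc = "sequence is " + ", ".join(str(d) for d in digits)

w_short = 0.5 ** len(short_desc)  # weight of the short description
w_long = 0.5 ** len(long_desc)    # weight of the long description
assert w_short > w_long           # shorter description -> higher prior weight
```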
Thanks, this angle looks interesting, and new to me. Could you explain some more how it’s supposed to work? For example, if the input is an infinite stream of numbers and my hypothesis is something like “next number will be 42” or “prime-numbered members will be prime”, should I pad it with infinite additional entropy?
Normally streams are assumed to be finitely long, with symbols from a finite set, to keep the math neat.
Extensions to infinitely long streams with finite complexity are conceptually fairly straightforward, using a trick very similar to a non-normalizable Bayesian prior (just as a uniform prior over the real numbers is well defined even though it integrates to infinity, so it’s not really a probability; or the entropy of a molecule when atom locations are unbounded continuous variables), but the math can get considerably messier.
Finite or not, streams must be countable.
So just select some ordering; then for stream-tester T and number N, the world generator is “i=0; for(w in streams) { if(T(w)) i += 1; if(i == N) { print(w); break; } }”. There are multiple ways to choose the iteration order and the encoding for N (simplest would be a fixed-bit-size number, though you obviously need a more complicated encoding for an infinite number of possibilities), but they’re just a constant away from each other, so you can ignore this detail.
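The same conversion in runnable form, with a toy finite stream space standing in for “all streams” (the alphabet, stream length, and enumeration order are arbitrary illustrative choices):

```python
from itertools import product

def world_generator(T, N, alphabet=(0, 1, 2), length=4):
    """Convert a stream-tester T into a world generator: enumerate streams
    in a fixed order and return the N-th one that T accepts (1-indexed).
    Toy version: streams are all length-4 tuples over {0, 1, 2}."""
    i = 0
    for w in product(alphabet, repeat=length):
        if T(w):
            i += 1
            if i == N:
                return w
    return None  # fewer than N streams match the tester

def all_even(w):
    # Example tester: "all members are even".
    return all(x % 2 == 0 for x in w)
```

For example, world_generator(all_even, 2) walks the enumeration until it hits the second all-even stream; the extra entropy needed to fix the world is exactly the encoding of N.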