Many in this world retain beliefs whose flaws a ten-year-old could point out
Very true. Case in point: the belief that “minimum description length” or “Solomonoff induction” can actually predict anything. Choose a language that can describe MWI more easily than Copenhagen, and they say you should believe MWI; choose a language that can describe Copenhagen more easily than MWI, and they say you should believe Copenhagen. I certainly could have told you that when I was ten...
The argument in this post is precisely analogous to the following:
Bayesian reasoning cannot actually predict anything. Choose priors that result in the posterior for MWI being greater than that for Copenhagen, and it says you should believe MWI; choose priors that result in the posterior for Copenhagen being greater than that for MWI, and it says you should believe Copenhagen.
The thing is, though, choosing one’s own priors is kind of silly, and choosing one’s own priors with the purpose of making the posteriors be a certain thing is definitely silly. Priors should be chosen to be simple but flexible. Likewise, choosing a language with the express purpose of being able to express a certain concept simply is silly; languages should be designed to be simple but flexible.
It seems to me that you’re waving the problem away instead of solving it. For example, I don’t know of any general method for devising a “non-silly” prior for any given parametric inference problem. Analogously, what if your starting language accidentally contains a shorter description of Copenhagen than MWI?
If you’re just doing narrow AI, then look at your hypothesis that describes the world (e.g. “For any two people, they have some probability X of having a relationship we’ll call P. For any two people with relationship P, every day, they have a probability Y of causing perception A.”), then fill in every parameter (in this case, we have X and Y) with reasonable distributions (e.g. X and Y independent, each with a 1⁄3 chance of being 0, a 1⁄3 chance of being 1, and a 1⁄3 chance of being drawn from the uniform distribution on [0, 1]).
Yes, I said “reasonable”. Subjectivity is necessary; otherwise, everyone would have the same priors. Just don’t give any statement an unusually low probability (e.g. a probability practically equal to zero that a certain physical constant is greater than Graham’s number), nor any statement an unusually high probability (e.g. a 50% probability that Christianity is true). I think good rules are that the language your prior corresponds to should not have any atoms that can be described reasonably easily (perhaps 10 atoms or fewer) using only other atoms, and that every atom should be mathematically useful.
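To make the example prior for X concrete, here is a minimal Python sketch of how that three-part prior (1/3 on X = 0, 1/3 on X = 1, 1/3 on X uniform over [0, 1]) would update after observing some pairs of people. The function name and the observation counts are illustrative assumptions, not anything from the comment itself.

```python
from fractions import Fraction
from math import factorial


def posterior_over_components(k, n):
    """Posterior weight of each prior component for X after k of n
    observed pairs (in a fixed order) turned out to have relationship P."""
    prior = Fraction(1, 3)  # each component gets 1/3, as in the comment

    # Likelihood of the observed sequence under each component.
    likelihood = {
        "X = 0": Fraction(int(k == 0)),  # possible only if no pair had P
        "X = 1": Fraction(int(k == n)),  # possible only if every pair had P
        # Integral of x^k (1 - x)^(n - k) over [0, 1] = k! (n - k)! / (n + 1)!
        "X ~ Uniform(0, 1)": Fraction(
            factorial(k) * factorial(n - k), factorial(n + 1)
        ),
    }
    joint = {name: prior * like for name, like in likelihood.items()}
    total = sum(joint.values())  # never zero: the uniform part is always positive
    return {name: weight / total for name, weight in joint.items()}


none_of_three = posterior_over_components(0, 3)
one_of_three = posterior_over_components(1, 3)
print(none_of_three)
print(one_of_three)
```

With three pairs and no P observed, the weight on X = 0 rises from 1/3 to 4/5 and the uniform component keeps 1/5; a single mixed observation (one of three) leaves only the uniform component standing.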
If the starting language accidentally contains a shorter description of Copenhagen than MWI? Spiffy! Assuming there is no evidence either way, Copenhagen will be more likely than MWI. Now, correct me if I’m wrong, but MWI is essentially the idea that the set of things causing wavefunction collapse is empty, while Copenhagen states that it is not empty. Supposing we end up with a 1⁄3 chance of MWI being true and a 2⁄3 chance that it’s some other simple thing, is that really a bad thing? Your agent will end up designing devices that will work only if a certain subinterpretation of the Copenhagen interpretation is true and try them out. Eventually, most of the simple, easily-testable versions of the Copenhagen interpretation will be ruled out—if they are, in fact, false—and we’ll be left with two things: unlikely versions of the Copenhagen interpretation, and versions of the Copenhagen interpretation that are practically identical to MWI.
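The elimination process described above can be sketched as plain Bayesian updating. This is only a toy: the two collapse variants, their equal share of the 2/3, and the 0.01 likelihoods are made-up placeholders, not claims about real physics or real experiments.

```python
def update(belief, likelihood):
    """One Bayesian update. `likelihood` maps hypothesis -> P(result | hypothesis);
    hypotheses not listed are assumed to predict the result with probability 1."""
    unnormalised = {h: p * likelihood.get(h, 1.0) for h, p in belief.items()}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}


# Hypothetical starting point: 1/3 on MWI, 2/3 split over two simple,
# testable collapse variants (names invented for illustration).
belief = {"MWI": 1 / 3, "collapse_A": 1 / 3, "collapse_B": 1 / 3}

# Each experiment is a device that would almost surely have worked if its
# target variant were true, and it failed (assumed likelihood 0.01).
belief = update(belief, {"collapse_A": 0.01})
belief = update(belief, {"collapse_B": 0.01})

for hypothesis, probability in belief.items():
    print(f"{hypothesis}: {probability:.3f}")
```

Starting with MWI as the minority view does no lasting harm here: once the testable alternatives fail their tests, most of the probability mass ends up on whatever survives.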
(Do I get a prize for saying “e.g.” so much?)
Yes. Here is an egg and an EEG.
The minimum description length formulation doesn’t allow for that at all. You are not allowed to pick whatever language you want; you have to pick the optimal code. If, in the most concise code possible, state ‘a’ has a smaller code than state ‘b’, then ‘a’ must be more probable than ‘b’, since the most concise codes possible assign the smallest codes to the most probable states.
So if you wanna know what state a system is in, and you have the ideal (or close to ideal) code for the states in that system, the probability of that state will be strongly inversely correlated with the length of the code for that state.
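A small sketch of that relationship, using a Huffman code as a stand-in for "the most concise code possible" over a known distribution. The four states and their probabilities are invented for illustration; with these power-of-two probabilities, the code length of each state comes out exactly equal to -log2 of its probability.

```python
import heapq
from math import log2


def huffman_code_lengths(probs):
    """Return {state: code length in bits} for a Huffman code over `probs`."""
    # Heap entries: (probability, unique tie-breaker, {state: depth so far}).
    heap = [(p, i, {state: 0}) for i, (state, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, depths1 = heapq.heappop(heap)
        p2, _, depths2 = heapq.heappop(heap)
        # Merging two subtrees pushes every state in them one level deeper.
        merged = {s: d + 1 for s, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


# Hypothetical distribution over four states.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)

for state, p in probs.items():
    print(state, p, lengths[state], -log2(p))
```

Note the direction of the construction: the code is built from the probabilities, so reading probabilities back off the code lengths only works if the distribution was known (or well estimated) in the first place.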
I haven’t read anything like this in my admittedly limited readings on Solomonoff induction. Disclaimer: I am a mere mathematician in a different field, and have read only a few papers surrounding Solomonoff.
The claims I’ve seen revolve around “assembly language” (for some value of assembly language) being sufficiently simple that any biases inherent in the language are small (some people claim constant multiple on the basis that this is what happens when you introduce a symbol ‘short-circuiting’ a computation). I think a more correct version of Anti-reductionist’s argument should run, “we currently do not know how the choice of language affects SI; it is conceivable that small changes in the base language imply fantastically different priors.”
I don’t know the answer to that, and I’d be very glad to know if someone has proved it. However, I think it’s rather unlikely that someone has proved it, because 1) I expect it will be disproven (on the basis that model-theoretic properties tend to be fragile), and 2) given the current difficulties in explicitly calculating SI, finding an explicit, non-trivial counter-example would probably be difficult.
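Note that
Choose a language that can describe MWI more easily than Copenhagen, and they say you should believe MWI; choose a language that can describe Copenhagen more easily than MWI, and they say you should believe Copenhagen.
is not such a counter-example, because we do not know if “sufficiently assembly-like” languages can be chosen which exhibit such a bias. I don’t think the above thought-experiment is worth pursuing, because I don’t think we even know a formal (on the level of assembly-like languages) description of either CI or MWI.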
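For reference, the “constant multiple” claim mentioned above is usually stated as the invariance theorem. In the notation assumed here, U and V are two universal prefix machines (the “languages”), K is prefix Kolmogorov complexity, and M is the corresponding Solomonoff prior:

```latex
K_U(x) \le K_V(x) + c_{U,V} \quad \text{for all strings } x,
\qquad \text{hence} \qquad
M_U(x) \ge 2^{-c_{U,V}} \, M_V(x).
```

The constant c_{U,V} is roughly the length of an interpreter for V written in U. It does not depend on x, but nothing bounds how large it is, so the theorem alone does not say whether two particular hypotheses keep the same ordering when the base language changes.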
Not Solomonoff; minimum description length. I’m coming from an information theory background, and I don’t know very much about Solomonoff induction.
OP is talking about Solomonoff priors, no? Is there a way to do inference with minimum description length?
What is OP?
EY
I meant Anti-reductionist, the person potato originally replied to… I suppose grandparent would have been more accurate.
He was talking about both.
So how do you predict with minimum description length?
With respect to the validity of reductionism, out of MML and SI, one theoretically predicts and the other does not. Obviously.
Aren’t you circularly basing your code on your probabilities but then taking your priors from the code?
Yep, but that’s all the proof shows: the more concise your code, the stronger the inverse correlation between the probability of a state and the code length of that state.