“Of course you are assuming a strong form of Bayesianism here. Why do we have to accept that strong form?”
Because it’s mathematically proven. You might as well ask “Why do we have to accept the strong form of arithmetic?”
“So, if some evidence slightly moves the expectation in a particular direction, but does not push it across the 50% line from wherever it started, what is the big whoop?”
Because (in this case especially!) small probabilities can have large consequences. If we invent a marvelous new cure for acne, with a 1% chance of death to the patient, it’s well below 50% and no specific person using the “medication” would expect to die, but no sane doctor would ever sanction such a “medication”.
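To put rough numbers on this, here is a minimal sketch of the expected-value arithmetic. The utilities (+1 for cured acne, 10,000 for the cost of a death) and the function name are made up for illustration; only the 1% figure comes from the example above.

```python
def expected_net_value(p_death, benefit_if_fine, cost_of_death):
    """Expected value of taking the drug: rare catastrophe vs. common small gain."""
    return (1 - p_death) * benefit_if_fine - p_death * cost_of_death

# 1% is the figure from the acne example; the utilities are assumed.
net = expected_net_value(p_death=0.01, benefit_if_fine=1.0, cost_of_death=10_000.0)
print(net)  # negative: the unlikely outcome dominates the decision
```

The probability never gets anywhere near 50%, yet under these assumed stakes it decides the question.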
“Why is 50% special here?”
People seem to have a little arrow in their heads saying whether they “believe in” or “don’t believe in” a proposition. If there are two possibilities, 50% is the point at which the little arrow goes from “not believe” to “believe”.
And if I am following you, this is irrational. Correct?
It’s based on premises that may or may not be accurate. Just because it’s mathematically proven, doesn’t mean it’s true.
More importantly, it’s physically proven. The fact that the math is consistent (and elegant!) would not be so powerful if it weren’t also true, particularly since Bayesianism implies some very surprising predictions.
Fortunately, it is the happy case that, to the best of my knowledge, no experiments thus far contradict Bayesianism, and not for the lack of trying, which is as much proof as physically possible.
Foundational issues like Bayesianism run into the old philosophy of science problems with a vengeance: which part of the total assortment of theory and observation do you choose to throw out? If someone proves a paradox in Bayesianism, do you shrug and start looking at alternatives, or do you ‘defy the evidence’ and patiently wait for an E.T. Jaynes to come along and explain how the paradox stems from taking an improper limit, failing to take prior information into account, etc.?
(I’ll adopt the seemingly rationalist trait of never taking questions as rhetorical, though both your questions strongly have that flavor).
A central part of the modern scientific method is due to Popper, who gave an essentially Bayesian answer to your first question. However, science wouldn’t fall apart if it turned out that priors aren’t a physical reality. Occam’s razor is non-Bayesian, and it alone accounts for a large portion of our scientific intuitions. The bottom line is that the scientific method doesn’t itself have to be true in order to be effective in discovering truths and discarding falsehoods.
The concept of “proving a paradox” is unclear to me (almost a paradox in itself...). Paradoxes are mirages. Also, it seems that you have some specific piece of scientific history in mind, but I’m uncertain which.
Luckily, we did have Jaynes and others to promote what I believe to be both a compelling mathematical framework and a physical reality. Before them, well, it would be wishful to think I could hold on to Bayesian ideas in the face of apparent paradoxes. The shoulders of giants etc.
Occam’s Razor is non-Bayesian? Correct me if I’m wrong, but I thought it falls naturally out of Bayesian model comparison, from the normalization factors, or “Occam factors.” As I remember, the argument is something like: given two models with independent parameters {A} and {A,B}, P(AB model) ∝ P(A and B are correct) and P(A model) ∝ P(A is correct). Then P(AB model) ≤ P(A model).
Even if the argument is wrong, I think the result ends up being that more plausible models tend to have fewer independent parameters.
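One way to write the inequality out, assuming “the AB model is correct” just means that both claims A and B hold (that is my reading of the argument, not the full Occam-factor derivation):

```latex
P(A \wedge B) \;=\; P(A)\,P(B \mid A) \;\le\; P(A),
\qquad \text{since } 0 \le P(B \mid A) \le 1 .
```

With independent parameters this reduces to P(A)·P(B) ≤ P(A), with equality only when P(B) = 1 or P(A) = 0.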
You’re not really wrong. The thing is that “Occam’s razor” is a conceptual principle, not one mathematically defined law. A certain (subjectively very appealing) formulation of it does follow from Bayesianism.
Your math is a bit off, but I understand what you mean. If we have two sets of models, with no prior information to discriminate between their members, then the prior gives less probability to each model in the larger set than in the smaller one.
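A toy version of the “larger set” statement, as a sketch only. It assumes the two sets split the prior mass evenly and that members within a set are indistinguishable, so each set’s mass is spread uniformly; the set sizes (2 and 8) and the function name are invented for illustration.

```python
from fractions import Fraction

def per_model_prior(set_mass, n_models):
    """Spread a set's prior mass evenly over its indistinguishable members."""
    return set_mass / n_models

# Assumption: each set of models gets half of the total prior mass.
small = per_model_prior(Fraction(1, 2), 2)  # each model in the 2-member set
large = per_model_prior(Fraction(1, 2), 8)  # each model in the 8-member set
print(small, large)
```

Under these assumptions every individual model in the larger set starts out less probable than any model in the smaller one.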
More generally, if deciding that model 1 is true gives you more information than deciding that model 2 is true, that means that the maximum entropy given model 1 is lower than that given model 2, which in turn means (under the maximum entropy principle) that model 1 was a priori less likely.
Anyway, this is all beside the point of the discussion that inspired my previous comment. My point was that even without Popper and Jaynes to enlighten us, science was making progress using other methods of rationality, among them a myriad of non-Bayesian interpretations of Occam’s razor.
Crystal clear. Sorry to distract from the point.
How does deciding one model is true give you more information? Did you mean “If a model allows you to make more predictions about future observations, then it is a priori less likely?”
Let’s assume a strong version of Bayesianism, which entails the maximum entropy principle. So our belief is the one that has the maximum entropy, among those consistent with our prior information. If we now add the information that some model is true, this generally invalidates our previous belief, making the new maximum-entropy belief one of lower entropy. The reduction in entropy is the amount of information you gain by learning the model. In a way, this is a cost we pay for “narrowing” our belief.
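A minimal numeric sketch of that entropy reduction, under assumptions that are mine rather than part of the argument: a toy world with 8 equally possible outcomes, and a hypothetical model that rules out all but 2 of them.

```python
import math

def entropy_bits(dist):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Assumed toy setup: 8 outcomes and no other prior information,
# so the maximum-entropy belief is uniform over all 8.
before = [1 / 8] * 8

# Hypothetical model: only 2 outcomes remain possible. The maximum-entropy
# belief consistent with that is uniform over the 2 survivors.
after = [1 / 2] * 2

h_before = entropy_bits(before)
h_after = entropy_bits(after)
info_gained = h_before - h_after  # reduction in entropy from learning the model
print(h_before, h_after, info_gained)
```

In this toy case the model had prior probability 2/8 under the uniform belief, and -log2(2/8) equals the entropy reduction, so the model that tells you more is the one that was less likely beforehand.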
The upside of it is that it tells us something useful about the future. Of course, not all information regarding the world is relevant for future observations. The part that doesn’t help control our anticipation is failing to pay rent, and should be evicted. The part that does inform us about the future may be useful enough to be worth the cost we pay in taking in new information.
I’ll expand on all of this in my sequence on reinforcement learning.
At what point does the decision “This is true” diverge from the observation “There is very strong evidence for this”, other than in cases where the model is accepted as true despite a lack of strong evidence?
I’m not discussing the case where a model goes from unknown to known. How does deciding to believe a model give you more information than knowing what the model is and the reasons for it? To better model an actual agent, one could replace all of the knowledge about why the model is true with a single value for the strength of the supporting knowledge.
How does deciding that things always fall down give you more information than observing things fall down?
I believe the idea was to ask “hypothetically, if I found out that this hypothesis was true, how much new information would that give me?”
You’ll have two or more hypotheses, and one of them is the one that would (hypothetically) give you the least amount of new information. The one that would give you the least amount of new information should be considered the “simplest” hypothesis. (assuming a certain definition of “simplest”, and a certain definition of “information”)
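Here is a rough sketch of that comparison under one particular definition of “information”: the surprisal, -log2 of the hypothesis’s prior probability. The hypothesis names and prior values are invented for illustration, and they need not sum to 1 if other hypotheses exist.

```python
import math

# Assumed priors for three hypothetical hypotheses (illustrative values only).
priors = {"H1": 0.5, "H2": 0.25, "H3": 0.125}

def surprisal_bits(p):
    """Information, in bits, gained on learning that an event of probability p is true."""
    return -math.log2(p)

info = {name: surprisal_bits(p) for name, p in priors.items()}

# Under this definition, the "simplest" hypothesis is the least surprising one.
simplest = min(info, key=info.get)
print(info, simplest)
```

With a different definition of “simplest” or of “information” the ranking could come out differently, which is the caveat in the parenthetical above.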