Philosophy and the practice of Bayesian statistics
Andrew Gelman, Cosma Rohilla Shalizi
(Submitted on 19 Jun 2010)
A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science.
Clarity about these matters should benefit not just philosophy of science, but also statistical practice. At best, the inductivist view has encouraged researchers to fit and compare models without checking them; at worst, theorists have actively discouraged practitioners from performing model checking because it does not fit into their framework.
I guess everyone here already understands this stuff, but I’ll still try to summarize why “model checking” is an argument against “naive Bayesians” like Eliezer’s OB persona. Shalizi has written about this at length on his blog and elsewhere, as has Gelman, but maybe I can make the argument a little clearer for novices.
Imagine you have a prior, then some data comes in, you update and obtain a posterior that overwhelmingly supports one hypothesis. The Bayesian is supposed to say “done” at this point. But we’re actually not done. We have only “used all the information available in the sample” in the Bayesian sense, but not in the colloquial sense!
See, after locating the hypothesis, we can run some simple statistical checks on the hypothesis and the data to see if our prior was wrong. For example, plot the data as a histogram, and plot the hypothesis as another histogram, and if there’s a lot of data and the two histograms are wildly different, we know almost for certain that the prior was wrong. As a responsible scientist, I’d do this kind of check. The catch is, a perfect Bayesian wouldn’t. The question is, why?
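The histogram check described here can be sketched in a few lines. This is a toy illustration, not anything from the paper: the data are (hypothetically) exponential, while the fitted model family wrongly assumes a Gaussian, and a simple tail statistic exposes the mismatch:

```python
import random
import statistics

random.seed(0)

# Toy data: actually drawn from an exponential distribution, while the
# model family (wrongly) assumes it is Gaussian.
data = [random.expovariate(1.0) for _ in range(10_000)]

# "Posterior" point estimate under the Gaussian model: fit mean and sd.
mu = statistics.fmean(data)
sigma = statistics.stdev(data)

# Simulate replicated data from the fitted model.
replicated = [random.gauss(mu, sigma) for _ in range(10_000)]

# Crude histogram-style check on one feature: a Gaussian fitted to
# exponential data puts substantial mass below zero; the data have none.
frac_neg_data = sum(x < 0 for x in data) / len(data)
frac_neg_rep = sum(x < 0 for x in replicated) / len(replicated)
print(frac_neg_data, frac_neg_rep)
```

With lots of data, the fitted Gaussian puts noticeable mass below zero while the data have none at all, which is exactly the "wildly different histograms" situation.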
But my sense is that the “substantial school in the philosophy of science [that] identifies Bayesian inference with inductive inference and even rationality as such”, as well as Eliezer’s OB persona, is talking more about a prior implicit in informal human reasoning than about anything that’s written down on paper. You can then see model checking as roughly comparing the parts of your prior that you wrote down to all the parts that you didn’t write down. Is that wrong?
I don’t think informal human reasoning corresponds to Bayesian inference with any prior. Maybe you mean “what informal human reasoning should be”. In that case I’d like a formal description of what it should be (ahem).
Gelman/Shalizi don’t seem to be arguing from the possibility that physics is noncomputable; they seem to think their argument (against Bayes as induction) works even under ordinary circumstances.
It seems to me that Wei Dai’s argument is flawed (and I may be overly arrogant in saying this; I haven’t even had breakfast this morning.)
He says that the probability of someone knowing the answer to an uncomputable problem would be evaluated at 0 originally. I don’t fundamentally see why a measure-zero hypothesis is equivalent to an impossible one. For example, the hypothesis “they’re making it up as they go along” has probability 2^(-S) based on the size of the set, and that probability shrinks at a certain rate as evidence arrives. That means that, given any finite amount of inference, the AI should be able to distinguish between two possibilities (they are very good at computing or guessing, versus all humans have been wrong about mathematics forever). Unless new evidence comes in to support one over the other, “humans have been wrong forever” should hold a consistent probability mass, which will grow in comparison to the other hypothesis, “they are making it up.”
Nobody seems to propose this (although I may have missed it skimming some of the replies) and it seems like a relatively simple thing (to me) to adjust the AI’s prior distribution to give “impossible” things low but nonzero probability.
Wei Dai’s argument was specifically against the Solomonoff prior, which assigns probability 0 to the existence of halting problem oracles. If you have an idea how to formulate another universal prior that would give such “impossible” things positive probability, but still sum to 1.0 over all hypotheses, then by all means let’s hear it.
Yeah well it is certainly a good argument against that. The title of the thread is “is induction unformalizable” which point I’m unconvinced of.
If I were to formalize some kind of prior, I would probably use a lot of epsilons (since zero is not a probability); including an epsilon for “things I haven’t thought up yet.” On the other hand I’m not really an expert on any of these things so I imagine Wei Dai would be able to poke holes in anything I came up with anyway.
There’s no general way to have a “none of the above” hypothesis as part of your prior, because it doesn’t make any specific prediction and thus you can’t update its likelihood as data comes in. See the discussion with Cyan and others about NOTA somewhere around here.
Well then I guess I would hypothesize that solving the problem of a universal prior is equivalent to solving the problem of NOTA. I don’t really know enough to get technical here. If your point is that it’s not a good idea to model humans as Bayesians, I agree. If your point is that it’s impossible, I’m unconvinced. Maybe after I finish reading Jaynes I’ll have a better idea of the formalisms involved.
See, after locating the hypothesis, we can run some simple statistical checks on the hypothesis and the data to see if our prior was wrong. For example, plot the data as a histogram, and plot the hypothesis as another histogram, and if there’s a lot of data and the two histograms are wildly different, we know almost for certain that the prior was wrong. As a responsible scientist, I’d do this kind of check. The catch is, a perfect Bayesian wouldn’t. The question is, why?
I thought that what I’m about to say is standard, but perhaps it isn’t.
Bayesian inference, depending on how thoroughly you do it, does include such a check. You construct a Bayes network (a directed acyclic graph) that connects beliefs with anticipated observations (or with intermediate beliefs), establishing marginal and conditional probabilities for the nodes. Since your expectations are jointly determined by the beliefs that lead up to them, getting a wrong answer will knock down the probabilities you assign to those beliefs.
Depending on the relative strengths of the connections, you know whether to reject your parameters, your model, or the validity of the observation. (Depending on how detailed the network is, one input belief might be “I’m hallucinating or insane”, which may survive with the highest probability.) This determination is based on which of them, after taking this hit, has the lowest probability.
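A toy version of the "which belief takes the hit" computation, flattening the network into a single discrete update (all priors and likelihoods here are made-up numbers, not from any source):

```python
# Three exclusive explanations for a surprising observation, with
# made-up priors and likelihoods (purely illustrative numbers).
explanations = {
    "parameters wrong":  (0.60, 0.05),  # (prior, P(surprise | explanation))
    "model wrong":       (0.30, 0.40),
    "observation wrong": (0.10, 0.90),
}

# Bayes: posterior is proportional to prior times likelihood.
unnorm = {h: prior * lik for h, (prior, lik) in explanations.items()}
z = sum(unnorm.values())
posterior = {h: w / z for h, w in unnorm.items()}

for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{h}: {p:.3f}")
```

A high-prior explanation that makes the surprise very unlikely can still lose to a lower-prior one that predicts it well, which is the "which node takes the hit" logic in miniature.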
Pearl also has written Bayesian algorithms for inferring conditional (in)dependencies from data, and therefore what kinds of models are capable of capturing a phenomenon. He furthermore has proposed causal networks, which have explicit causal and (oppositely) inferential directions. In that case, you don’t turn a prior into a posterior: rather, the odds you assign to an event at a node are determined by the “incoming” causal “message”, and, from the other direction, the incoming inferential message.
But neither “model checking” nor Bayesian methods will come up with hypotheses for you. Model checking can attenuate the odds you assign to wrong priors, but so can Bayesian updating. The catch is that, for computational reasons, a Bayesian might not be able to list all the possible hypotheses, and so might arbitrarily restrict the hypothesis space and potentially be left with only bad ones. But Bayesians aren’t alone in that, either.
(Please tell me if this sounds too True Believerish.)
I thought that what I’m about to say is standard, but perhaps it isn’t. [...] Pearl also has written Bayesian algorithms
I have been googling for references to “computational epistemology”, “algorithmic epistemology”, “bayesian algorithms” and “epistemic algorithm” on LessWrong, and (other than my article) this is the only reference I was able to find to things in the vague category of (i) proposing that the community work on writing real, practical epistemic algorithms (i.e. in software), (ii) announcing having written epistemic algorithms or (iii) explaining how precisely to perform any epistemic algorithm in particular. (A runner-up is this post which aspires to “focus on the ideal epistemic algorithm” but AFAICT doesn’t really describe an algorithm.)
Oh wow, thanks. I think at the time I was overconfident that some more educated Bayesian had worked through the details of what I was describing. But the causality-related stuff is definitely covered by Judea Pearl (the Pearl I was referring to) in his book *Causality* (2000).
This sounds like a confusion between a theoretical perfect Bayesian and practical approximations. The perfect Bayesian wouldn’t have any use for model checking because from the start it always considers every hypothesis it is capable of formulating, whereas the prior used by a human scientist won’t ever even come close to encoding all of their knowledge.
(A more “Bayesian” alternative to model checking is to have an explicit “none of the above” hypothesis as part of your prior.)
NOTA is not well-specified in the general case, but in at least one specific case it’s been done. Jaynes’s student Larry Bretthorst made a useable NOTA hypothesis in a simplified version of a radar target identification problem (link to a pdf of the doc).
(Somewhat bizarrely, the same sort of approach could probably be made to work in certain problems in proteomics in which the data-generating process shares the key features of the data-generating process in Bretthorst’s simplified problem.)
If I’m not mistaken, such problems would contain some enumerated hypotheses—point peaks in a well-defined parameter space—and the NOTA hypothesis would be a uniformly thin layer over the rest of that space. Can’t tell what key features the data-generating process must have, though. Or am I failing reading comprehension again?
If I’m not mistaken, such problems would contain some enumerated hypotheses—point peaks in a well-defined parameter space—and the NOTA hypothesis would be a uniformly thin layer over the rest of that space
Yep.
Can’t tell what key features the data-generating process must have, though.
I think the key features that make the NOTA hypothesis feasible are (i) all possible hypotheses generate signals of a known form (but with free parameters), and (ii) although the space of all possible hypotheses is too large to enumerate, we have a partial library of “interesting” hypotheses of particularly high prior probability for which the generated signals are known even more specifically than in the general case.
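A sketch of how such a NOTA hypothesis might work, loosely in the spirit of the setup described above but with entirely invented numbers: a couple of "library" hypotheses are point peaks in a parameter space, and NOTA is a flat layer over the whole range:

```python
import math

SIGMA = 0.5            # assumed measurement noise
LOW, HIGH = 0.0, 10.0  # assumed parameter range
LIBRARY = {"target A": 2.0, "target B": 7.5}  # invented "interesting" peaks
PRIORS = {"target A": 0.45, "target B": 0.45, "NOTA": 0.10}

def gauss(y, theta):
    z = (y - theta) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2 * math.pi))

def posterior(y, n=20_000):
    lik = {name: gauss(y, theta) for name, theta in LIBRARY.items()}
    # NOTA's likelihood: average over a flat layer on the whole range.
    step = (HIGH - LOW) / n
    lik["NOTA"] = sum(gauss(y, LOW + (i + 0.5) * step) for i in range(n)) / n
    unnorm = {h: PRIORS[h] * lik[h] for h in PRIORS}
    z = sum(unnorm.values())
    return {h: w / z for h, w in unnorm.items()}

print(posterior(2.1))  # near a library peak: that peak dominates
print(posterior(5.0))  # far from every peak: NOTA soaks up the posterior
```

When the observation lands near a library peak, that peak dominates; when it lands far from every peak, the flat NOTA layer wins, which is how it "predicts" anything the library doesn't.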
See, after locating the hypothesis, we can run some simple statistical checks on the hypothesis and the data to see if our prior was wrong. For example, plot the data as a histogram, and plot the hypothesis as another histogram, and if there’s a lot of data and the two histograms are wildly different, we know almost for certain that the prior was wrong. As a responsible scientist, I’d do this kind of check. The catch is, a perfect Bayesian wouldn’t. The question is, why?
Model checking is completely compatible with “perfect Bayesianism.” In the practice of Bayesian statistics, how often is the prior distribution you use exactly the same as your actual prior distribution? The answer is never. Really, do you think your actual prior follows a gamma distribution exactly? The prior distribution you use in the computation is a model of your actual prior distribution. It’s a map of your current map. With this in mind, model checking is an extremely handy way to make sure that your model of your prior is reasonable.
However, a difference between the data and a simulation from your model doesn’t necessarily mean that you have an unreasonable model of your prior. You could just have really wrong priors. So you have to think about what’s going on to be sure. This does somewhat limit the role of model checking relative to what Gelman is pushing.
With this in mind, model checking is an extremely handy way to make sure that your model of your prior is reasonable.
You shouldn’t need real-world data to determine if your model of your own prior was reasonable or not. Something else is going on here. Model checking uses the data to figure out if your prior was reasonable, which is a reasonable but non-Bayesian idea.
Well, if you’re just checking your prior, then I suppose you don’t need real data at all. Make up some numbers and see what happens. What you’re really checking (if you’re being a Bayesian about it, i.e. not like Gelman and company) is not whether your data could come from a model with that prior, but rather whether the properties of the prior you chose seem to match up with the prior you’re modeling. For example, maybe the prior you chose forces two parameters, a and b, to be independent no matter what the data say. In reality, though, you think it’s perfectly reasonable for there to be some association between those two parameters. If you don’t already know that your prior is deficient in this way, posterior predictive checking can pick it up.
In reality, you’re usually checking both your prior and the other parts of your model at the same time, so you might as well use your data, but I could see using different fake data sets in order to check your prior in different ways.
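Simulating from the written-down prior, as suggested above, makes this kind of deficiency visible without any real data. A minimal sketch, assuming (hypothetically) independent standard-normal priors on a and b:

```python
import random
import statistics

random.seed(1)

# Draws of (a, b) from the written-down prior: independent by construction.
a_vals = [random.gauss(0, 1) for _ in range(5_000)]
b_vals = [random.gauss(0, 1) for _ in range(5_000)]

# Sample correlation between a and b across prior draws.
ma, mb = statistics.fmean(a_vals), statistics.fmean(b_vals)
cov = sum((a - ma) * (b - mb) for a, b in zip(a_vals, b_vals)) / len(a_vals)
corr = cov / (statistics.pstdev(a_vals) * statistics.pstdev(b_vals))
print(corr)  # near zero: this prior cannot express any a-b association
```

If your actual beliefs say a and b should be associated a priori, this prior family simply can't express that, and the fake-data check reveals it before you touch real data.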
Apologies if this has already been covered elsewhere, but isn’t a prior just a belief? The prior is by definition whatever it was rational to believe before the acquisition of new evidence (assuming a perfect Bayesian, anyway). I’m not quite sure what you mean when you propose that a prior could be wrong; either all priors are statements of belief and therefore true, or all priors are statements of probability that must be less accurate than a posterior that incorporates more evidence.
I suspect that there are additional steps I’m not considering.
The prior is by definition whatever it was rational to believe before the acquisition of new evidence (assuming a perfect Bayesian, anyway).
Nope, this isn’t part of the definition of the prior, and I don’t see how it could be. The prior is whatever you actually believe before any evidence comes in.
If you have a procedure to determine which priors are “rational” before looking at the evidence, please share it with us. Some people here believe religiously in maxent, others swear by the universal prior, I personally rather like reference priors, but the Bayesian apparatus doesn’t really give us a means of determining the “best” among those. I wrote about these topics here before. If you want the one-word summary, the area is a mess.
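A concrete illustration of why "the area is a mess": for a Bernoulli parameter, several commonly defended default priors are all Beta distributions (so updating is just conjugate arithmetic), and they give different answers from the same data:

```python
# Conjugate updating for a Bernoulli parameter: a Beta(a, b) prior plus
# k successes in n trials gives a Beta(a + k, b + n - k) posterior,
# whose mean is (a + k) / (a + b + n).
def posterior_mean(a, b, k, n):
    return (a + k) / (a + b + n)

k, n = 2, 10  # two successes in ten trials (made-up data)

candidates = {
    "uniform / maxent, Beta(1, 1)": (1.0, 1.0),
    "Jeffreys, Beta(1/2, 1/2)": (0.5, 0.5),
    "near-Haldane, Beta(0.01, 0.01)": (0.01, 0.01),
}
for name, (a, b) in candidates.items():
    print(f"{name}: {posterior_mean(a, b, k, n):.3f}")
```

Nothing in the Bayesian apparatus itself tells you which of these three posterior means to report.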
I want to believe that there is some optimal general prior, but it seems much more likely that we do not live in so convenient a world.
But if you can evaluate how good a prior is, then there has to be an optimal one (or several). You have to have something as your prior, and so whichever one is the best out of those you can choose is the one you should have. As for how certain you are that it’s the best, it’s (to some extent) turtles all the way down.
Instead of using “optimal general prior”, I should have said that I was pessimistic about the existence of a standard for evaluating priors (or, more properly, prior probability distributions) that is optimal in all circumstances, if that’s any clearer.
Having thought about the problem some more, though, I think my pessimism may have been premature.
A prior probability distribution is nothing more than a weighted set of hypotheses. A perfect Bayesian would consider every possible hypothesis, which is impossible unless hypotheses are countable, and they aren’t; the ideal for Bayesian reasoning as I understand it is thus unattainable, but this doesn’t mean that there are no benefits to be found in moving toward that ideal.
So, perfect Bayesian or not, we have some set of hypotheses which need to be located before we can consider them and assign them a probabilistic weight. Before we acquire any rational evidence at all, there is necessarily only one factor that we can use to distinguish between hypotheses: how hard they are to locate. If it is also true that hypotheses which are easier to locate make more predictions, and that hypotheses which make more predictions are more useful (and while I have not seen proofs of these propositions, I’m inclined to suspect that they exist), then we are perfectly justified in assigning a probability to a hypothesis based on its locate-ability.
This reduces the problem of prior probability evaluation to the problem of locate-ability evaluation, to which it seems maxent and its fellows are proposed answers. It’s again possible there is no objectively best way to evaluate locate-ability, but I don’t yet see a reason for this to be so.
Again, if I’ve mis-thought or failed to justify a step in my reasoning, please call me on it.
If it is also true that hypotheses which are easier to locate make more predictions
This doesn’t sound right to me. Imagine you’re tossing a coin repeatedly. Hypothesis 1 says the coin is fair. Hypothesis 2 says the coin repeats the sequence HTTTHHTHTHTTTT over and over in a loop. The second hypothesis is harder to locate, but makes a stronger prediction.
The proper formalization for your concept of locate-ability is the Solomonoff prior. Unfortunately we can’t do inference based on it because it’s uncomputable.
Maxent and friends aren’t motivated by a desire to formalize locate-ability. Maxent is the “most uniform” distribution on a space of hypotheses; the “Jeffreys rule” is a means of constructing priors that are invariant under reparameterizations of the space of hypotheses; “matching priors” give you frequentist coverage guarantees, and so on.
Please don’t take my words for gospel just because I sound knowledgeable! At this point I recommend you to actually study the math and come to your own conclusions. Maybe contact user Cyan, he’s a professional statistician who inspired me to learn this stuff. IMO, discussing Bayesianism as some kind of philosophical system without digging into the math is counterproductive, though people around here do that a lot.
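The coin example can be made concrete. The looping hypothesis costs more bits to write down, but when the data happen to fit it, its likelihood overwhelms the fair coin's (a toy sketch, with the deterministic hypothesis assigning probability 1 to the one matching sequence):

```python
PATTERN = "HTTTHHTHTHTTTT"

def lik_fair(seq):
    # A fair coin assigns probability (1/2)^n to every length-n sequence.
    return 0.5 ** len(seq)

def lik_repeat(seq):
    # The deterministic hypothesis assigns probability 1 to the one
    # sequence that follows the repeating pattern, 0 to everything else.
    ok = all(c == PATTERN[i % len(PATTERN)] for i, c in enumerate(seq))
    return 1.0 if ok else 0.0

seq = PATTERN * 2  # 28 flips that happen to follow the pattern
bayes_factor = lik_repeat(seq) / lik_fair(seq)
print(bayes_factor)  # 2**28 in favor of the repeating hypothesis

# Flip the last coin and the deterministic hypothesis dies outright.
broken = seq[:-1] + ("H" if seq[-1] == "T" else "T")
print(lik_repeat(broken))  # 0.0
```

This is the sense in which the harder-to-locate hypothesis makes the "stronger" prediction: it concentrates all its probability on a single outcome.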
I’m in the process of digging into the math, so hopefully some point soon I’ll be able to back up my suspicions in a more rigorous way.
This doesn’t sound right to me. Imagine you’re tossing a coin repeatedly. Hypothesis 1 says the coin is fair. Hypothesis 2 says the coin repeats the sequence HTTTHHTHTHTTTT over and over in a loop. The second hypothesis is harder to locate, but makes a stronger prediction.
I was talking about the number of predictions, not their strength. So Hypothesis 1 predicts any sequence of coin-flips that converges on 50%, and Hypothesis 2 predicts only sequences that repeat HTTTHHTHTHTTTT. Hypothesis 1 explains many more possible worlds than Hypothesis 2, and so without evidence as to which world we inhabit, Hypothesis 1 is much more likely.
Since I’ve already conceded that being a Perfect Bayesian is impossible, I’m not surprised to hear that measuring locate-ability is likewise impossible (especially because the one reduces to the other). It just means that we should determine prior probabilities by approximating the Solomonoff prior as best we can.
Thanks for taking the time to comment, by the way.
Then let’s try this. Hypothesis 1 says the sequence will consist of only H repeated forever. Hypothesis 2 says the sequence will be either HTTTHHTHTHTTTT repeated forever, or TTHTHTTTHTHHHHH repeated forever. The second one is harder to locate, but describes two possible worlds rather than one.
Maybe your idea can be fixed somehow, but I see no way yet. Keep digging.
I’ve just reread Eliezer’s post on Occam’s Razor and it seems to have clarified my thinking a little.
I originally said:
If it is also true that hypotheses which are easier to locate make more predictions… then we are perfectly justified in assigning a probability to a hypothesis based on its locate-ability.
But I would now say:
If it is also true that hypotheses with a shorter minimum message length make more predictions relative to that minimum message length than do hypotheses with longer MMLs… then we are perfectly justified in assigning a probability to a hypothesis based on MML.
This solves the problem your counterexample presents: Hypothesis 1 describes only one possible world, but Hypothesis 2 requires, say, ~30 more bits of information (for those particular strings of results, plus a disjunction) to describe only two possible worlds, making it 2^30 / 2 times less likely.
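The arithmetic being proposed here, transcribed literally (whether this weighting is the right one is exactly what the subsequent replies dispute):

```python
# Weight each hypothesis by 2^(-extra description bits) and credit it
# for each possible world it describes, exactly as the comment computes.
def weight(extra_bits, n_worlds):
    return n_worlds * 2.0 ** (-extra_bits)

h1 = weight(0, 1)    # Hypothesis 1: minimal description, one world
h2 = weight(30, 2)   # Hypothesis 2: ~30 extra bits, two worlds

print(h1 / h2)  # 2**30 / 2 = 2**29
```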
Then let’s try this. Hypothesis 1 says the sequence will consist of only H repeated forever. Hypothesis 2 says the sequence will be a pattern like HTTTHHTHTHTTTT repeated forever, where some of the positions are wildcards that can take different values on each repetition. The second hypothesis is harder to locate but describes an infinite number of possible worlds :-)
The problem with this counterexample is that you can’t actually repeat something forever.
Even taking the case where we repeat each sequence 1000 times, which seems like it should be similar, you’ll end up with 1000 coin flips and 15000 coin flips for Hypothesis 1 and Hypothesis 2, respectively. So the odds of being in a world where Hypothesis 1 is true are 1 in 2^1000, but the odds of being in a world where Hypothesis 2 is true are 1 in 2^15000.
It’s an apples to balloons comparison, basically.
(I spent about twenty minutes staring at an empty comment box and sweating blood before I figured this out, for the record.)
I think this is still wrong. Take the finite case where both hypotheses are used to explain sequences of a billion throws. Then the first hypothesis describes one world, and the second one describes an exponentially huge number of worlds. You seem to think that the length of the sequence should depend on the length of the hypothesis, and I don’t understand why.
It’s again possible there is no objectively best way
I’m not sure I’m willing to grant that it’s impossible in principle. Presumably, you need to find some way of choosing your priors, and some time later you can check your calibration, and you can then evaluate the effectiveness of one method versus another.
If there’s any way to determine whether you’ve won bets in a series, then it’s possible to rank methods for choosing the correct bet. And that general principle can continue all the way down. And if there isn’t any way of determining whether you’ve won, then I’d wonder if you’re talking about anything at all (weird thought experiments aside).
we can run some simple statistical checks on the hypothesis and the data to see if our prior was wrong. For example, plot the data as a histogram, and plot the hypothesis as another histogram, and if there’s a lot of data and the two histograms are wildly different, we know almost for certain that the prior was wrong.
That check should be part of updating your prior. If you updated and got a hypothesis that didn’t fit the data, you didn’t update very well. You need to take this into account when you’re updating (and you also need to take into account the possibility of experimental error: there’s a small chance the data are wrong).
Hopefully the Book Club will get around to covering that as part of Chapter 4.
I can’t recall that it has anything to do with “updating your prior”; Jaynes just says that if you get nonsense posterior probabilities, you need to go back and include additional hypotheses in the set you’re considering, and this changes the analysis.
See also the quote (I can’t be bothered to find it now but I posted it a while ago to a quotes thread) where Jaynes says probability theory doesn’t do the job of thinking up hypotheses for you.
http://arxiv.org/abs/1006.3868
I’d like a formal description of what it should be (ahem).
Solomonoff induction, mebbe?
Wei Dai thought up a counterexample to that :-)
Pearl also has written Bayesian algorithms for inferring conditional (in)dependencies from data [...]
Who is “Pearl”?
Oh wow, thanks. I think at the time I was overconfident that some more educated Bayesian had worked through the details of what I was describing. But the causality-related stuff is definitely covered by Judea Pearl (the Pearl I was referring to) in his book *Causality* (2000).
This sounds like a confusion between a theoretical perfect Bayesian and practical approximations. The perfect Bayesian wouldn’t have any use for model checking because from the start it always considers every hypothesis it is capable of formulating, whereas the prior used by a human scientist won’t ever even come close to encoding all of their knowledge.
(A more “Bayesian” alternative to model checking is to have an explicit “none of the above” hypothesis as part of your prior.)
NOTA is addressed in the paper as inadequate. What does it predict?
See here.
I don’t see how that’s possible. How do you compute the likelihood of the NOTA hypothesis given the data?
NOTA is not well-specified in the general case, but in at least one specific case it’s been done. Jaynes’s student Larry Bretthorst made a usable NOTA hypothesis in a simplified version of a radar target identification problem (link to a pdf of the doc).
(Somewhat bizarrely, the same sort of approach could probably be made to work in certain problems in proteomics in which the data-generating process shares the key features of the data-generating process in Bretthorst’s simplified problem.)
If I’m not mistaken, such problems would contain some enumerated hypotheses—point peaks in a well-defined parameter space—and the NOTA hypothesis would be a uniformly thin layer over the rest of that space. Can’t tell what key features the data-generating process must have, though. Or am I failing reading comprehension again?
Yep.
I think the key features that make the NOTA hypothesis feasible are (i) all possible hypotheses generate signals of a known form (but with free parameters), and (ii) although the space of all possible hypotheses is too large to enumerate, we have a partial library of “interesting” hypotheses of particularly high prior probability for which the generated signals are known even more specifically than in the general case.
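A minimal numerical sketch of that setup (my own toy construction, not Bretthorst’s radar model): the “interesting” library hypotheses fully specify the signal, while NOTA shares the known signal form but leaves a free parameter, which gets marginalized over a broad prior and therefore pays an Occam factor:

```python
import math

def gauss_loglik(data, mu, sigma=1.0):
    # Log-likelihood of data under "constant level mu plus Gaussian noise".
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi)) for x in data)

data = [4.9, 5.2, 4.8, 5.1, 5.0]  # made-up observations

# Library of "interesting" point hypotheses: signal fully specified.
lik_h0 = math.exp(gauss_loglik(data, 0.0))
lik_h5 = math.exp(gauss_loglik(data, 5.0))

# NOTA: same signal form, but the level is a free parameter, marginalized
# over a broad uniform prior on [-20, 20] (midpoint-rule integration).
lo, hi, n = -20.0, 20.0, 4000
step = (hi - lo) / n
lik_nota = sum(math.exp(gauss_loglik(data, lo + (i + 0.5) * step))
               for i in range(n)) * step / (hi - lo)

priors = {"mu=0": 0.45, "mu=5": 0.45, "NOTA": 0.10}
unnorm = {"mu=0": priors["mu=0"] * lik_h0,
          "mu=5": priors["mu=5"] * lik_h5,
          "NOTA": priors["NOTA"] * lik_nota}
z = sum(unnorm.values())
posterior = {h: v / z for h, v in unnorm.items()}
```

With data near 5, the matching library hypothesis dominates; NOTA beats the wrong point hypothesis but is penalized for spreading its prediction over the whole parameter range.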
Model checking is completely compatible with “perfect Bayesianism.” In the practice of Bayesian statistics, how often is the prior distribution you use exactly the same as your actual prior distribution? The answer is never. Really, do you think your actual prior follows a gamma distribution exactly? The prior distribution you use in the computation is a model of your actual prior distribution. It’s a map of your current map. With this in mind, model checking is an extremely handy way to make sure that your model of your prior is reasonable.
However, a difference between the data and a simulation from your model doesn’t necessarily mean that you have an unreasonable model of your prior. You could just have really wrong priors. So you have to think about what’s going on to be sure. This does somewhat limit the role of model checking relative to what Gelman is pushing.
You shouldn’t need real-world data to determine if your model of your own prior was reasonable or not. Something else is going on here. Model checking uses the data to figure out if your prior was reasonable, which is a reasonable but non-Bayesian idea.
Well, if you’re just checking your prior, then I suppose you don’t need real data at all. Make up some numbers and see what happens. What you’re really checking (if you’re being a Bayesian about it, i.e. not like Gelman and company) is not whether your data could come from a model with that prior, but rather whether the properties of the prior you chose seem to match up with the prior you’re modeling. For example, maybe the prior you chose forces two parameters, a and b, to be independent no matter what the data say. In reality, though, you think it’s perfectly reasonable for there to be some association between those two parameters. If you don’t already know that your prior is deficient in this way, posterior predictive checking can pick it up.
In reality, you’re usually checking both your prior and the other parts of your model at the same time, so you might as well use your data, but I could see using different fake data sets in order to check your prior in different ways.
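As a concrete illustration of posterior predictive checking, here’s a deliberately crude sketch: I plug in point estimates instead of drawing from a real posterior, fit a Normal model to skewed data, and compare a test statistic on replicated data sets. All the setup is invented for illustration:

```python
import random
import statistics

random.seed(0)

# Made-up "observed" data: actually exponential, i.e. skewed.
observed = [random.expovariate(1.0) for _ in range(100)]

# Model under check: Normal, with parameters fit to the data
# (a crude stand-in for drawing parameters from a real posterior).
mu = statistics.mean(observed)
sd = statistics.stdev(observed)

def stat(xs):
    # Test statistic: the sample maximum (sensitive to the heavy right tail).
    return max(xs)

t_obs = stat(observed)

# Simulate replicated data sets from the fitted model and compare.
reps = [stat([random.gauss(mu, sd) for _ in range(len(observed))])
        for _ in range(500)]
p_value = sum(r >= t_obs for r in reps) / len(reps)
# A posterior predictive p-value near 0 or 1 signals model-data mismatch:
# here the exponential data's maximum is larger than the Normal model expects.
```

The check flags the mismatch without telling you *which* part of the model (likelihood or prior) is at fault, which is exactly the ambiguity discussed above.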
Apologies if this has already been covered elsewhere, but isn’t a prior just a belief? The prior is by definition whatever it was rational to believe before the acquisition of new evidence (assuming a perfect Bayesian, anyway). I’m not quite sure what you mean when you propose that a prior could be wrong; either all priors are statements of belief and therefore true, or all priors are statements of probability that must be less accurate than a posterior that incorporates more evidence.
I suspect that there are additional steps I’m not considering.
Nope, this isn’t part of the definition of the prior, and I don’t see how it could be. The prior is whatever you actually believe before any evidence comes in.
If you have a procedure to determine which priors are “rational” before looking at the evidence, please share it with us. Some people here believe religiously in maxent, others swear by the universal prior, I personally rather like reference priors, but the Bayesian apparatus doesn’t really give us a means of determining the “best” among those. I wrote about these topics here before. If you want the one-word summary, the area is a mess.
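A tiny worked example of how the candidate “rational” priors disagree: for a coin’s bias, maxent on [0,1] gives the uniform prior Beta(1,1), while the Jeffreys prior is Beta(1/2, 1/2), and the two yield different posterior means from the same data (the flip counts below are made up):

```python
# Posterior mean of a Beta(a, b) prior after observing `heads` in `flips`
# Bernoulli trials is (a + heads) / (a + b + flips).
def posterior_mean(a, b, heads, flips):
    return (a + heads) / (a + b + flips)

heads, flips = 3, 10
uniform_est  = posterior_mean(1.0, 1.0, heads, flips)  # maxent / Beta(1,1)
jeffreys_est = posterior_mean(0.5, 0.5, heads, flips)  # Jeffreys / Beta(.5,.5)
# The estimates differ: 4/12 vs 3.5/11. Neither is "the" Bayesian answer.
```

Both priors have principled motivations, and the Bayesian apparatus alone doesn’t adjudicate between them; that’s the mess referred to above.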
Thanks for the links (and your post!), I now have a much clearer idea of the depths of my ignorance on this topic.
I want to believe that there is some optimal general prior, but it seems much more likely that we do not live in so convenient a world.
But if you can evaluate how good a prior is, then there has to be an optimal one (or several). You have to have something as your prior, and so whichever one is the best out of those you can choose is the one you should have. As for how certain you are that it’s the best, it’s (to some extent) turtles all the way down.
Instead of using “optimal general prior”, I should have said that I was pessimistic about the existence of a standard for evaluating priors (or, more properly, prior probability distributions) that is optimal in all circumstances, if that’s any clearer.
Having thought about the problem some more, though, I think my pessimism may have been premature.
A prior probability distribution is nothing more than a weighted set of hypotheses. A perfect Bayesian would consider every possible hypothesis, which is impossible unless hypotheses are countable, and they aren’t; the ideal for Bayesian reasoning as I understand it is thus unattainable, but this doesn’t mean that there are no benefits to be found in moving toward that ideal.
So, perfect Bayesian or not, we have some set of hypotheses which need to be located before we can consider them and assign them a probabilistic weight. Before we acquire any rational evidence at all, there is necessarily only one factor that we can use to distinguish between hypotheses: how hard they are to locate. If it is also true that hypotheses which are easier to locate make more predictions and that hypotheses which make more predictions are more useful (and while I have not seen proofs of these propositions I’m inclined to suspect that they exist), then we are perfectly justified in assigning a probability to a hypothesis based on its locate-ability.
This reduces the problem of prior probability evaluation to the problem of locate-ability evaluation, to which it seems maxent and its fellows are proposed answers. It’s again possible there is no objectively best way to evaluate locate-ability, but I don’t yet see a reason for this to be so.
Again, if I’ve mis-thought or failed to justify a step in my reasoning, please call me on it.
This doesn’t sound right to me. Imagine you’re tossing a coin repeatedly. Hypothesis 1 says the coin is fair. Hypothesis 2 says the coin repeats the sequence HTTTHHTHTHTTTT over and over in a loop. The second hypothesis is harder to locate, but makes a stronger prediction.
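The point can be made numerical. In this sketch (the prior odds are purely illustrative), the loop hypothesis starts with a much smaller prior because it is harder to locate, but its stronger prediction means each matching flip multiplies its relative support by 2:

```python
# Fair coin vs. the deterministic loop hypothesis from the comment above.
pattern = "HTTTHHTHTHTTTT"
data = pattern * 3  # 42 flips that happen to follow the loop

# Likelihoods: a fair coin assigns every 42-flip sequence probability 2^-42;
# the loop hypothesis assigns probability 1 to a matching sequence, 0 otherwise.
p_fair = 0.5 ** len(data)
p_loop = 1.0 if all(c == pattern[i % len(pattern)]
                    for i, c in enumerate(data)) else 0.0

prior_fair, prior_loop = 0.999, 0.001  # loop is harder to locate (illustrative)
post_fair = prior_fair * p_fair
post_loop = prior_loop * p_loop
z = post_fair + post_loop
post_fair, post_loop = post_fair / z, post_loop / z
# Despite its tiny prior, the loop hypothesis dominates after 42 matching flips.
```

So “harder to locate” and “stronger prediction” pull in opposite directions, and the data decide between them; locate-ability alone doesn’t settle it.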
The proper formalization for your concept of locate-ability is the Solomonoff prior. Unfortunately we can’t do inference based on it because it’s uncomputable.
Maxent and friends aren’t motivated by a desire to formalize locate-ability. Maxent is the “most uniform” distribution on a space of hypotheses; the “Jeffreys rule” is a means of constructing priors that are invariant under reparameterizations of the space of hypotheses; “matching priors” give you frequentist coverage guarantees, and so on.
Please don’t take my words for gospel just because I sound knowledgeable! At this point I recommend you to actually study the math and come to your own conclusions. Maybe contact user Cyan, he’s a professional statistician who inspired me to learn this stuff. IMO, discussing Bayesianism as some kind of philosophical system without digging into the math is counterproductive, though people around here do that a lot.
I’m in the process of digging into the math, so hopefully some point soon I’ll be able to back up my suspicions in a more rigorous way.
I was talking about the number of predictions, not their strength. So Hypothesis 1 predicts any sequence of coin-flips that converges on 50%, and Hypothesis 2 predicts only sequences that repeat HTTTHHTHTHTTTT. Hypothesis 1 explains many more possible worlds than Hypothesis 2, and so without evidence as to which world we inhabit, Hypothesis 1 is much more likely.
Since I’ve already conceded that being a Perfect Bayesian is impossible, I’m not surprised to hear that measuring locate-ability is likewise impossible (especially because the one reduces to the other). It just means that we should determine prior probabilities by approximating Solomonoff complexity as best we can.
Thanks for taking the time to comment, by the way.
Then let’s try this. Hypothesis 1 says the sequence will consist of only H repeated forever. Hypothesis 2 says the sequence will be either HTTTHHTHTHTTTT repeated forever, or TTHTHTTTHTHHHHH repeated forever. The second one is harder to locate, but describes two possible worlds rather than one.
Maybe your idea can be fixed somehow, but I see no way yet. Keep digging.
I’ve just reread Eliezer’s post on Occam’s Razor and it seems to have clarified my thinking a little.
I originally said:
But I would now say:
This solves the problem your counterexample presents: Hypothesis 1 describes only one possible world, but Hypothesis 2 requires, say, ~30 more bits of information (for those particular strings of results, plus a disjunction) to describe only two possible worlds, making it 2^30 / 2 times less likely.
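The arithmetic here can be written out explicitly, using the comment’s own bookkeeping (a 2^-bits penalty for description length, times a credit for the number of worlds described; the bit counts are illustrative):

```python
# Relative prior weight under the comment's bookkeeping: each extra bit of
# description halves the weight, each extra described world doubles it.
def relative_weight(extra_bits, n_worlds):
    return 2.0 ** (-extra_bits) * n_worlds

# Hypothesis 1: baseline description, one world.
# Hypothesis 2: ~30 more bits, two worlds.
ratio = relative_weight(0, 1) / relative_weight(30, 2)  # = 2**30 / 2
```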
Then let’s try this. Hypothesis 1 says the sequence will consist of only H repeated forever. Hypothesis 2 says the sequence will be HTTTHHTHTHTTT* repeated forever, where the * can take different values on each repetition. The second hypothesis is harder to locate but describes an infinite number of possible worlds :-)
If at first you don’t succeed, try, try again!
The problem with this counterexample is that you can’t actually repeat something forever.
Even taking the case where we repeat each sequence 1000 times, which seems like it should be similar, you’ll end up with 1000 coin flips and 15000 coin flips for Hypothesis 1 and Hypothesis 2, respectively. So the odds of being in a world where Hypothesis 1 is true are 1 in 2^1000, but the odds of being in a world where Hypothesis 2 is true are 1 in 2^15000.
It’s an apples to balloons comparison, basically.
(I spent about twenty minutes staring at an empty comment box and sweating blood before I figured this out, for the record.)
I think this is still wrong. Take the finite case where both hypotheses are used to explain sequences of a billion throws. Then the first hypothesis describes one world, and the second one describes an exponentially huge number of worlds. You seem to think that the length of the sequence should depend on the length of the hypothesis, and I don’t understand why.
That is an awesome counter-example, thank you. I think I may wait to ponder this further until I have a better grasp of the math involved.
I’m not sure I’m willing to grant that’s impossible in principle. Presumably, you need to find some way of choosing your priors, and some time later you can check your calibration, and you can then evaluate the effectiveness of one method versus another.
If there’s any way to determine whether you’ve won bets in a series, then it’s possible to rank methods for choosing the correct bet. And that general principle can continue all the way down. And if there isn’t any way of determining whether you’ve won, then I’d wonder if you’re talking about anything at all (weird thought experiments aside).
That check should be part of updating your prior. If you updated and got a hypothesis that didn’t fit the data, you didn’t update very well. You need to take this into account when you’re updating (and you also need to take into account the possibility of experimental error: there’s a small chance the data are wrong).
Hopefully the Book Club will get around to covering that as part of Chapter 4.
I can’t recall that it has anything to do with “updating your prior”; Jaynes just says that if you get nonsense posterior probabilities, you need to go back and include additional hypotheses in the set you’re considering, and this changes the analysis.
See also the quote (I can’t be bothered to find it now but I posted it a while ago to a quotes thread) where Jaynes says probability theory doesn’t do the job of thinking up hypotheses for you.