p-values are good actually
It is fashionable, on LessWrong and also everywhere else, to advocate for a transition away from p-values. p-values have many known issues. p-hacking is possible and difficult to prevent, testing one hypothesis at a time cannot even in principle be correct, et cetera. I should mention here, because I will not mention it again, that these critiques are correct and very important—people are not wrong to notice these problems and I don’t intend to dismiss them. Furthermore, it’s true that a perfect reasoner is a Bayesian reasoner, so why would we ever use an evaluative approach in science that can’t be extended into an ideal reasoning pattern?
Consider the following scenario: the Bad Chemicals Company sells a product called Dangerous Pesticide, which contains compounds which have recently been discovered to cause chronic halitosis. Alice and Bob want to know whether BCC knew about the dangers their product poses in advance of this public revelation. As a result of a lawsuit, several internal documents from BCC have been made public.
Alice thinks there’s a 30% chance that BCC knew about the halitosis problem in advance, whereas Bob thinks there’s a 90% chance. Both Alice and Bob agree that, if BCC didn’t know, there’s only a 5% chance that they would have produced internal research documents looking into potential causes of chronic halitosis in conjunction with Dangerous Pesticide. Now all Alice and Bob have to do is agree on the probability of such documents existing if BCC did know in advance, and they can do a Bayesian update! They won’t end up with identical posteriors, but if they agree about all of the relevant probabilities, they will necessarily agree more after collecting evidence than they did before.
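To make the update they’re hoping for concrete: suppose, hypothetically, that Bob went along with Alice’s 95% figure (given below) for the chance of finding such documents if BCC knew. A minimal sketch of the arithmetic:

```python
def posterior(prior, p_docs_if_knew, p_docs_if_not):
    """Bayes' rule: P(knew | docs) = P(docs | knew) * P(knew) / P(docs)."""
    p_docs = p_docs_if_knew * prior + p_docs_if_not * (1 - prior)
    return p_docs_if_knew * prior / p_docs

# Hypothetically agreed likelihoods: P(docs | knew) = 0.95, P(docs | didn't know) = 0.05
print(posterior(0.30, 0.95, 0.05))  # Alice: ~0.89
print(posterior(0.90, 0.95, 0.05))  # Bob:   ~0.99
```

They start sixty percentage points apart and end up about ten apart: not identical posteriors, but closer than before.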
But they can’t agree on how to update. Alice thinks that, if BCC knew, there’s a 95% chance that they’ll discover related internal documents. Bob, being a devout conspiracy theorist, thinks the chance is only 2%: if they knew about the problem in advance, then of course they would have been tipped off about the investigation in advance; they have spies everywhere and they’re not that sloppy, and why wouldn’t the government just classify the smoking-gun documents to keep the public in the dark anyway? They’re already doing that about aliens, after all!
Alice thinks this is a bit ridiculous, but she knows the relevant agreement theorems, and Bob is at least giving probabilities and sticking to them, so she persists and subdivides the hypothesis space. She thinks there’s a 30% chance that BCC knew in advance, but only a 10% chance that they were tipped off. Bob thinks there’s a 90% chance they knew in advance, and an 85% chance they were tipped off. If they knew but were not tipped off, Alice and Bob manage to agree that there’s a 96% chance of discovering the relevant internal documents.
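To spell out what that subdivision buys them (reading the tip-off numbers as conditional on BCC knowing, which the figures above leave slightly ambiguous), the likelihood they still need decomposes over the sub-hypotheses, with only the conspiracy case left unspecified. A sketch:

```python
def p_docs_given_knew(p_tipped, p_docs_if_tipped, p_docs_if_not_tipped=0.96):
    """Marginalize the likelihood over the tip-off sub-hypothesis."""
    return p_tipped * p_docs_if_tipped + (1 - p_tipped) * p_docs_if_not_tipped

# The one missing number is P(docs | knew and were tipped off).
# For Alice (10% tip-off) it barely matters: the likelihood stays between 0.864 and 0.964.
# For Bob (85% tip-off) it dominates: the likelihood swings from 0.144 all the way up to 0.994.
```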
Now they just have to agree on the probability of discovering the related internal documents if there’s a conspiracy. But again, they fail to agree. You see, Bob explains, it all depends on whether the Rothschilds are involved—the Rothschilds are of course themselves vexed with chronic halitosis, which explains why they were so involved in the invention of the breathmint, and so if there were a secret coverup about the causes of halitosis, then of course the Rothschilds would have caught wind of this through their own secret information networks and intervened, and that’s not even getting into the relevant multi-faction dynamics! At this point Alice leaves and conducts her investigation privately, deciding that reaching agreement with Bob is more trouble than it’s worth.
My point is: we can guarantee reasonable updates when Bayesian reasoners agree on how to update on every hypothesis, but it’s extremely hard to come to such an agreement, and even reasoners who agree about the probability of some hypothesis X can disagree about the probability distribution “underneath” X, such that they disagree wildly about P(E|X). In practice we don’t exhaustively enumerate every sub-hypothesis; instead we make assumptions about causal mechanisms and so feel justified in saying that this sort of enumeration is not necessary. If we want to determine the gravitational constant, for example, it’s helpful to assume that the speed at which a marble falls does not meaningfully depend on its color.
And yet how can we do this? In the real world we rarely care about reaching rational agreement with Bob, and indeed we often have good reasons to suspect that this is impossible. But we do care about, for example, reaching rational agreement with those who believe that dark matter is merely a measurement gap, or with those who believe that AI cannot meaningfully progress beyond human intelligence with current paradigms. Disagreement about how to assign the probability mass underneath a hypothesis is the typical case. How could we reasonably come to agreement in a Bayesian framework when we cannot even in principle enumerate the relevant hypotheses, when we suspect that the correct explanation is not known to anybody at all?
Here’s one idea: enumerate, in exhaustive detail, just one hypothesis. Agree about one way the world could be—we don’t need to decide whether the Rothschilds have bad breath, let’s just live for a moment in the simple world where they aren’t involved. Agree on the probability of seeing certain types of evidence if the world is exactly that way. If we cannot agree, identify the source of the disagreement and introduce more specificity. Design a repeatable experiment which, if our single hypothesis is wrong, might give different-from-expected results, and repeat that experiment until we get results that could not plausibly be explained by our preferred hypothesis. With enough repetition, even agents who have wildly different probability distributions on the complement should be able to agree that the one distinguished hypothesis is probably wrong. A one-in-a-hundred coincidence might still be the best explanation for a given result, but a one-in-a-hundred-trillion coincidence basically never is.
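As a toy illustration of that last point, with made-up numbers: suppose each repetition of the experiment produces a result that has probability 0.01 under the distinguished hypothesis, and the repetitions are independent, so seven runs together have probability 0.01^7 under it, about one in a hundred trillion. Two observers with wildly different views of the alternatives still end up nearly agreeing:

```python
def posterior_h0(prior_h0, n_runs, p_run_given_h0=0.01, p_run_given_alts=0.5):
    """Posterior on the distinguished hypothesis H0 after n independent runs,
    for an observer who lumps everything else into 'alternatives' with some
    average per-run likelihood."""
    like_h0 = p_run_given_h0 ** n_runs
    like_alts = p_run_given_alts ** n_runs
    return like_h0 * prior_h0 / (like_h0 * prior_h0 + like_alts * (1 - prior_h0))

# A near-certain believer in H0 whose alternatives fit the data well:
print(posterior_h0(0.999, 7, p_run_given_alts=0.9))   # ~2e-11
# Someone at fifty-fifty on H0 whose alternatives barely fit the data at all:
print(posterior_h0(0.5, 7, p_run_given_alts=0.05))    # ~1e-5
```

Their priors and their pictures of the alternatives differ enormously, but both end up assigning essentially no credence to the distinguished hypothesis, which is all the repeated experiment needed to establish.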
Not always, not only, but when you want your results to be legible and relevant to people with wildly different beliefs about the hypothesis space, you should at some point conduct a procedure along these lines.
That is to say, in the typical course of scientific discovery, you should compute a p-value.
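For concreteness, here is a minimal sketch of what that computation typically looks like, with made-up numbers: an intervention is tried on 100 matched pairs and the treated side does better in 70 of them. The distinguished hypothesis “the intervention has no effect” fully specifies the outcome distribution, so anyone can check the tail probability without agreeing on anything else:

```python
from math import comb

n, better = 100, 70  # made-up data: treated side did better in 70 of 100 pairs

# Under the fully specified null "the intervention has no effect", each pair is a
# fair coin flip, so the one-sided p-value is the chance of a result at least this extreme.
p_value = sum(comb(n, k) for k in range(better, n + 1)) / 2**n
print(p_value)  # ~4e-5
```

No alternative hypothesis had to be specified to compute that number.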
No, a p-value is simply the probability of the observation (or of observations at least as extreme) under a kind of gerrymandered hypothesis. You can communicate the same thing in a Bayesian fashion; you just need to specify the class of hypotheses you are declaring the “null hypothesis”. In doing so, your choice can be exposed to scrutiny, and making anything of this kind explicit really helps with making any sense of what is actually going on.
Yes, of course any Bayesian analysis will require creating classes of hypotheses and assigning odds ratios to them. So does doing any kind of analysis with p-values; it’s just that with p-values you are elevating one such class of hypotheses to a special status of “null hypothesis” and claiming objectivity when no such objectivity exists.
Yes, this is what a p-value is? Perhaps I am confused; are you saying something here that I’m not saying? (Although note that we cannot perform an update this way; we need conditional probabilities for the evidence under both the hypothesis under consideration and its complement.)
Only in the sense that one hypothesis can form a class! It’s extremely reasonable to say something of the form “the normal understanding of this phenomenon implies this outcome distribution, the true outcome was very unlikely under that distribution, thus we should think harder about the part of the normal understanding that addresses this situation”, and we do not need to really come up with an alternate hypothesis to do this. I agree that you shouldn’t compute p-values for hypotheses that you don’t have reason to believe in advance will be prominent in the minds of people you want to communicate your results to; if that doesn’t address what you’re saying about objectivity (actually even if it does), then I’m pretty confused by the last clause here.
By calling something the “p-value” you are elevating the null hypothesis to a special status (and usually leaving various things about it underspecified). Just don’t do that. When reporting an experimental result, just report the probability of the observed data under multiple hypotheses. This would generally not be considered giving a “p-value”.
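A sketch of what that looks like for some made-up data (say, 70 successes out of 100 trials) and a made-up grid of hypotheses:

```python
from math import comb

n, successes = 100, 70  # made-up data

# Report P(data | H) for several explicitly specified hypotheses.
hypotheses = {"no effect (p = 0.5)": 0.5, "modest effect (p = 0.6)": 0.6, "large effect (p = 0.7)": 0.7}
for name, p in hypotheses.items():
    likelihood = comb(n, successes) * p**successes * (1 - p) ** (n - successes)
    print(f"{name}: {likelihood:.2g}")
```

Anyone can combine those likelihoods with whatever priors they hold; no single hypothesis was labelled the null.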
It seems like your post is trying to say something like “ah, no, don’t do Bayesian statistics, p-values are better sometimes actually”. But no, Bayesian statistics in this sense is just better and more straightforward, as far as I can tell, and you get the things you would get from a p-value by default if you did any Bayesian statistics.
Of course when you call something a p-value you should in your mind add “(for a particular choice of null hypothesis)”; calling something “the” p-value doesn’t really make sense. But I don’t think it’s correct to say that this elevates the null hypothesis to a special status; in many (perhaps even most) cases, there’s a hypothesis which already has special status, and addressing that hypothesis in particular is productive in a way that cataloguing and grouping several alternate hypotheses is not.
This is precisely the opposite of the truth. As I tried to explain in the original post, roughly the entire benefit of reporting p-values instead of full Bayesian updates is that if we try to do a full update, we will necessarily underspecify many hypothesis classes, but by focusing on one hypothesis (in the cases where there’s a hypothesis that it makes sense to focus on), especially if that hypothesis is deliberately “simple” (something like “this intervention does not have an effect on this observable”), we can fully specify it. This is exactly the problem that p-values solve.
I don’t think this is what my post says or that this is a plausible reading of it. A p-value is a conditional probability, and since a principled Bayesian update involves considering all of the relevant conditional probabilities, of course it contains all of the information that a p-value gives, and indeed the only way to interpret a p-value is in this light. But we can hardly do principled Bayesian updates for ourselves, and we can’t effectively communicate them, and moreover, if there are competing explanations for the data, we often can’t tell which explanation is best or whether a correct explanation is among the hypotheses we’re really considering. In these cases, the correct part of our internal update to report is often just a p-value.
The concrete examples I have in mind are the discoveries of the CMBR and of the muon: Penzias and Wilson were not aware of the prediction of relic radiation, and nobody had even hypothesized the muon. It turned out that some Big Bang theorists at the time had started thinking about the possibility of microwave radiation left over from the Big Bang, and so Penzias and Wilson were eventually made aware of this work and realized what they had discovered. But when they conducted their experiment, what was relevant was not the full Bayesian update but one conditional: their observations simply were not consistent with a steady-state universe. They did not need to know about the alternative hypotheses to know that this was important.

In the other example, Yukawa had predicted the existence of mesons before the muon was discovered; in particular, he had predicted what we now call the pi meson, and since the mass of the muon matched the predicted mass of the pi meson, many people at the time guessed that Anderson and Neddermeyer had observed the pi meson. But they had not! Of course it wasn’t necessarily a mistake to update toward the best available hypothesis, but nonetheless, it was incorrect. The relevant discovery here was not that Yukawa’s prediction fit the newly-observed particle better than any other available explanation; it was simply that the newly-observed particle could not be any previously-observed particle, and so something new had been discovered.
You give as an example a situation which is inherently not repeatable, where we’re forced to make do with reasoning under significant uncertainty and with very limited information, to decide what’s going on out of an incredibly wide hypothesis space. You correctly point out that this is hard.
You then say that in a situation where we can perform repeated experiments to exclude one hypothesis, p-values work ok.
But in that exact situation Bayesian reasoning works fine. Sure, you might not agree on which alternative hypothesis is true, but so long as both of you agree there are any alternative hypotheses that make it more likely to see the given results, after a few rounds you’ll have extremely low credence in the original hypothesis.
Bayesian reasoning does work fine here, but if you were trying to communicate how you changed your mind about the original hypothesis, you wouldn’t report all your updates, because you wouldn’t (and shouldn’t) go through the process of enumerating all of the alternatives you considered and the likelihoods under those alternatives and your priors and justifying an estimate of the likelihood under alternatives like “or something I haven’t thought of”. If you’re interested in a distinguished hypothesis, which you almost always are (hypotheses like “this intervention has no effect” or “the normal explanation of how this process works is correct” are basically always available), then the most important thing you should report is the probability of the evidence under that distinguished hypothesis, since the updates on that hypothesis should agree even if the auxiliary updates do not, and that’s the hypothesis your peers will tend to care about the most.
A company knowing or not knowing something is not binary. Reality is complex. Companies have plenty of communication about potential issues that don’t definitely demonstrate that a problem exists. A situation where a low-level employee has been given information is not the same as one where the information is known to company leadership. There’s a huge difference between the company knowing about the issue a decade before the “public revelation” and them knowing a month before it because the lead researcher asked them to look over their numbers.
Proving whether or not a chemical causes a given issue isn’t easy. The scenario where an independent researcher is able to provide definite proof that a chemical causes a certain illness, and the company never got any idea that there’s a possible link that could be investigated before the official publication, seems to me not how these things usually play out.
Most of the time what we care about isn’t a binary outcome. When deciding whether or not to take a drug, we care a lot about the strength of the effect of the drug. Measuring effect sizes is important.
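To make the contrast concrete (entirely made-up numbers): the number a patient wants is not whether the effect is zero but roughly how big it is, something like:

```python
import statistics

# Made-up symptom scores (lower is better) for treated vs. control patients.
treated = [4.1, 3.8, 5.0, 4.4, 3.9, 4.6, 4.2, 4.8]
control = [5.2, 5.0, 4.7, 5.5, 4.9, 5.3, 5.1, 4.6]

# Point estimate of the effect and a rough standard error for the difference in means.
effect = statistics.mean(treated) - statistics.mean(control)
se = (statistics.variance(treated) / len(treated)
      + statistics.variance(control) / len(control)) ** 0.5
print(f"estimated effect: {effect:.2f} +/- {1.96 * se:.2f}")  # estimate with a rough 95% interval
```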
I mostly agree with this; principled and robust estimation of effect sizes is hard and also important. Maybe someday I’ll write a primer on Judea Pearl covering Bayesian networks and causal graphs, which in my mind is the framework that unifies these approaches, but that would take me a while and would require some more research, so I didn’t get into it here.