A Look Inside a Frequentist

Today I experienced the Sequences post Beautiful Probability for the first time. I will begin with the quotation that Eliezer opened with:
Let me introduce this issue by borrowing a complaint of the late great Bayesian Master, E. T. Jaynes (1990):
“Two medical researchers use the same treatment independently, in different hospitals. Neither would stoop to falsifying the data, but one had decided beforehand that because of finite resources he would stop after treating N=100 patients, however many cures were observed by then. The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?” (Presumably the two control groups also had equal results.)
This quote captured me completely—I found myself caught up thinking through the implications before I could even finish reading the rest of the post. It gave me an unexpected insight into the internal logic and motivations behind the frequentist approach to statistics. There is a genuine difference in the information conveyed by the two scenarios Jaynes describes.
To make the discussion more approachable, let’s introduce names. Let’s call the first researcher George. He stopped after treating 100 patients regardless of the results. The second researcher, whom we’ll call Bessel, continued until the results supported his belief. This small act of naming helps us think about their mindsets and intentions. Similarly, I’m going to call Frequentist the method that assigns the two experiments different p-values, and Bayesian the method that assigns them the same likelihood ratio.
The Frequentist Approach
Computing p-values is what Mr. Frequentist is all about. This method is centered on pre-defined procedures that allow us to interpret data in a principled and repeatable way: “What is the chance that using the same method would give this result under the null hypothesis?” Necessarily, this approach says a lot about the method being used. While this mindset may seem flawed, it reflects an important goal: measuring, a priori, how often a method will yield false positives.
When the rules are followed, the frequentist framework offers a safeguard against self-deception and overinterpretation. Even if it doesn’t capture all the nuances of belief and uncertainty, it helps maintain epistemic discipline—especially when evaluating experimental methods in advance.
So, what difference does Mr. Frequentist see between the two experiments? In George’s case we have no information except the final results. For Bessel, on the other hand, once we understand the method that was used to determine the results, we know that at every intermediate step the data did not yet indicate a cure rate definitely greater than 60%.
What It Feels Like to Be a Frequentist
George’s result is evaluated at face value: 70 out of 100 patients were cured.
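To make the face-value evaluation concrete, here is a sketch of the p-value computation for George's fixed-n design. The null cure rate of 60% is an illustrative assumption (Jaynes's quote uses 60% as a threshold, but no null hypothesis is fixed in the post):

```python
# Sketch: the frequentist question for George's fixed-n design, assuming
# (hypothetically) a null cure rate of 60%. The one-sided p-value is the
# chance of seeing 70 or more cures in 100 patients if the null were true.
from math import comb

def binom_tail(n, r, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

p_value_george = binom_tail(100, 70, 0.60)
print(f"George's one-sided p-value: {p_value_george:.4f}")
```

Under this assumed null, 70 or more cures out of 100 is fairly unlikely (a one-sided p-value in the neighborhood of 0.02–0.03), so George's design rejects the null at conventional thresholds.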
To the frequentist, however, Bessel’s result isn’t just one experiment—it looks like a sequence of 99 experiments that failed to produce the result he wanted, followed by one that did. Even though he stopped at 100 patients just like George, the manner in which he arrived at the data changes its meaning.
It feels wrong to treat the final dataset from Bessel as equivalent to George’s, because Bessel ignored the negative results of 99 experiments to get to one that was positive. If we assume an agnostic prior, George’s method is less likely to be fooled than Bessel’s approach. The higher p-value for Bessel is a way of punishing Bessel for bad experimental design.
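The intuition that Bessel's design is easier to fool can be checked by simulation. The sketch below assumes a concrete, hypothetical version of Bessel's stopping rule (reject "rate ≤ 60%" with a one-sided z-test after each patient, capped at 500 patients), since Jaynes's "definitely greater than 60%" is not a precise criterion:

```python
# Sketch: why the frequentist penalizes Bessel's design. We simulate a world
# where the null is true (true cure rate exactly 60%) and give Bessel a
# hypothetical stopping rule: after each patient, declare success if a
# one-sided z-test rejects "rate <= 60%" at the nominal 5% level. The
# criterion, the n >= 20 warm-up, and the 500-patient cap are illustrative
# assumptions, not Jaynes's exact setup.
import random
from math import sqrt

def bessel_false_positive_rate(runs=2000, max_n=500, p_null=0.60, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        cures = 0
        for n in range(1, max_n + 1):
            cures += rng.random() < p_null
            z = (cures - p_null * n) / sqrt(n * p_null * (1 - p_null))
            if n >= 20 and z > 1.645:  # nominal one-sided 5% test
                hits += 1              # Bessel stops and reports success
                break
    return hits / runs

rate = bessel_false_positive_rate()
print(f"False-positive rate with optional stopping: {rate:.2f}")  # well above 0.05
```

Even though each individual look uses a nominal 5% test, taking many looks inflates the overall false-positive rate well past 5%. That inflation is exactly what the higher p-value is pricing in.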
Viewed this way, frequentism is a kind of procedural epistemology—a way of knowing that emerges not from belief, but from method. It’s a mindset that values long-term calibration, error control, and clarity about what an analysis can and cannot say.
George and Bessel may have produced identical datasets, but the paths they took to get there matter. With no view of a priori expectations besides the null hypothesis, frequentists are forced to fall back on analyzing methods rather than priors.
The frequentist approach, even if flawed in certain respects, still serves as a valuable heuristic. It teaches us to be wary of overfitting to outcomes, to ask about the process behind the numbers, and to maintain a healthy skepticism when interpreting results. Its insistence on method over outcome protects us from the temptation to rationalize or cherry-pick. I’d rather a scientist work with p-values than with their intuition alone.
Mr. Bayesian on Mr. Frequentist
The Bayesian perspective can still view Bessel’s results as a series of 100 experiments. Because we believe in things like time and sample independence, however, the final total of the 100th experiment screens off the results of the first 99. There is no information in the first 99 results that is not contained in the final total. So Mr. Bayesian assigns the two experiments equivalent likelihood ratios. Or, if he believes that both researchers applied the treatment in the same way, he assigns a joint likelihood ratio based on the pooled data: r = 140 cures out of n = 200 patients.
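The screening-off argument can be made concrete. George's fixed-n design yields a binomial likelihood, and Bessel's stopping rule yields a likelihood with different combinatorial constants, but those constants do not depend on the hypothesized cure rate and so cancel in any likelihood ratio. The two hypothesized rates below (60% vs. 70%) are illustrative choices:

```python
# Sketch of the likelihood principle at work. George's design gives a
# binomial likelihood; Bessel's stopping rule changes only the constant in
# front (here stood in for by dropping the binomial coefficient entirely),
# not the dependence on the cure rate p. Any likelihood ratio between two
# hypothesized rates therefore comes out identical for both designs.
from math import comb

def george_likelihood(p, n=100, r=70):
    return comb(n, r) * p**r * (1 - p)**(n - r)

def bessel_likelihood(p, n=100, r=70):
    # Different stopping-rule combinatorics = different constant factor only.
    return p**r * (1 - p)**(n - r)

lr_george = george_likelihood(0.70) / george_likelihood(0.60)
lr_bessel = bessel_likelihood(0.70) / bessel_likelihood(0.60)
print(lr_george, lr_bessel)  # identical likelihood ratios
```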
Bayesian likelihood ratios are a better measure for reporting final results than p-values. But frequentist approaches remain a useful tool, especially for comparing experimental methods beforehand. Before we consign p-values permanently to the dustbin, we need to replace their strength in evaluating experimental design.