“Two medical researchers use the same treatment independently, in different hospitals. Neither would stoop to falsifying the data, but one had decided beforehand that because of finite resources he would stop after treating N=100 patients, however many cures were observed by then. The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?” (Presumably the two control groups also had equal results.)
It wasn’t immediately obvious to me whether we should treat the data in these two cases identically or not. I wrote this up in case it helps clarify the situation to others.
Consider that your hypotheses are of the form “the treatment cures x% of people” for various values of x.[1] Denote this by Hx.
Updating the hypotheses in the first researcher’s case is completely standard Bayes’ theorem. How well did the hypotheses predict the data? Hx predicts it with probability
P(data|Hx) = x^r ⋅ (1−x)^(n−r) = x^70 ⋅ (1−x)^30.
So then your likelihood ratio for Hx vs. Hy is simply
P(data|Hx):P(data|Hy).
Calculating these probabilities and likelihood ratios is not easy to do in one’s head, so it’s best to use calculators and computers for this.
Example calculations with numbers
As an illustration, for x=80% and y=50%, the ratio is 224:1. This might sound surprisingly large—we are practically certain[2] that out of these two, 80% is a more correct hypothesis than 50%!
Another example: H70% is the hypothesis most favored by the data[3], and the likelihood ratio compared to H80% is 17:1. This is decently large, so the treatment probably wouldn’t cure (more than) 80% of people if we were to test it on a much larger sample (again, barring prior beliefs to the contrary).
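For concreteness, here is a small Python sketch (mine, not the post’s code) that reproduces these ratios directly from the formula above:

```python
def likelihood(x, r=70, n=100):
    """The post's formula for P(data | Hx): x^r * (1 - x)^(n - r).
    Any constant factor would cancel in the ratios anyway."""
    return x**r * (1 - x)**(n - r)

# Likelihood ratio for H80% vs. H50%: roughly 224:1
print(likelihood(0.8) / likelihood(0.5))

# Likelihood ratio for H70% vs. H80%: roughly 17:1
print(likelihood(0.7) / likelihood(0.8))
```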
How about the second researcher? This is trickier. The text reads
“The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require.”
This could mean many things, but let’s choose the following simple “stopping condition”: the researcher only stops once more than 65% of the participants so far have been cured. If the cure rate is 65% or lower, the researcher continues.
Examples of what this means
If the first person is cured, great! We have a 100% cure rate.
If not, test the second person. If they are cured, we have a 50% cure rate. Not enough, so test a third person. If they are cured, too, we have a cure rate of 2/3≈67%, which is just enough, so we can stop there.
Otherwise keep on sampling.
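In code, one way to express this rule is as a predicate (a sketch; the function and variable names are my own, not the post’s):

```python
def keep_sampling(cures, participants):
    """Second researcher's rule: continue while at most 65% of the
    participants so far have been cured. (With zero participants this is
    0 <= 0, so the first person always gets tested.)"""
    return cures <= 0.65 * participants
```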
Okay! Calculating P(data|Hx) seems much harder now. Let’s just use a computer simulation:
Code
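A minimal sketch of what such a Monte Carlo simulation could look like, using the same rule as the keep_sampling predicate above (the structure and names here are my guesses, not necessarily the post’s actual code):

```python
import random

def p_data_given_h(n, r, h, keep_sampling, num_simulations=10**6):
    """Monte Carlo estimate of P(data | Hx): the fraction of simulated
    studies with cure probability h and the given stopping rule that end
    with exactly n participants, r of them cured."""
    matches = 0
    for _ in range(num_simulations):
        cures = participants = 0
        # Keep sampling while the rule says so; give up once the run is
        # longer than the observed study, since it can no longer match.
        while keep_sampling(cures, participants) and participants <= n:
            if random.random() < h:
                cures += 1
            participants += 1
        if participants == n and cures == r:
            matches += 1
    return matches / num_simulations

def rule_65(cures, participants):
    # Second researcher's rule from the sketch above.
    return cures <= 0.65 * participants

print(p_data_given_h(n=100, r=70, h=0.7, keep_sampling=rule_65))
```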
Run the code with n=100, r=70, h=0.7 and…
…oops, the probability is 0.
Right: it’s impossible to get exactly 100 participants with 70 cures this way. Think: if we already had 99 participants with 70 cures, we would have already stopped, since 70/99 ≈ 71% is above 65%! Same if only 69 of those 99 had been cured (69/99 ≈ 70%).
Okay, we need to adjust the simulation. Maybe the researcher wanted a “round” number of participants in the study, so that the results would appear “more natural”: say, a sample size divisible by 20. Edit the code:
Edited code
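Continuing the hypothetical sketch from above, the edit would only change the stopping rule; everything else stays the same:

```python
def rule_65_round(cures, participants):
    # Continue while at most 65% are cured OR the sample size is not yet
    # a "round" number (divisible by 20).
    return cures <= 0.65 * participants or participants % 20 != 0

print(p_data_given_h(n=100, r=70, h=0.7, keep_sampling=rule_65_round))
print(p_data_given_h(n=100, r=70, h=0.8, keep_sampling=rule_65_round))
```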
Okay, let’s again set n=100, r=70, h=0.7 and compute P(data|H70%). Result: 0.000770. Pretty small—glad we ran a lot of simulations!
Let’s also compute P(data|H80%) so that we can look at the likelihoods. Here the result is 0.000049, for a ratio of 770:49, i.e.
16:1.
Performing the same update on the first researcher’s data gives you a ratio of 17:1. Pretty close! And if you use more compute, you get closer and closer matches. The update really is the same!
There is a certain point of view from which this answer is not surprising at all, where you think “of course you get the same result”. But this point of view doesn’t come naturally to people, and communicating it isn’t trivial. Here’s my shot.
Let’s play a game of “spot the differences”:
Code 1
Code 2
Answer
Line 12 has been changed.
And, of course, code 1 corresponds to the update with the first researcher, and code 2 corresponds to the update with the second researcher.
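In terms of the hypothetical sketch above, that one-line difference would simply be the stopping rule: the first researcher’s version stops at a fixed sample size, for example

```python
def rule_fixed_n(cures, participants):
    # First researcher: keep sampling while there are fewer than 100
    # participants, regardless of how many have been cured.
    return participants < 100

# Estimates the binomial probability of exactly 70 cures out of 100.
print(p_data_given_h(n=100, r=70, h=0.7, keep_sampling=rule_fixed_n))
```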
That was an easy one. Ready for round 2?
Code 3
Code 4
Answer
These are again the probabilities and likelihood ratios for the first and second researcher. This time, however, instead of thinking simply about how many participants our study had and how many were cured, we take in the full data of which people were cured and which weren’t (the “observed_data”, which I’m shuffling here, since we don’t know the order in which they were sampled).
It might seem a priori plausible that the second code will simply return probabilities of 0 and 0, resulting in a division-by-zero error. But I rigged the random number seed so that the shuffling works out right and the probabilities are positive. This reflects how, in real life, the second researcher’s procedure really might stop after exactly 100 participants.
Given this information, can you see it now? Can you spot the difference between code 3 and code 4?
There isn’t one.
If data == data_current holds true, then it doesn’t matter whether line 11 reads “while len(data_current) < len(data)” as opposed to “while sum(data_current) <= 0.65*len(data_current) or len(data_current) % 20 != 0”. All exact matches will have been produced by sampling 100 random numbers correctly. That’s it. If you sample them correctly, it doesn’t matter what you write in line 11, and if you don’t sample them correctly, then you won’t get an exact match anyway.
The programs produce the exact same output (except for changes in random number generation).
The key thing is: the stopping condition doesn’t discriminate between hypotheses. If you look at codes 3 and 4, what matters for the stopping condition is the data, not the hypothesis.[4] Therefore, if you want to update your probabilities on hypotheses, you need only focus on the data.
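To spell this out in symbols (my notation, not the post’s): for any stopping rule that depends only on the data collected so far, the probability of ending up with a particular exact sequence of outcomes factors as

P(data|Hx) = (1 if the data are consistent with the stopping rule, else 0) ⋅ x^r ⋅ (1−x)^(n−r),

where r and n are the number of cures and participants in the data. The first factor is the same for every hypothesis, so it cancels from every likelihood ratio, and the update is exactly the same as the first researcher’s.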
[1] These are of course very simple hypotheses which don’t consider why the cure works for some people and not others, but attribute that to unknown magic random factors. A true, curious scientist would think “but why does the cure work for some people but not others?” and come up with follow-up hypotheses, not least because this could allow for improving the treatment. But the argument here would go through with more complicated hypotheses, too.
[2] Unless we had strong prior belief to the contrary. (And, of course, in real life we will have uncertainty about the methodology of such a study. Is this really all the data that was collected, or were there dropouts in the study? Was the sample representative of the whole population—how was it selected? These are often much more important than the uncertainty stemming from limited sample sizes.)
[3] As should be intuitive; this can be proven formally via calculus, or demonstrated informally by computer calculations.
[4] Indeed, it would be ontologically confused to say that the stopping condition depends on the hypotheses. Hypotheses are not the sort of thing that could affect when a researcher stops collecting data.