I am saying that Yudkowsky is just plain wrong here, because omitting info is not the same as outright lying.
This is silly. Obviously, Yudkowsky isn’t going to go off on a tangent about all the ways people can lie indirectly, and how a Bayesian ought to account for such possibilities—that’s not the topic. In a scientific paper, it is implicit that all relevant information must be disclosed—not doing so is lying. Similarly, a scientific journal must ethically publish papers based on quality, not conclusion. They’re lying if they don’t. As for authors just not submitting papers with undesirable conclusions—well, that’s a known phenomenon, one that should be accounted for, along with the possibility that a cosmic ray has flipped a bit in the memory of the computer that you used for data analysis, and the possibility that you misremembered something about one of the studies, and a million other possibilities that one can’t possibly discuss in every blog post.
This is never the scenario, though. It is very easy to tell that the coin is not 90% biased no matter what statistics you use.
You misunderstand. H is some hypothesis, not necessarily about coins. Your goal is to convince the Bayesian that H is true with probability greater than 0.9. This has nothing to do with whether some coin lands heads with probability greater than 0.9.
I can get a lot of mileage out of designing my experiment very carefully to target that specific threshold (though of course I can never guarantee success, so I have to try multiple colors of jelly beans until I succeed).
I don’t think so, except, as I mentioned, that you obviously will do an experiment that could conceivably give evidence meeting the threshold—I suppose that you can think about exactly which experiment is best very carefully, but that isn’t going to lead to anyone making wrong conclusions.
The person evaluating the evidence knows that you’re going to try multiple colors. A frequentist would handle this with some sort of p-value correction. A Bayesian handles this by a small prior probability of the drug working, which may partly be based on the knowledge that if drugs of this class (set of colors) had a high probability of working, there would probably already be evidence of this. But this has nothing to do with the point about the stopping rule for coin flips not affecting the likelihood ratio, and hence the Bayesian conclusion, whereas it does affect the p-value.
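To make this concrete, here is a toy calculation (the numbers are purely illustrative) of how a small prior does the work that a frequentist multiple-comparisons correction would do: even a 10:1 likelihood ratio for one cherry-picked color leaves the posterior low.

```python
# Toy illustration of how a small prior absorbs a cherry-picked result.
# The prior (0.01) and likelihood ratio (10:1) are made-up numbers.

def posterior(prior, likelihood_ratio):
    """Update a prior probability by a likelihood ratio via Bayes' rule."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

prior = 0.01  # small prior: most tested jelly bean colors do nothing
lr = 10.0     # the reported 10:1 likelihood ratio for the "winning" color
p = posterior(prior, lr)
print(round(p, 3))  # ~0.092: still under 10%, despite the "strong" evidence
```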
This is silly. Obviously, Yudkowsky isn’t going to go off on a tangent about all the ways people can lie indirectly, and how a Bayesian ought to account for such possibilities—that’s not the topic. In a scientific paper, it is implicit that all relevant information must be disclosed—not doing so is lying. Similarly, a scientific journal must ethically publish papers based on quality, not conclusion. They’re lying if they don’t.
You’re welcome to play semantic games if you wish, but that’s not how most people use the word “lying” and not how most people understand Yudkowsky’s post.
By this token, p-values also can never be hacked, because doing so is lying. (I can just define lying to be anything that hacks the p-values, which is what you seem to be doing here when you say that not publishing a paper amounts to lying.)
You misunderstand. H is some hypothesis, not necessarily about coins. Your goal is to convince the Bayesian that H is true with probability greater than 0.9. This has nothing to do with whether some coin lands heads with probability greater than 0.9.
You’re switching goalposts. Yudkowsky was talking exclusively about how I can affect the likelihood ratio. You’re switching to talking about how I can affect your posterior. Obviously, your posterior depends on your prior, so with a sufficiently good prior you’ll be right about everything. This is why I didn’t understand you originally: you (a) used H for “hypothesis” instead of for “heads” as in the main post; and (b) used 0.9 for a posterior probability instead of using 10:1 for a likelihood ratio.
I don’t think so, except, as I mentioned, that you obviously will do an experiment that could conceivably give evidence meeting the threshold—I suppose that you can think about exactly which experiment is best very carefully, but that isn’t going to lead to anyone making wrong conclusions.
To the extent you’re saying something true here, it is also true for p values. To the extent you’re saying something that’s not true for p values, it’s also false for likelihood ratios (if I get to pick the alternate hypothesis).
The person evaluating the evidence knows that you’re going to try multiple colors.
No, they don’t. That is precisely the point of p-hacking.
But this has nothing to do with the point about the stopping rule for coin flips not affecting the likelihood ratio, and hence the Bayesian conclusion, whereas it does affect the p-value.
The stopping rule is not a central example of p-hacking and never was. But even for the stopping rule for coin flips, if you let me choose the alternate hypothesis instead of keeping it fixed, I can manipulate the likelihood ratio. And note that this is the more realistic scenario in real experiments! If I do an experiment, you generally don’t know the precise alternate hypothesis in advance—you want to test if the coin is fair, but you don’t know precisely what bias it will have if it’s unfair.
If we fix the two hypotheses in advance, and if I have to report all data, then I’m reduced to only hacking by choosing the experiment that maximizes the chance of luckily passing your threshold via fluke. This is unlikely, as you say, so it’s a weak form of “hacking”. But this is also what I’m reduced to in the frequentist world! Bayesianism doesn’t actually help. The key was (a) you forced me to disclose all data, and (b) we picked the alternate hypothesis in advance instead of only having a null hypothesis.
(In fact I’d argue that likelihood ratios are fundamentally frequentist, philosophically speaking, so long as we have two fixed hypotheses in advance. It only becomes Bayesian once you apply it to your priors.)
If I do an experiment, you generally don’t know the precise alternate hypothesis in advance—you want to test if the coin is fair, but you don’t know precisely what bias it will have if it’s unfair.
Yes. But as far as I can see this isn’t of any particular importance to this discussion. Why do you think it is?
If we fix the two hypotheses in advance, and if I have to report all data, then I’m reduced to only hacking by choosing the experiment that maximizes the chance of luckily passing your threshold via fluke. This is unlikely, as you say, so it’s a weak form of “hacking”. But this is also what I’m reduced to in the frequentist world! Bayesianism doesn’t actually help. The key was (a) you forced me to disclose all data, and (b) we picked the alternate hypothesis in advance instead of only having a null hypothesis.
Actually, a frequentist can just keep collecting more data until they get p<0.05, then declare the null hypothesis to be rejected. No lying or suppression of data required. They can always do this, even if the null hypothesis is true: After collecting n data points, they have a 0.05 chance of seeing p<0.05. If they don’t, they then collect nK more data points, where K is big enough that whatever happened with the first n data points makes little difference to the p-value, so there’s still about a 0.05 chance that p<0.05. If that doesn’t produce a rejection, they collect nK² more data points, and so on until they manage to get p<0.05, which is guaranteed to happen eventually with probability 1.
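A quick simulation of this escalating strategy (a rough sketch: normal-approximation p-values, and the choices n=100, K=5, 3 rounds are illustrative) shows the realized false-rejection rate already well above the nominal 5% after just a few rounds; with unboundedly many rounds it approaches 1.

```python
import math
import random

def two_sided_p(heads, n):
    """Two-sided p-value against bias=0.5, normal approximation."""
    z = abs(heads - n / 2) / math.sqrt(n / 4)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def rejects(rng, n=100, K=5, rounds=3):
    """Flip n fair coins, then n*K more, then n*K**2 more, stopping
    (and 'rejecting the null') as soon as p < 0.05."""
    heads = total = 0
    for r in range(rounds):
        batch = n * K ** r
        heads += sum(rng.random() < 0.5 for _ in range(batch))
        total += batch
        if two_sided_p(heads, total) < 0.05:
            return True
    return False

rng = random.Random(0)
trials = 1000
rate = sum(rejects(rng) for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05, despite the null being true
```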
But they aren’t guaranteed to eventually get a Bayesian to think the null hypothesis is likely to be false, when it is actually true.
Yes. But as far as I can see this isn’t of any particular importance to this discussion. Why do you think it is?
It’s the key to my point, but you’re right that I should clarify the math here. Consider this part:
Actually, a frequentist can just keep collecting more data until they get p<0.05, then declare the null hypothesis to be rejected. No lying or suppression of data required. They can always do this, even if the null hypothesis is true: After collecting n data points, they have a 0.05 chance of seeing p<0.05. If they don’t, they then collect nK more data points, where K is big enough that whatever happened with the first n data points makes little difference to the p-value, so there’s still about a 0.05 chance that p<0.05. If that doesn’t produce a rejection, they collect nK² more data points, and so on until they manage to get p<0.05, which is guaranteed to happen eventually with probability 1.
This is true for one hypothesis. It is NOT true if you know the alternative hypothesis. That is to say: suppose you are checking the p-value BOTH for the null hypothesis bias=0.5, AND for the alternate hypothesis bias=0.55. You check both p-values and see which is smaller. Now it is no longer true that you can keep collecting more data until your desired hypothesis wins; if the truth is bias=0.5, then after enough flips, the alternative hypothesis will never win again, and will always have astronomically small p-value.
To repeat: yes, you can disprove bias=0.5 with p<0.05; but at the time this happens, the alternative hypothesis of bias=0.55 might be disproven at p<10^{-100}. You are no longer guaranteed to win when there are two hypotheses rather than one.
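A small simulation illustrates the point (the flip count is illustrative): with the alternative fixed at bias=0.55 in advance, a truly fair coin cannot be data-mined into supporting it, because the likelihood ratio against bias=0.55 grows without bound.

```python
import math
import random

# Flip a fair coin many times, then compare the two fixed hypotheses
# bias=0.5 vs bias=0.55 by their log10 likelihood ratio.
rng = random.Random(0)
N = 100_000
heads = sum(rng.random() < 0.5 for _ in range(N))
tails = N - heads

# log10 of P(data | bias=0.5) / P(data | bias=0.55)
log10_lr = heads * math.log10(0.5 / 0.55) + tails * math.log10(0.5 / 0.45)
print(log10_lr)  # on the order of +200: bias=0.55 loses by ~10^200
```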
But they aren’t guaranteed to eventually get a Bayesian to think the null hypothesis is likely to be false, when it is actually true.
Importantly, this is false! This statement is wrong if you have only one hypothesis rather than two.
More specifically, I claim that if a sequence of coin flip outcomes disproves bias=0.5 at some p-value p, then for the same sequence of coin flips, there exists a bias b such that the likelihood ratio between bias b and bias 0.5 is O(1/p):1. I’m not sure what the exact constant in the big-O notation is (I was trying to calculate it, and I think it’s at most 10). Suppose it’s 10. Then if you have p=0.001, you’ll have likelihood ratio 100:1 for some bias.
Therefore, to get the likelihood ratio as high as you wish, you could employ the following strategy. First, flip coins until the p-value is very low, as you described. Then stop, and analyze the sequence of coin flips to determine the special bias b in my claimed theorem above. Then publish a paper claiming “the bias of the coin is b rather than 0.5, here’s my super high likelihood ratio”. This is guaranteed to work (with enough coin flips).
(Generally, if the number of coin flips is N, the bias b will be on the order of 1/2±O(1/√N), so it will be pretty close to 1/2; but once again, this is no different from what happens in the frequentist case, because to ensure the p-value is small you’ll have to accept the effect size being small.)
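A deterministic worked example of this relationship (using a normal-approximation p-value, so the constants are only illustrative): with 550 heads in 1000 flips, the best bias is the observed frequency b=0.55, and the resulting likelihood ratio against 0.5 is indeed on the order of 1/p.

```python
import math

# 550 heads in 1000 flips: compute the p-value against bias=0.5 and the
# likelihood ratio at the best-fitting bias (the observed frequency).
n, heads = 1000, 550

z = abs(heads - n / 2) / math.sqrt(n / 4)             # z ≈ 3.16
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # p ≈ 0.0016

b = heads / n  # likelihood is maximized at the observed frequency, 0.55
log_lr = heads * math.log(b / 0.5) + (n - heads) * math.log((1 - b) / 0.5)
lr = math.exp(log_lr)                                 # ≈ 150

print(p, lr, 1 / p)  # lr ≈ 150 is within a small constant of 1/p ≈ 640
```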
OK. I think we may agree on the technical points. The issue may be with the use of the word “Bayesian”.
Me: But they aren’t guaranteed to eventually get a Bayesian to think the null hypothesis is likely to be false, when it is actually true.
You: Importantly, this is false! This statement is wrong if you have only one hypothesis rather than two.
I’m correct, by the usual definition of “Bayesian”, as someone who does inference by combining likelihood and prior. Bayesians always have more than one hypothesis (outside trivial situations where everything is known with certainty), with priors over them. In the example I gave, one can find a b such that the likelihood ratio with 0.5 is large, but the set of such b values will likely have low prior probability, so the Bayesian probably isn’t fooled. In contrast, a frequentist “pure significance test” does involve only one explicit hypothesis, though the choice of test statistic must in practice embody some implicit notion of what the alternative might be.
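Continuing the same 550-of-1000 example, here is a sketch of how the prior defuses the post-hoc choice of b (a uniform prior on the bias is assumed here purely for illustration): the Bayesian compares the null against the *marginal* likelihood of the alternative, which is far less impressive than the best single b.

```python
import math

# Same data as before: 550 heads in 1000 flips of a coin.
n, heads = 1000, 550
tails = n - heads

# log P(data | bias = 0.5)
log_null = n * math.log(0.5)

# log of the marginal likelihood under a uniform prior on the bias b:
# integral of b^heads * (1-b)^tails db = Beta(heads+1, tails+1)
log_marginal = (math.lgamma(heads + 1) + math.lgamma(tails + 1)
                - math.lgamma(n + 2))

bayes_factor = math.exp(log_marginal - log_null)
print(bayes_factor)  # only ~6:1 against the fair coin, not ~150:1
```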
Beyond this, I’m not really interested in debating to what extent Yudkowsky did or did not understand all nuances of this problem.
A platonically perfect Bayesian given complete information and with accurate priors cannot be substantially fooled. But once again this is true regardless of whether I report p-values or likelihood ratios. p-values are fine.