I played around with a small Python script (a rough sketch is at the end of this comment) to see what happens in slightly more complicated settings.
Simple observations:
no matter the distribution, if noise has a fatter tail / is bigger than signal, you’re screwed (you can’t trust the top studies at all);
no matter the distribution, if signal has a fatter tail / is bigger than noise, you’re in business (you can trust the top studies);
in the critical regime where both distributions are the same, expected quality = performance / 2 seems to be true;
if noise amount is correlated with signal in a simple proportional way, then you’re in business, because the high noise studies will also be the best ones. (But this is a weird assumption...)
This would mean that often the only critical information is “is noise bigger than signal, in particular around the tails?” If noise is smaller than signal (even by a factor of 2), then you can probably trust the RCTs blindly, no matter the shape of the underlying distributions, except in weird circumstances.
The practical takeaways are:
ignore everything that has probably higher noise than signal
take seriously everything that has probably bigger signal than noise and don’t bother with corrective terms
If you’re interested, “When is Goodhart catastrophic?” characterizes some conditions on the noise and signal distributions (or rather, their tails) that are sufficient to guarantee being screwed (or in business) in the limit of many studies.
The downside is that because it doesn’t make assumptions about the distributions (other than independence), it sadly can’t say much about the non-limiting cases.
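Here is a minimal sketch of the kind of script I mean (my own reconstruction; the distributions, scales, and sample sizes are made-up choices, not anything from the post): rank studies by Performance = Quality + noise, then compare the average true Quality of the top studies with their average Performance under the three regimes above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_studies, n_top = 100_000, 100

def top_study_quality(signal_sampler, noise_sampler):
    """Mean true Quality and mean observed Performance of the top
    studies when ranked by Performance = Quality + noise."""
    quality = signal_sampler(n_studies)
    performance = quality + noise_sampler(n_studies)
    top = np.argsort(performance)[-n_top:]
    return quality[top].mean(), performance[top].mean()

gauss = lambda scale: (lambda n: rng.normal(0.0, scale, n))
cauchy = lambda scale: (lambda n: scale * rng.standard_cauchy(n))

# Same distribution for signal and noise: mean Quality of the top
# studies comes out near half their mean Performance.
print(top_study_quality(gauss(1.0), gauss(1.0)))

# Noise fatter-tailed than signal: the top studies are mostly noise,
# so their mean Quality is near zero despite huge Performance.
print(top_study_quality(gauss(1.0), cauchy(1.0)))

# Signal fatter-tailed than noise: the top studies are genuinely good,
# with mean Quality close to mean Performance.
print(top_study_quality(cauchy(1.0), gauss(1.0)))
```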
Getting close to the decade anniversary of Why the Tails Come Apart, and this is a very closely related issue to regressional Goodhart.
Those advocating for more speculative interventions point to calculations suggesting that the expected value of their interventions is extremely large. What implications, if any, does the question “How much do you believe your results?” have for this debate?
I think it highlights the importance of plausibility arguments. If you think the underlying Quality distribution is Gaussian, any claim of huge impact is going to be hard to stomach. What plausibility arguments do is say “hey, there are some really powerful interventions on the technological horizon, and so here’s the evidence that the underlying Quality distribution has some really impactful interventions in it.” They’re the strong evidence for broad facts that we might take as background knowledge, but that serve to underpin a lot of the later reasoning we might try to do.
Curated. There’s a certain kind of delight that comes from a post that doesn’t introduce new data or new concepts, but just takes some simple equations and explores them, such that you feel you learned something new, interesting, and important. Perhaps the reality is that statistics are just not that intuitive, and marginal time and effort spent gaining more understanding and mastery here is worthwhile.
I think this is a pedagogical version of Andrew Gelman’s shrinkage trilogy.
The most important paper also has a blog post. The very short version is that if you z-score the published effects, you can derive a prior for the 20,000+ effects from the Cochrane database. A Cauchy distribution fits very well. Because the Cauchy distribution has very fat tails, you should regress small effects heavily towards the null and very large effects very little.
Here is a fun figure of the effects. Medline is published work, so there are no effects between −2 and 2, as they would be ‘insignificant’; for the Cochrane collaboration they also hunted down unpublished results.
Here you see the Cochrane prior in red. You can imagine drawing a lot of random points from the red and then “adding 1 sigma of random noise”, which “smears out” the effects, creating the blue inflated effects we observe.
Notice this only works if you have standardized effects. If you observe that breastfeeding makes you 4 times richer with sigma=2, then you have z=2, which is a tiny effect, as you need 1.96 just to reach significance at the 5% level in frequentist statistics, and you should thus regress it heavily towards the null. Whereas if you observe that breastfeeding makes you 1% richer with sigma=0.01%, then z=100, which is a huge effect, and it should be regressed towards the null very little.
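To make the “shrink small effects a lot, large effects a little” point concrete, here is a minimal toy sketch (my own illustration, not the paper’s code), computing the posterior mean of the true effect under a standard Cauchy prior with unit-variance Gaussian noise:

```python
import numpy as np

def shrink(z_obs, gamma=1.0):
    """Posterior mean of theta for z_obs ~ N(theta, 1),
    theta ~ Cauchy(0, gamma), by numerical integration."""
    theta = np.linspace(-50, 50, 200_001)
    prior = 1.0 / (np.pi * gamma * (1.0 + (theta / gamma) ** 2))
    likelihood = np.exp(-0.5 * (z_obs - theta) ** 2)
    posterior = prior * likelihood
    return np.sum(theta * posterior) / np.sum(posterior)

# Small z-scores get pulled strongly towards 0;
# large z-scores are left almost untouched.
for z in [1.0, 2.0, 5.0, 10.0]:
    print(z, round(shrink(z), 2))
```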
See also: Tweedie’s formula
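For reference, the standard statement of that result (the textbook version, not something from this thread): if z | θ ~ N(θ, σ²) and m(z) is the marginal density of the observed z-scores, then

E[θ | z] = z + σ² ⋅ (d/dz) log m(z)

so the amount of shrinkage can be read off the shape of the observed z-distribution directly, without explicitly fitting a prior.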
Great post, thank you! This could help explain the general intuition about “too good to be true”.
Great post, thanks a lot!
Quick math question:
We are interested in finding the constant β such that E[Quality]=β⋅Performance.
How do we know that the expected Quality should be linear w.r.t. Performance? I did the math, and I agree with you that it is true (at least in the Gaussian case), but if you have an intuition about it I’d love to hear it!
I think that would give you a line that predicts x given y, rather than y given x.
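For what it’s worth, one way to see the linearity in the Gaussian case (a sketch, writing σ_q and σ_n for the standard deviations of the zero-mean signal and noise): Performance = Quality + noise makes (Quality, Performance) jointly Gaussian, and the conditional expectation of one jointly Gaussian variable given another is always linear, with

E[Quality | Performance = p] = (Cov(Quality, Performance) / Var(Performance)) ⋅ p = (σ_q² / (σ_q² + σ_n²)) ⋅ p

so β = σ_q² / (σ_q² + σ_n²), and the equal-variance case gives β = 1/2.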
I think one other implication of this is: “if you convince Mom you’re okay using photos you very carefully staged, at least don’t think you used to be okay when you look at them in the future.”
Nice post! I’m curious to learn how this relates to, e.g., total least-squares and instrumental variables.
Would you consider sharing the code used to generate these plots?