Interesting. I think you’re probably right that our model should have a parameter for “researcher quality”, and if a researcher is able to correctly predict the outcome of an experiment, that should cause an update in the direction of that researcher being more knowledgeable (and their prior judgements should therefore carry more weight, including for this particular experiment!)
But the story you’re telling doesn’t seem entirely compatible with your comment earlier in this thread. Earlier you wrote: “However, it is often the case that you could get a lot more high-quality evidence that basically settles the question, if you put in many hours of work.” But in this recent comment you wrote: “the experiment provides the last little bit of evidence needed to confirm [the hypothesis]”. In the earlier comment, it sounds like you’re talking about a scenario where most of the evidence comes in the form of data; in the later comment, it sounds like you’re talking about a scenario where most of the evidence was necessary “just to think of the correct answer—to promote it to your attention” and the experiment only provides “the last little bit” of evidence.
So I think the philosophical puzzle is still unsolved. A few more things to ponder if someone wants to work on solving it:
If Bob is known to be an excellent researcher, can we trust HARKing if it comes from him? Does the mechanism by which hindsight bias works matter? (Here is one possible mechanism.)
In your simplified model above, there’s no possibility of a result that is “just noise” and not explained by any particular hypothesis. But noise appears to be a pretty big problem (see: the replication crisis). In current scientific practice, the probability of obtaining a result at least this extreme purely through noise is a number of great interest that’s almost always calculated (the p-value). How should this number be factored in, if at all?
Note that p-values can be used in Bayesian calculations. For example, in a simplified universe where either the null is true or the alternative is true:

p(alternative|data) = p(data|alternative) p(alternative) / (p(data|alternative) p(alternative) + p(data|null) p(null))
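As a minimal sketch of that calculation (the likelihood numbers are made up for illustration, and treating the p-value as a stand-in for p(data|null) is itself a simplification):

```python
def posterior_alternative(p_data_given_alt, p_data_given_null, prior_alt):
    """Posterior probability of the alternative, assuming a two-hypothesis
    universe where either the null or the alternative is true."""
    prior_null = 1 - prior_alt
    numerator = p_data_given_alt * prior_alt
    return numerator / (numerator + p_data_given_null * prior_null)

# Made-up numbers: likelihood 0.4 under the alternative, 0.03 under the
# null (roughly, a p-value of 0.03), and a 50/50 prior.
print(posterior_alternative(0.4, 0.03, 0.5))  # ~0.93
```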
My solution was focused on a scenario where we’re considering relatively obvious hypotheses and subject to lots of measurement noise, but you convinced me this is inadequate in general.
I’m unsatisfied with the discussion around “Alice didn’t think of all of them”. I know nothing about relativity, but I imagine a big part of Einstein’s contribution was his discovery of a relatively simple hypothesis which explained all the data available to him. (By “relatively simple”, I mean a hypothesis that didn’t have hundreds of free parameters.) Presumably, Einstein had access to the same data as other contemporary physicists, so it feels weird to explain his contribution in terms of having access to more evidence.
In other words, it feels like the task of searching hypothesis space should be factored out from the task of Bayesian updating. This seems closely related to puzzles around “realizability”—through your search of hypothesis space, you’re essentially “realizing” a particular hypothesis on the fly, which isn’t how Bayesian updating is formally supposed to work. (But it is how deep learning works, for example.)
> But the story you’re telling doesn’t seem entirely compatible with your comment earlier in this thread.
The earlier comment was comparing experiments to “armchair reasoning”, while the later comment was comparing experiments to “all prior knowledge”. I think the typical case is:
Amount of evidence in “all prior knowledge” >> Amount of evidence in an experiment >> Amount of evidence from “armchair reasoning”.
> If Bob is known to be an excellent researcher, can we trust HARKing if it comes from him?
I would pay a little more attention, but not that much more, and would want an experimental confirmation anyway. It seems to me that the world is complex enough, and humans model it badly enough (for the sorts of things academia is looking at), that past evidence of good priors on one question doesn’t imply good priors on a different question.
(This is an empirical belief; I’m not confident in it.)
> In your simplified model above, there’s no possibility of a result that is “just noise” and not explained by any particular hypothesis.
I expect that if you made a more complicated model where each hypothesis H had a likelihood p(D∣H), and p(D∣H) was high for N hypotheses and low for the rest, you’d get a similar conclusion, while accounting for results that are just noise.
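Here’s a minimal sketch of the kind of model I mean (all the specific numbers are made up for illustration):

```python
import numpy as np

K = 100   # total hypotheses under consideration (made-up number)
N = 5     # hypotheses that explain the data well (made-up number)

# Uniform prior over all K hypotheses.
prior = np.full(K, 1 / K)

# Likelihood p(D|H): high for the first N hypotheses, low (noise-level)
# for the rest.
likelihood = np.full(K, 0.01)
likelihood[:N] = 0.8

posterior = prior * likelihood
posterior /= posterior.sum()

print(posterior[:N].sum())  # ~0.81: mass concentrated on the N good hypotheses
print(posterior[0])         # ~0.16: each good hypothesis gets roughly 1/N of that
```

The N well-fitting hypotheses split most of the posterior mass, and the residual mass on the low-likelihood hypotheses plays the role of “the result was just noise”, so I expect the qualitative conclusion to survive.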
> I know nothing about relativity, but I imagine a big part of Einstein’s contribution was his discovery of a relatively simple hypothesis which explained all the data available to him.
I agree that relativity is an example that doesn’t fit my story, where most of the work was in coming up with the hypothesis. (Though I suspect you could argue that relativity shouldn’t have been believed before experimental confirmation.) I claim that it is the exception, not the rule.
Also, I do think it is often a valuable contribution even to think of a plausible hypothesis that fits the data, even if you should assign it a relatively low probability of being true. I’m just saying that if you want to reach the truth, this work must be supplemented by experiments / gathering good data.
> In other words, it feels like the task of searching hypothesis space should be factored out from the task of Bayesian updating.
Bayesian updating does not work well when you don’t have the full hypothesis space. Given that you know that you don’t have the full hypothesis space, you should not be trying to approximate Bayesian updating over the hypothesis space you do have.
> Bayesian updating does not work well when you don’t have the full hypothesis space.
Do you have any links related to this? Technically speaking, the right hypothesis is almost never in our hypothesis space (“All models are wrong, but some are useful”). But even if there’s no “useful” model in your hypothesis space, it seems Bayesian updating fails gracefully if you have a reasonably wide prior distribution for your noise parameters as well (then the model fitting process will conclude that the value of your noise parameter must be high).
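Here’s a minimal sketch of the “graceful failure” I have in mind (a grid approximation with made-up numbers, not a general claim):

```python
import numpy as np

# A normal model with a misspecified mean (fixed at 0) but a wide prior over
# the noise scale sigma, fit to data the model can't otherwise explain.
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=1.0, size=50)  # true mean is 10, model assumes 0

# Grid over sigma with a wide prior, p(sigma) proportional to 1/sigma.
sigmas = np.linspace(0.5, 30.0, 1000)
log_prior = -np.log(sigmas)

# Log-likelihood of the data under N(0, sigma^2) for each sigma
# (dropping the constant term).
log_lik = np.array([
    -0.5 * np.sum((data / s) ** 2) - len(data) * np.log(s) for s in sigmas
])

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

print(sigmas[np.argmax(post)])  # ~10: the fit blames the misspecification on noise
```

The posterior never flags that the mean is wrong; it just inflates the noise scale, which at least prevents overconfident predictions.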
No, I haven’t read much about Bayesian updating. But I can give an example.
Consider the following game. I choose a coin. Then, we play N rounds. In each round, you bet on whether the coin will come up Heads or Tails, at 1:2 odds which I must take (i.e. if you’re right I give you $2, and if I’m right you give me $1). Then I flip the coin and the bet resolves.
If your hypothesis space is “the coin has some bias b of coming up Heads”, then you will eagerly accept this game for large enough N: you will quickly learn the bias b from experiments, and then you can keep making money in expectation.
However, if it turns out I am capable of making the coin come up Heads or Tails as I choose, then I will win every round. If you keep doing Bayesian updating on your misspecified hypothesis space, your bets will keep flip-flopping between Heads and Tails, while you quickly converge to near-certainty that the bias is 50% (since the pattern will be HTHTHTHT...), and yet I will be taking a dollar from you every round. Even if you have the option of quitting, you will never exercise it, because you keep thinking that the EV of the next round is positive.
Noise parameters can help (though the bias b is kind of like a noise parameter here, and it didn’t help). I don’t know of a general way to use noise parameters to avoid issues like this.
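If it helps, here’s a quick simulation sketch of the game (the bettor’s Beta prior and tie-breaking rule are my assumptions, not part of the game’s description):

```python
# Bayesian bettor with a Beta(1, 1) prior on the coin's bias toward Heads,
# against an adversary who makes the coin land opposite every bet.
heads, tails = 1, 1  # Beta pseudo-counts
bankroll = 0.0

for _ in range(100):
    p_heads = heads / (heads + tails)  # posterior predictive P(next flip = H)
    bet_heads = p_heads >= 0.5         # bet on the side the posterior favors

    flip_heads = not bet_heads         # adversary: always the opposite

    # 1:2 odds: win $2 when right, lose $1 when wrong (here: always wrong).
    bankroll += 2.0 if bet_heads == flip_heads else -1.0

    # Standard Beta-Bernoulli update on the observed flip.
    if flip_heads:
        heads += 1
    else:
        tails += 1

    # Under the bettor's own posterior, the next round's EV is
    # 3 * max(p_heads, 1 - p_heads) - 1 >= 0.5 > 0, so quitting never looks good.

print(bankroll)                 # -100.0: lost a dollar every round
print(heads / (heads + tails))  # ~0.5: near-certain the coin is fair
```

The bettor ends up near-certain the coin is fair, keeps computing a positive EV for the next round, and loses a dollar every round.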
Thanks for the example!