My take: if you are somehow going from the “real” prior probability (i.e. the figure for a true random draw from the uniform distribution on the hypothesis space, which Adam estimated in his comment as 10^-60, though I expect it could be even lower depending on exactly which hypothesis space we’re talking about) all the way to 10^-3 (the 1/1000 figure you give), you are already jumping a large number of orders of magnitude. It seems to me unjustified to assert that you can jump exactly this far, but no further. Indeed, if you can jump from 10^-60 to 10^-3, why can you not in principle jump slightly farther, and arrive at probability estimates that are non-negligible even from an everyday perspective, such as 10^-2 or even 10^-1?
And it seems to me that you must be implicitly asserting something like this if you give the probability of a random proposed theory being successful as 1 in 1000 rather than 1 in 10^60. Where did that 1/1000 number come from? It certainly doesn’t look to me like it came out of any principled estimate of how much justified Bayesian updating can be wrung out of the historically available evidence, an estimate that just happened to land at ~570 decibels and no more; in fact it seems like that 1000 number basically was chosen to roughly match the number of hypotheses you think were plausibly put forth before the correct one showed up. If so, then this is… pretty obviously not proper procedure, in my view.
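(For concreteness, here is where the ~570-decibel figure comes from: measuring evidence in decibels, i.e. ten times the base-10 logarithm of the odds ratio, and noting that for probabilities this small the odds essentially equal the probabilities, the update from 10^-60 to 10^-3 amounts to

$$10 \log_{10}\!\left(\frac{10^{-3}}{10^{-60}}\right) = 10 \times 57 = 570\ \text{dB}.$$

)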
For myself, I basically find Eliezer’s argument in Einstein’s Speed as convincing as I did when I first read it, and for basically all the same reasons: finding the right theory and promoting it to the range where it first deserves attention but before it becomes an obvious candidate for most of the probability mass requires hitting a narrow target in update-space, and humans are not in general known for their precision. With far greater likelihood, if somebody identified the correct-in-retrospect theory, the evidence available to them at the time was sufficient from a Bayesian perspective to massively overdetermine that theory’s correctness, and it was only their non-superintelligence that caused them to update so little and so late. Hitting a narrow range is implausible; overshooting that range, on the other hand, significantly less so.
At this point you may protest that the 1/1000 probability you give is not meant as an estimate for the actual probability a Bayes-optimal predictor would assign after updating on the evidence; instead it’s whatever probability is justified for a human to assign, knowing that they are likely missing much of the picture, and that this probability is bounded from above at 10^-3 or thereabouts, at least for the kind of hard scientific problems the OP is discussing.
To be blunt: I find this completely unpersuasive. Even ignoring the obvious question from before (why 10^-3?), I can see no a priori reason why someone could not find themselves in an epistemic state where (from the inside at least) the evidence they have implies a much higher probability of correctness. From this epistemic state they might then find themselves producing statements like:
I believe myself to be writing a book on economic theory which will largely revolutionize—not I suppose, at once but in the course of the next ten years—the way the world thinks about its economic problems. I can’t expect you, or anyone else, to believe this at the present stage. But for myself I don’t merely hope what I say—in my own mind, I’m quite sure.
—John Maynard Keynes
statements which, if you insist on maintaining that 10^-3 upper bound (and why so, at this point?), certainly become much harder to explain without resorting to some featureless “overconfidence” thingy; and the problems with that explanatory move have been discussed in detail elsewhere.
Again, I’m not claiming that this is true in general. I think it is plausible to reach, idk, 90% confidence, maybe higher, that a specific idea will revolutionize the world, even before getting any feedback from anyone else or running experiments in the world. (So I feel totally fine with the statement from Keynes that you quoted.)
I would feel very differently about this specific case if there were an actual statement from Sadi Carnot of the form “I believe that this particular theorem is going to revolutionize thermodynamics” (and he didn’t make similar statements about other things that were not revolutionary).
it seems like that 1000 number basically was chosen to roughly match the number of hypotheses you think were plausibly put forth before the correct one showed up. If so, then this is… pretty obviously not proper procedure, in my view.
I totally agree that’s what I did, but it seems like a perfectly fine procedure. Idk where the disconnect is, but maybe you’re thinking of “1000” as coming from a weirdly opinionated prior, rather than from my posterior.
From my perspective, I start out having basically no idea what the “justifiable prior” on that hypothesis is. (If you want, you could imagine that my prior on the “justifiable prior” was uniform over log-10 odds of −60 to 10; my prior is more opinionated than that but the extra opinions don’t matter much.) Then, I observe that the hypothesis we got seems to be kinda ad hoc with no great story even in hindsight for why it worked while other hypotheses didn’t. My guess is then that it was about as probable (in foresight) as the other hypotheses around at the time, and combined with the number of hypotheses (~1000) and the observation that one of them worked, you get the probability of 1/1000.
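As a sanity check on this reasoning, here is a minimal sketch in Python (my construction, not anything actually computed in the comment above): put a near-ignorant prior on the “justifiable prior” p, uniform over log-10 odds from −60 to +10 as described, treat the ~1000 contemporaneous hypotheses as exchangeable draws with success probability p, and condition on exactly one success. The posterior mean lands right around 1/1000.

```python
import numpy as np

# Sketch of the update described above (illustrative, not rigorous):
# start nearly ignorant about the "justifiable prior" p, with a uniform
# prior over log-10 odds from -60 to +10, then condition on seeing
# exactly 1 success among ~1000 roughly exchangeable hypotheses.

log10_odds = np.linspace(-60, 10, 70_001)   # grid over log-10 odds of p
odds = 10.0 ** log10_odds
p = odds / (1 + odds)                       # convert odds to probability

n, k = 1000, 1                              # ~1000 hypotheses, 1 worked
# Binomial log-likelihood of k successes out of n at each candidate p.
log_like = k * np.log(p) + (n - k) * np.log1p(-p)

post = np.exp(log_like - log_like.max())    # prior is uniform on this grid
post /= post.sum()

print(f"posterior mean of p ≈ {np.sum(post * p):.1e}")  # ≈ 1e-03
```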
(I guess a priori you could have imagined that hypotheses should either have probability approximately 10^-60 or approximately 1, since you already have all the bits you need to deduce the answer, but it seems like in practice even the most competent people frequently try hypotheses that end up being wrong / unimportant, so that can’t be correct.)
As a different example, consider machine learning. Suppose you tell me that <influential researcher> has a new idea for RL sample efficiency they haven’t tested, and you want me to tell you the probability it would lead to a 5x improvement in sample efficiency on Atari. It seems like the obvious approach is to plot how much sample efficiency improved with previous ideas from that researcher (and from other similar researchers, to increase sample size), use that to estimate P(effect size > 5x | published), and then apply an ad hoc correction for publication bias. I claim that my reasoning above is basically analogous to this reasoning.
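A minimal sketch of that procedure, with made-up effect sizes and a made-up bias correction (nothing here is real data about any researcher):

```python
import numpy as np

# Hypothetical sample-efficiency multipliers from a researcher's (and
# similar researchers') previously published ideas; placeholder numbers.
published_improvements = np.array(
    [1.2, 1.5, 2.0, 3.0, 1.1, 6.0, 1.3, 2.5, 4.0, 1.8]
)

# Empirical estimate of P(effect size > 5x | published).
p_big_given_published = np.mean(published_improvements > 5.0)

# Ad hoc correction for publication bias: an untested idea hasn't yet
# survived the "worked well enough to publish" filter. The 0.5 here is
# an assumed, illustrative discount.
publication_bias_discount = 0.5
p_big = p_big_given_published * publication_bias_discount

print(f"P(>5x improvement) ≈ {p_big:.2f}")  # 0.10 * 0.5 = 0.05
```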