I agree with your final paragraph – I’m fine with assuming there is a true probability. That said, I think there’s an important difference between how accurate a prediction was, which can be straightforwardly defined as its similarity to the true probability, and how good a job the predictor did.
If we’re just talking about the former, then I don’t disagree with anything you’ve said, except that I would question calling it an “epistemically good” prediction – “epistemically good” sounds to me like it refers to performance. Either way, mere accuracy seems like the less interesting of the two.
If we’re talking about the latter, then using the true probability as a comparison is problematic even in principle because it might not correspond to any intuitive notion of a good prediction. I see two separate problems:
There could be hidden variables. Suppose there is an election between candidate A and candidate B. Unbeknownst to everyone, candidate A has a brain tumor that will dramatically manifest itself three days before election day. Given this, the true probability that A wins is very low. But that can’t mean people who assign low probabilities to A winning all did a good job – by assumption, their prediction was unrelated to the reason the probability was low.
Even if there are no hidden variables, it might be that accuracy doesn’t monotonically increase with improved competence. Say there’s another election (no brain tumor involved). We can imagine that all of the following is true:
Naive people will assign about 50/50 odds
Smart people will recognize that candidate A will have better debate performance and will assign 60/40 odds
Very smart people will recognize that B’s poor debate performance will actually help them because it makes them relatable, so they will assign 30/70 odds
Extremely smart people will recognize that the economy is likely to crash before election day, which will hurt B’s chances more than everything else, and will assign 80/20 odds. This is similar to the true probability.
In this case, going from smart to very smart actually makes your prediction worse, even though you picked up on a real phenomenon.
I personally think it might be possible to define the quality of a single prediction in a way that includes the true probability, but I don’t think it’s straightforward.
I have never used Headspace, but I can say that I found it highly valuable to repeat the introductory course on Waking Up, which does fit your assessment that it moves too fast to learn the concepts the first time.
Also, I apologize for the statement that I “understand you perfectly” a few posts back. It was stupid and I’ve edited it out.
Ok this confirms you haven’t understood what I’m claiming.
I’m arguing against this claim:
I don’t think there is any difference in those lists!
I’m saying that it is harder to make a list where all predictions seem obviously false and have half of them come true than it is to make a list where half of all predictions seem obviously false and half seem obviously true and have half of them come true. That’s the only thing I’m claiming is true. I know you’ve said other things and I haven’t addressed them; that’s because I wanted to get consensus on this thing before talking about anything else.
A list of predictions that all seem extremely unlikely to come true according to common wisdom.
I agree that for the examples you’re naming (e.g., demanding strong evidence/resisting social pressure), there is a failure mode that looks like you’re going too far (e.g., being excessively dogmatic/being contrarian).
However, I don’t think that this failure mode actually results from identifying the underlying principle and then taking it to the extreme, and I think that’s an important point to clarify. For example, in the first case, the principle I see is something like “demand strong evidence for strongly held beliefs” or even more generally “believe things only as strongly as evidence suggests.” I don’t think it’s obvious that this principle can be taken too far. In particular, I think the following
A famous spoof article jokes that we don’t know parachutes are reliable because we don’t have a randomised controlled trial.
is not an example of doing that. Rather, the mistake here is something like, “equating rationality with academic science.” We don’t have a formally conducted study on the effectiveness of parachutes, and if you think that’s the only evidence that counts, you might mistrust parachutes. But, as a matter of fact, we have excellent evidence to believe that parachutes work, and believing this evidence is perfectly rational. So you cannot arrive at a mistrust of parachutes by having high standards for evidence, you can only arrive at it by being wrong about what kind of evidence does and doesn’t count.
Again, I only mean this as a clarification, not as a counterpoint. It is still absolutely possible to go wrong in the ways you describe, and avoiding that is important.
Well, now you’ve changed what you’re arguing for. You initially said that it doesn’t matter which way predictions are stated, and then you said that both lists are the same.
(Edit: deleted a line based on tone. Apologies.)
Everything except your last two paragraphs argues that a single 50% prediction can be flipped, which I agree with. (Again: for every n predictions, there are 2^n ways to phrase them and precisely 2 of them are maximally bold. If you have a single prediction, then 2^n = 2. There are only two ways, both are maximally bold and thus equally bold.)
When it comes to a list of 50% predictions, it’s impossible to evaluate the impressiveness only by looking at how many came true, since it’s arbitrary which way they are phrased
I have proposed a rule that dictates how they are phrased. If this rule is followed, it is not arbitrary how they are phrased. That’s the point.
Again, please consider the following list:
The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (50%)
Tesla’s stock price at the end of the year 2020 is between $512 and $514 (50%)
You have said that there is no difference between the two lists. But this is obviously untrue. I hereby offer you $2000 if you provide me with a list of this kind and you manage to have, say, at least 10 predictions where between 40% and 60% come true. Would you offer me $2000 if I presented you with a list of this kind:
Tesla’s stock price at the end of the year 2020 is below $512 or above $514 (50%)
and between 40% and 60% come true? If so, I will PM you one immediately.
I think you’re stuck on the fact that a 50% prediction also predicts the negated statement with 50%, therefore you assume that the entire post must be false, and therefore you’re not trying to understand the point the post is making. Right now, you’re arguing for something that is obviously untrue. Everyone can make a list of the second kind; no one can make a list of the first kind. Again, I’m so certain about this that I promise you $2000 if you prove me wrong.
As has been noted, the impressiveness of the predictions has nothing to do with which way round they are stated; predicting P at 50% is exactly as impressive as predicting ¬P at 50% because they are literally the same.
If that were true, then the list
⋯ (more extremely narrow 50% predictions)
and the list
⋯ (more extremely narrow 50% predictions where every other one is flipped)
would be equally impressive if half of them came true. Unless you think that’s the case, it immediately follows that the way predictions are stated matters for impressiveness.
It doesn’t matter in the case of a single 50% prediction, because in that case, one of the phrasings follows the rule I propose, and the other follows the inverse of the rule, which is the other way to maximize boldness. As soon as you have two 50% predictions, there are four possible phrasings and only two of them maximize boldness. (And with n predictions, 2^n possible phrasings and only 2 of them maximize boldness.)
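To make the counting concrete, here is a minimal Python sketch. It uses my own illustrative formalization, not anything from the post: the boldness of a phrasing is the absolute sum of signed deviations from the baseline, so deviations that all point the same way add up while mixed phrasings partially cancel. Under that assumption, enumerating all 2^n phrasings of a small example list finds exactly the two maximizers described above:

```python
from itertools import product

# Each prediction is (confidence, baseline), already phrased per the rule
# (confidence above baseline). The numbers are made up for illustration.
predictions = [(0.5, 0.01), (0.5, 0.02), (0.5, 0.05)]

def boldness(phrasing, predictions):
    """Boldness of one way of phrasing the list.

    phrasing[i] == +1 keeps prediction i as stated; -1 flips it to its
    negation (confidence and baseline both become 1 - x, so its deviation
    from the baseline changes sign). The score is the absolute sum of the
    signed deviations: same-direction deviations add up, mixed ones cancel.
    """
    return abs(sum(sign * (conf - base)
                   for sign, (conf, base) in zip(phrasing, predictions)))

scores = {p: boldness(p, predictions)
          for p in product((+1, -1), repeat=len(predictions))}
best = max(scores.values())
maximizers = [p for p, s in scores.items() if s == best]

print(len(scores))   # 2^n = 8 possible phrasings
print(maximizers)    # exactly 2: all kept (+1, +1, +1) and all flipped (-1, -1, -1)
```

With a single prediction the same script finds two phrasings and both maximize the score, which matches the single-prediction case above.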
The person you’re referring to left an addendum in a second comment (as a reply to the first) acknowledging that phrasing matters for evaluation.
I’m very competitive and my self-worth is mostly derived from social comparison, a trait which at worst can cause me to value winning over maintaining relationships, or cause me to avoid people who have higher status than me to avoid upward comparison. In reading LW and rationalist blogs, I think I’ve turned away from useful material that takes longer for me to grasp because it makes me feel inferior. I sometimes binge on low-quality material, sometimes even seeking out highly downvoted posts; I suspect I do this because it allows me to mentally jeer at people or ideas I know are incorrect.
I want to share that I have done this as well. In my case, I would be slightly more charitable and claim that the motivation was not to jeer at people who say incorrect things but to derive a feeling that I myself am doing okay. LessWrong has very high standards and there are a lot of impressive people here, which can make it terrifying for those of us who have the deeply rooted instinct to compare ourselves to whatever people we see around us. So if I see something downvoted, it gives me reassurance that I at least must be above some vaguely defined bar.
Fixed. And thanks!
I might have been unclear, but I didn’t mean to conflate them. The post is meant to be just about impressiveness. I’ve stated at the end that impressiveness is boldness ⋅ accuracy (which I probably should have called calibration). It’s possible to have perfect accuracy and zero boldness by making predictions about random number generators.
I disagree that 50% predictions can’t tell you anything about calibration. Suppose I give you 200 statements with baseline probabilities, and you have to turn them into predictions by assigning them your own probabilities while following the rule. Once everything can be evaluated, the results on your 50% group will tell me something about how well calibrated you are.
(Edit: I’ve changed the post to say impressiveness = calibration ⋅ boldness)
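To illustrate the kind of evaluation I have in mind, here is a rough Python sketch; the buckets and outcomes below are made-up toy data, and the procedure is just one reasonable way to do it. Predictions are grouped by stated confidence (all assumed to be phrased per the rule, i.e., above their baselines), and the hit rate in each bucket is compared to the stated confidence; the 50% bucket is checked exactly like the others:

```python
from collections import defaultdict

# Each entry: (stated_confidence, came_true). The stated confidence is assumed
# to already follow the rule (phrased so that confidence is above the baseline).
# Outcomes here are invented toy data.
predictions = [
    (0.5, True), (0.5, False), (0.5, True), (0.5, False),
    (0.7, True), (0.7, True), (0.7, False),
    (0.9, True), (0.9, True),
]

buckets = defaultdict(list)
for confidence, came_true in predictions:
    buckets[confidence].append(came_true)

for confidence in sorted(buckets):
    outcomes = buckets[confidence]
    hit_rate = sum(outcomes) / len(outcomes)
    # Well-calibrated predictions have a hit rate close to the stated confidence,
    # and that check works for the 50% bucket just like for any other bucket.
    print(f"{confidence:.0%} bucket: {len(outcomes)} predictions, {hit_rate:.0%} came true")
```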
“Always phrase predictions such that the confidence is above the baseline probability”—This really seems like it should not matter. I don’t have a cohesive argument against it at this stage, but reversing should fundamentally be the same prediction.
So I’ve thought about this a bit more. It doesn’t matter how someone states their probabilities. However, in order to use your evaluation technique we just need to transform the probabilities so that all of them are above the baseline.
Yes, I think that’s exactly right. Statements are symmetric: 50% that X happens ⟺ 50% that ¬X happens. But evaluation is not symmetric. So you can consider each prediction as making two logically equivalent claims (X happens with probability p and ¬X happens with probability 1−p) plus stating which one of the two you want to be evaluated on. But this is important because the two claims will miss the “correct” probability in different directions. If 50% confidence is too high for X (Tesla stock price is in a narrow range), then 50% is too low for ¬X (Tesla stock price outside that narrow range).
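A minimal sketch of that transformation, with the function name and data shape being my own assumptions for illustration: whenever a stated probability falls below the agreed baseline, restate the prediction as its negation with the complementary probability, so every prediction ends up phrased above its baseline:

```python
def phrase_above_baseline(statement, probability, baseline):
    """Return an equivalent prediction whose confidence is above the baseline.

    If the stated probability is already at or above the baseline, keep it;
    otherwise switch to the negated statement with the complementary
    probability (and the complementary baseline).
    """
    if probability >= baseline:
        return statement, probability, baseline
    return f"NOT ({statement})", 1 - probability, 1 - baseline

# A 50% prediction on a narrow range has a tiny baseline, so it stays as stated.
print(phrase_above_baseline("Tesla closes 2020 between $512 and $514", 0.5, 0.01))
# The logically equivalent negation sits below its (huge) baseline, so it gets
# flipped back into the bold phrasing.
print(phrase_above_baseline("Tesla closes 2020 below $512 or above $514", 0.5, 0.99))
```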
(Plus, in any case, it’s not clear that we can always agree on a baseline probability)
I think that’s the reason why calibration is inherently impressive to some extent. If it were actually boldness multiplied by calibration, then you should not be impressed at all whenever the baseline pile and the confidence pile have identical height. And I think that’s correct in theory; if I just make predictions about dice all day, you shouldn’t be impressed at all regardless of the outcome. But since it takes some skill to estimate the baseline for all practical purposes, boldness doesn’t go to zero.
Oh, sorry! I’ve taken the reference to your prediction out and referred only to BetFair as the baseline.
Yes, and in particular, by Scott saying that 50% predictions are “technically meaningless.”
I confidently reject the Doomsday argument, so it doesn’t have any implications.
I might be confused here, but it seems to me that it’s easy to interpret the arguments in this post as evidence in the wrong direction.
I see the following three questions as relevant:
1. How much is there that sets human brains apart from other brains?
2. How much does the thing that humans have and animals don’t matter?
3. How much does better architecture matter for AI?
Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it’s evidence that architectural changes matter a lot. However, holding #2 constant, #1 and #3 seem negatively correlated – the less stuff there is that makes humans special, the smaller the improvements to architecture that are required to achieve greater performance.
Since this post is arguing primarily about #1, the way it affects #3 is potentially confusing.
Strong upvote from me. This new technology has helped me view the existing content from a different angle.
Is there a reason why it wouldn’t be strongly correlated?
Your “serious” modifier sounds to me like you’re envisioning the consensus among the masses changing while smart people stay more sober. I was largely assuming that, in the worlds where Aubrey’s prediction is true, actual life expectancy does, in fact, increase along with the awareness shift. Note that it’s life expectancy rather than actual life span.
Pensions might be a good pointer.