How to evaluate (50%) predictions

Rafael Harth, 10 Apr 2020 17:12 UTC

I commonly hear (sometimes from very smart people) that 50% predictions are meaningless. I think that this is wrong, and also that saying it hints at the lack of a coherent principle by which to evaluate whether or not a set of predictions is meaningful or impressive. Here is my attempt at describing such a principle.

What are predictions?

Consider the space of all possible futures:

If you make a prediction, you do this:

You carve out a region of the future space and declare that it occurs with some given percentage. When it comes to evaluating the prediction, the future has arrived at a particular point within the space, and it should be possible to assess whether that point lies inside or outside of the region. If it lies inside, the prediction came true; if it lies outside, the prediction turned out false. If it’s difficult to see whether it’s inside or outside, the prediction was ambiguous.

Now consider the following two predictions:

A coin I flip comes up heads (50%)
Tesla’s stock price at the end of the year 2020 is between 512$ and 514$ (50%)

Both predictions have 50% confidence, and both divide the future space into two parts (as all predictions do). Suppose both predictions come true. No sane person would look at them and be equally impressed. This demonstrates that confidence and truth value are not sufficient to evaluate how impressive a prediction is. Instead, we need a different property that somehow measures ‘impressiveness’. Suppose for simplicity that there is some kind of baseline probability that reflects the common knowledge about the problem. If we represent this baseline probability by the size of the areas, then the coin flip prediction can be visualized like so:

And the Tesla prediction like so:

The coin flip prediction is unimpressive because it assigns 50% to a subset of future space whose baseline probability is also 50%. Conversely, the Tesla prediction is impressive because it assigns 50% to a subset of future space with a tiny baseline probability. Thus, the missing property is the “boldness” of the prediction, i.e., the (relative) difference between the stated confidence and the baseline probability.

Importantly, note that we can play the same game at every percentage point, e.g.:

A number I randomize on random.org falls between 15 and 94 – 80%

Even though this is an 80% prediction, it is still unimpressive because there is no difference between the stated confidence and the baseline probability.
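To make the comparison concrete, here is a minimal sketch of boldness for the three examples so far. Several things in it are assumptions rather than anything stated above: the function name `boldness`, the choice of a plain difference (the text only says "(relative) difference"), a random.org range of 1 to 100, and a stand-in baseline of 1% for the Tesla range.

```python
def boldness(confidence: float, baseline: float) -> float:
    # Plain difference between stated confidence and baseline probability.
    # The exact form is an assumption made for illustration.
    return confidence - baseline


# Coin flip: the baseline is 50% by symmetry.
coin = boldness(0.50, 0.50)

# random.org draw, ASSUMING a uniform integer from 1 to 100
# (the range is not stated in the text).
favourable = len(range(15, 95))  # integers 15..94 inclusive
random_org_baseline = favourable / 100
random_org = boldness(0.80, random_org_baseline)

# Tesla: 0.01 is a made-up stand-in for "a tiny baseline probability".
tesla = boldness(0.50, 0.01)

print(coin, random_org, tesla)
```

Under these assumptions, the coin flip and the random.org prediction both have zero boldness despite their different percentages, while the Tesla prediction is almost as bold as a 50% prediction can be.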

What’s special about 50%?

In January, Kelsey Piper predicted that Joe Biden would be the Democratic Nominee with 60% confidence. If this prediction seems impressive now, we can probably agree that this is not so because it’s 60% rather than 50%. Instead, it’s because most of us would have put it much lower than even 50%. For example, BetFair gave him only ~15% back in March.

So we have one example where a 50% prediction would have been impressive and another (the random.org one) where an 80% prediction is thoroughly unimpressive. This shows that the percentage being 50% is neither necessary nor sufficient for a prediction being unimpressive. Why, then, do people say stuff like “50% predictions aren’t meaningful?”

Well, another thing they say is, “you could have phrased the predictions the other way.” But there are reasons to object to that. Consider the Tesla prediction:

Tesla’s stock price at the end of the year 2020 is between 512$ and 514$ (50%)

As-is, this is very impressive (if it comes true). But now suppose that, instead of phrasing it in this way, we first flip a coin. If the coin comes up heads, we flip the prediction, i.e.:

Tesla’s stock price at the end of the year 2020 is below 512$ or above 514$ (50%)

Whereas, if it comes up tails, we leave the prediction unchanged.

What is the probability that we are correct from the point of view we have before the flip? Well, at some point it will be possible to evaluate the prediction, and then we will either be at a point outside of the small blob or at a point inside of the small blob. In the first case, we are correct if we flipped the prediction (left picture). In the latter case, we are correct if we didn’t flip the prediction (right picture). In other words, in the first case, we have a 50% chance of being correct, and in the latter case, we also have a 50% chance of being correct. Formally, for any probability $p$ that the future lands in the small blob, the chance for our prediction to be correct is exactly

$(1 - p) \cdot \frac{1}{2} + p \cdot \frac{1}{2} = \frac{1}{2}$

Importantly, notice that this remains true regardless of how the original prediction divides the future space. The division just changes $p$ , but the above yields $\frac{1}{2}$ for every value of $p$ .

Thus, given an arbitrary prediction, if we flip a coin, flip the prediction iff the coin came up heads and leave it otherwise, we have successfully constructed a perfect 50% prediction.

Note: if the coin flip thing seems fishy (you might object that, in the Tesla example, we either end up with an overconfident prediction or an underconfident prediction, and they can’t somehow add up to a 50% prediction), you can alternatively think of a set of predictions where we randomly flip half of them. In this case, there’s no coin involved, and the effect is the same: half of all predictions will come true (in expectation) regardless of their original probabilities. Feel free to re-frame every future mention of coin flips in this way.
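The set-based version of the trick can be checked with a small simulation. This is only a sketch: the function name, the sample size, and the seed are arbitrary choices, and each original prediction is assumed to come true independently with the same probability `p`.

```python
import random


def fraction_true_after_random_flips(p: float, n: int, seed: int = 0) -> float:
    """Simulate n predictions that are each true with probability p,
    flip the statements of a randomly chosen half, and return the
    fraction of resulting statements that are true."""
    rng = random.Random(seed)
    # Whether each ORIGINAL statement turns out true.
    outcomes = [rng.random() < p for _ in range(n)]
    # Pick exactly half of the predictions and negate their statements.
    flipped = set(rng.sample(range(n), n // 2))
    correct = sum((not o) if i in flipped else o for i, o in enumerate(outcomes))
    return correct / n


for p in (0.01, 0.5, 0.8):
    print(p, fraction_true_after_random_flips(p, 100_000))
```

Whatever value of `p` is passed in, the printed fraction should land close to one half, which is the sense in which the flipped set is a set of "perfect" 50% predictions.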

This trick is not restricted to 50% predictions, though. To illustrate how it works for other percentage points, suppose we are given a prediction which we know has an 80% probability of coming true. First off, there are three simple things we can do, namely

leave it unchanged for a perfect 80% prediction
flip it for a perfect 20% prediction
do the coin flip thing from above to turn it into a perfect 50% prediction. Importantly, note that we would only flip the prediction statement, not the stated confidence.

(Again, if you object to the coin flip thing, think of two 80% predictions where we randomly choose one and flip it.)

In the third case, the formula

$(1 - p) \cdot \frac{1}{2} + p \cdot \frac{1}{2} = \frac{1}{2}$

from above becomes

$0.2 \cdot \frac{1}{2} + 0.8 \cdot \frac{1}{2} = \frac{1}{2}$

This is possible no matter what the original probability is; it doesn’t have to be 80%.

Getting slightly more mathy now, we can also throw a biased coin that comes up heads with probability $q \neq \frac{1}{2}$ and, again, flip the prediction iff that biased coin came up heads. (You can still replace the coin flip; if $q = \frac{1}{3}$ , think of flipping every third prediction in a set.) In that case, the probability of our prediction coming true is

$0.2 \cdot q + 0.8 \cdot (1 - q)$

This term takes values in the interval $[0.2, 0.8]$ . Here’s the graph:

Thus, by flipping our prediction with some probability other than $\frac{1}{2}$ , we can obtain every probability within $[0.2, 0.8]$ . In particular, we can transform an 80% probability into a 20% probability, a 30% probability, a 60% probability, a 78.3% probability, but we cannot make it an 83% or a 13% probability.

Finally, the formula with a variable prior probability and a variable flip chance is $(1 - p) q + p (1 - q)$ , and its graph looks like this:

If you fix $p$ , you’ll notice that, by changing $q$ , you get the $y$ -value fluctuating between $p$ and $1 - p$ . For $q = \frac{1}{2}$ , the $y$ -value is a constant at $\frac{1}{2}$ . (When I say $y$ -value, I mean the result of the formula which corresponds to the height in the above picture.)
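Since the pictures may not be available, the same behavior can be read off a few numbers. The sketch below just evaluates the formula $(1 - p) q + p (1 - q)$ ; the function name `truth_probability` is mine.

```python
def truth_probability(p: float, q: float) -> float:
    """Probability that the final statement is true, where the original
    statement is true with probability p and is flipped with probability q."""
    return (1 - p) * q + p * (1 - q)


p = 0.8
for k in range(11):
    q = k / 10
    print(f"q = {q:.1f} -> {truth_probability(p, q):.2f}")
```

For a fixed `p`, the result moves linearly from `p` at `q = 0` to `1 - p` at `q = 1`, and passes through one half at `q = 0.5`.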

So it is always possible to invert a given probability or to push it toward 50% by arbitrarily introducing uncertainty (this is sort of like throwing information away). On the other hand, it is never possible to pull it further away from 50% (you cannot create new information). If the current probability is known, we can obtain any probability we want (between $p$ and $1 - p$ ); if not, we don’t know how the graph looks/where we are on the 3d graph. In that case, the only probability we can safely target is 50% because flipping with $\frac{1}{2}$ probability (aka flipping every other prediction in a set) turns every prior probability into 50%.
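The claim that flipping can never pull a probability further away from 50% follows from the formula in a couple of lines. Writing $y$ for the resulting probability, as before:

$y - \frac{1}{2} = (1 - p) q + p (1 - q) - \frac{1}{2} = \left(p - \frac{1}{2}\right)(1 - 2q)$

Since $|1 - 2q| \leq 1$ for every $q \in [0, 1]$ , this gives

$\left|y - \frac{1}{2}\right| = \left|p - \frac{1}{2}\right| \cdot |1 - 2q| \leq \left|p - \frac{1}{2}\right|$

For $p \neq \frac{1}{2}$ , equality holds only at $q = 0$ (never flip) and $q = 1$ (always flip), and $q = \frac{1}{2}$ makes the right-hand factor vanish, which is the 50% case.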

And this, I would argue, is the only thing that is special about 50%. And it doesn’t mean 50% predictions are inherently meaningless; it just means that cheating is easier – or, to be more precise, cheating is possible without knowing the prior probability. (Another thing that makes 50% seem special is that it’s sometimes considered a universal baseline, but this is misguided.)

As an example, suppose we are given 120 predictions, each of which has an 80% probability of being correct. If we choose 20 of them at random and flip those, 70% of all predictions will come true in expectation. This number is obtained by solving $0.2 q + 0.8 (1 - q) = 0.7$ for $q$ ; this yields $q = \frac{1}{6}$ , so we need to flip one out of every six predictions.
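The same calculation works for any known prior probability and any reachable target. Solving $(1 - p) q + p (1 - q) = t$ for $q$ gives $q = \frac{p - t}{2p - 1}$ ; the sketch below wraps that in a helper (the name `flip_fraction` and the error handling are my own additions).

```python
def flip_fraction(p: float, target: float) -> float:
    """Fraction q of predictions to flip so that predictions which are
    true with probability p end up true with probability `target`."""
    if p == 0.5:
        # At p = 0.5 flipping changes nothing, so no other target is reachable.
        raise ValueError("p = 0.5 cannot be moved by flipping")
    q = (p - target) / (2 * p - 1)
    if not 0.0 <= q <= 1.0:
        raise ValueError("target must lie between p and 1 - p")
    return q


q = flip_fraction(0.8, 0.7)
to_flip = round(120 * q)
print(q, to_flip)
```

Asking for a target outside the reachable range, such as `flip_fraction(0.8, 0.9)`, raises an error, which mirrors the point that flipping cannot create information.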

What’s the proper way to phrase predictions?

Here is a simple rule that shuts the door to this kind of “cheating”:

Always phrase predictions such that the confidence is above the baseline probability.

Thus, you should predict

Joe Biden will be the Democratic nominee (60%)

rather than

Joe Biden will not be the Democratic nominee (40%)

because 60% is surprisingly high for this prediction, and similarly

The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (20%)

rather than

The price of a barrel of oil at the end of 2020 will not be between $50.95 and $51.02 (80%)

because 20% is surprisingly high for this prediction. The 50% mark isn’t important; what matters is the confidence of the prediction relative to the baseline/common wisdom.

This rule prevents you from cheating because it doesn’t allow flipping predictions. In reality, there is no universally accessible baseline, so there is no formal way to detect this. But that doesn’t mean you won’t notice. The list:

The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (50%)
Tesla’s stock price at the end of the year 2020 is between 512$ and 514$ (50%)
$\dots$ (more extremely narrow 50% predictions)

which follows the rule looks very different from this list (where half of all predictions are flipped):

The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (50%)
Tesla’s stock price at the end of the year 2020 is below 512$ or above 514$ (50%)
$\dots$ (more extremely narrow 50% predictions where every other one is flipped)

and I would be much more impressed if the first list has about half of its predictions come true than if the second list manages the same.

Other than preventing cheating, there is also a more fundamental reason to follow this rule. Consider what happens when you make and evaluate a swath of predictions. The common way to do this is to group them into a couple of specific percentage points (such as 50%, 60%, 70%, 80%, 95%, 99%) and then evaluate each group separately. To do this, we would look at all predictions in the 70% group, count how many have come true, and compare that number to the optimum, which is $0.7 \cdot (\text{number of predictions in that group})$ .

Now think of such a prediction like this:

Namely, there is a baseline probability (blue pie, ~60%) and a stated confidence (green pie, 70%). When we add such a prediction to our 70% group, we can think of that like so:

We accumulate a confidence pile (green) that measures how many predictions we claim will come true, and a common wisdom pile (blue) that measures how many predictions ought to come true according to common wisdom. After the first prediction, the confidence pile says, “0.7 predictions will come true,” whereas the common wisdom pile says, “0.6 predictions will come true.”

Now we add the second (70% confidence, ~45% common wisdom):

At this point, the confidence pile says, “1.4 predictions will come true,” whereas the common wisdom pile says, “1.05 predictions will come true.”
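The bookkeeping for the two piles is just two running sums. A minimal sketch, using the two predictions described so far (the tuple layout and the function name `piles` are illustrative choices):

```python
def piles(predictions):
    """Return (confidence pile, common wisdom pile) for a list of
    (stated confidence, baseline probability) pairs."""
    confidence_pile = sum(confidence for confidence, _ in predictions)
    wisdom_pile = sum(baseline for _, baseline in predictions)
    return confidence_pile, wisdom_pile


group_70 = [
    (0.70, 0.60),  # first prediction: 70% confidence, ~60% common wisdom
    (0.70, 0.45),  # second prediction: 70% confidence, ~45% common wisdom
]
confidence_pile, wisdom_pile = piles(group_70)
print(confidence_pile, wisdom_pile)
```

Appending further pairs to `group_70` grows both piles in the same way.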

If we keep doing this for all 70% predictions, we eventually end up with two large piles:

The confidence pile may say, “70 predictions will come true,” whereas the common wisdom pile may say, “48.7 predictions will come true.”

Then (once predictions can be evaluated) comes a third pile, the reality pile:

The reality pile counts how many predictions did, in fact, come true. Now consider what this result means. We’ve made lots of predictions at 70% confidence for which common wisdom consistently assigns lower probabilities. In the end, (slightly more than) 70% of them came true. This means we have systematically beaten common wisdom. This ought to be impressive.

One way to think about this is that the difference between the confidence and common wisdom piles is a measure for the boldness of the entire set of predictions. Then, the rule that [each prediction be phrased in such a way that the confidence is above the baseline probability] is equivalent to choosing one of two ways that maximize this boldness. (The other way would be to invert the rule.)

If the rule is violated, the group of 70% predictions might yield a confidence pile of a height similar to that of the common wisdom pile. Then, seeing that the reality pile matches them is much less impressive. To illustrate this, let’s return to the example from above. In both cases, assume exactly one of the two predictions comes true.

Following the rule:

The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (50%)
Tesla’s stock price at the end of the year 2020 will be between 512$ and 514$ (50%)

Bold, therefore impressive.

Violating the rule:

The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (50%)
Tesla’s stock price at the end of the year 2020 will be below 512$ or above 514$ (50%)

Not bold at all, therefore unimpressive. And that would be the reason to object to the claim that you could just phrase 50% predictions in the opposite way.
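Putting numbers on the two lists makes the contrast visible. The sketch below assumes, purely for illustration, that common wisdom gives each narrow price range a 1% chance; that figure and the helper name `boldness_gap` are not from the text above.

```python
def boldness_gap(predictions):
    """Confidence pile minus common wisdom pile for a list of
    (stated confidence, baseline probability) pairs."""
    confidence_pile = sum(confidence for confidence, _ in predictions)
    wisdom_pile = sum(baseline for _, baseline in predictions)
    return confidence_pile - wisdom_pile


NARROW = 0.01  # ASSUMED baseline for each narrow price range

# Both statements are the narrow ranges, each stated at 50%.
following = [(0.50, NARROW), (0.50, NARROW)]
# The Tesla statement is flipped, so its baseline becomes 1 - NARROW.
violating = [(0.50, NARROW), (0.50, 1 - NARROW)]

print(boldness_gap(following), boldness_gap(violating))
```

With these assumed baselines, the list that follows the rule has a gap of roughly one full prediction between the two piles, while the list that violates it has essentially none, so one out of two coming true tells us nothing in the second case.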

Note that the 50% group is special insofar as predictions don’t change groups when you rephrase them, but the principle nonetheless applies to other percentage points.

Summary/Musings

According to this model, when you make predictions, you should follow the confidence $>$ baseline rule; and when you evaluate predictions, you should

estimate their boldness (separately for each group at a particular percentage point)
be impressed according to the product of calibration $\cdot$ boldness (where calibration is how closely the reality pile matches the confidence pile, which is what people commonly focus on)
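One possible way to turn these two steps into a number is sketched below. Everything about the exact formulas is an assumption: boldness is taken to be the per-prediction gap between the confidence pile and the common wisdom pile, and calibration is taken to be one minus the per-prediction gap between the reality pile and the confidence pile. Other definitions would fit the description equally well.

```python
def impressiveness(predictions, outcomes):
    """Score one confidence group.

    predictions: list of (stated confidence, baseline probability) pairs
    outcomes:    list of booleans, True if the prediction came true
    """
    n = len(predictions)
    confidence_pile = sum(confidence for confidence, _ in predictions)
    wisdom_pile = sum(baseline for _, baseline in predictions)
    reality_pile = sum(outcomes)
    # ASSUMED definitions, chosen only for illustration:
    boldness = (confidence_pile - wisdom_pile) / n
    calibration = 1 - abs(reality_pile - confidence_pile) / n
    return calibration * boldness


outcomes = [True] * 7 + [False] * 3          # 7 of 10 came true
bold_group = [(0.70, 0.50)] * 10             # baselines are made up
timid_group = [(0.70, 0.70)] * 10            # confidence equals baseline

print(impressiveness(bold_group, outcomes), impressiveness(timid_group, outcomes))
```

Both groups are equally well calibrated here, yet only the first scores above zero, because the second never departs from its (made-up) baselines.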

Boldness is not formal because we don’t have universally accessible baseline probabilities for all statements lying around (50% is a non-starter), and I think that’s the primary reason why this topic is confusing. However, baselines are essential for evaluation, so it’s much better to make up your own baselines and use those than to use a model that ignores baselines (which can give absurd results). It does mean that the impressiveness of predictions has an inherent subjective component, but this strikes me as a fairly intuitive conclusion.

In practice, I think people naturally follow the rule to some extent – they tend to predict things they’re interested in and then overestimate their probability – but certainly not perfectly. The rule also implies that one should have separate groups for 70% and 30% predictions, which is currently not common practice.
