## Prediction and Calibration—Part 1

Scott Alexander is a darling of the Bayesian rationalist community. He has a lot more epistemic humility than most, despite being an impressively well-calibrated predictor.

In this series we will try to achieve 2 things:

1. (this post) We try to understand what a likelihood function is, and use it to evaluate predictions.
2. (next post) We make a Bayesian calibration model, and get an uncertainty estimate over our calibration.

## The likelihood function

Let’s first look at Bayes’ theorem:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

In common parlance, the 4 parts of Bayes’ theorem are called:

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{data}}$$

What we want is our posterior: the probability of some model parameters (often θ) given some data (y). We construct a model from two things: a prior, which describes what we believe before seeing the data, and a likelihood function (p(y∣θ)), which scores the data given a model (θ, drawn from the prior).

The simplest and most relevant likelihood function is the Bernoulli:

$$p(y \mid \theta) = \theta^{y} (1 - \theta)^{1 - y}$$

Here y is 1 when the predicted event occurs (our prediction turns out to be correct) and 0 otherwise, and θ represents our model. For now, our model is just ‘what Scott predicted.’

As an example, let’s take a prediction of θ=0.6. If the prediction turns out to be true (y=1), then the Bernoulli likelihood function is equal to 0.6:

$$p(y = 1 \mid \theta = 0.6) = \theta^{y} (1 - \theta)^{1 - y} = 0.6^{1} (1 - 0.6)^{1 - 1} = 0.6^{1} \times 0.4^{0} = 0.6$$

And if the prediction turned out wrong (y=0), then:

$$p(y = 0 \mid \theta = 0.6) = \theta^{y} (1 - \theta)^{1 - y} = 0.6^{0} (1 - 0.6)^{1 - 0} = 0.6^{0} \times 0.4^{1} = 0.4$$

The likelihood function says that there was a 40% chance you were wrong, which is the same as predicting with 40% confidence that the event does not occur.
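The two worked cases above can be reproduced with a small helper. This is only a minimal sketch: the function name `bernoulli_likelihood` is an illustrative choice and does not come from the original post.

```python
def bernoulli_likelihood(theta: float, y: int) -> float:
    """Bernoulli likelihood p(y | theta) = theta**y * (1 - theta)**(1 - y)."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return theta**y * (1.0 - theta) ** (1 - y)


# Score the prediction theta = 0.6 against both possible outcomes.
p_correct = bernoulli_likelihood(0.6, 1)  # the event occurred
p_wrong = bernoulli_likelihood(0.6, 0)    # the event did not occur
print(p_correct, p_wrong)
```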

If a person makes 3 predictions θ=[0.6,0.6,0.7] and the outcomes were y=[1,0,1], then the likelihood of all 3 observations is simply the product of the 3 Bernoulli likelihoods:

$$p(y \mid \theta) = \prod_{i=1}^{3} p(y_i \mid \theta_i) = 0.6 \times (1 - 0.6) \times 0.7 = 0.168$$

Better predictions yield a higher likelihood.

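It can be useful to divide by the likelihood of the null predictor, which predicts everything with 50%, to compare against random performance:

$$p(y \mid \theta = 0.5) = \prod_{i=1}^{N} p(y_i \mid \theta_i = 0.5) = 0.5^{N}$$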

So the 3 predictions above are 0.168 / 0.5³ ≈ 1.34 times more likely than random, making this person slightly better than random.
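The same arithmetic can be checked in a few lines of Python. This is a sketch under the assumption that predictions and outcomes are given as two equal-length lists; the helper name `likelihood` is illustrative.

```python
import math


def likelihood(thetas, outcomes):
    """Product of Bernoulli likelihoods over paired predictions and outcomes."""
    return math.prod(t if y == 1 else 1.0 - t for t, y in zip(thetas, outcomes))


thetas = [0.6, 0.6, 0.7]
outcomes = [1, 0, 1]

joint = likelihood(thetas, outcomes)                # likelihood of the predictions
null = likelihood([0.5] * len(outcomes), outcomes)  # null predictor, 0.5**N
ratio = joint / null
print(f"likelihood={joint:.3f}, null={null:.3f}, ratio={ratio:.2f}")
```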

## How good a predictor is Scott?

Because Scott has made a lot of predictions, and because we will later implement a ‘calibration’ model of Scott, let’s compare the likelihood of his 2019 predictions with the null model, which predicts everything with 50% (and so implicitly also predicts that it doesn’t happen with 50%).

First we import numeric and scientific Python libraries.

Then we code Scott Alexander’s 2019 predictions as [Guess, Outcome].

Because Outcome is what we want to predict, we put that in the y variable, and put Guess in the predictor variable x.

The person who made 3 predictions and got 2 correct was slightly better than random. How much better than random is Scott?

Let’s take the product of all his predictions.
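A sketch of that computation, under stated assumptions: Scott's actual 2019 predictions are not reproduced here, so the `data` array below is a placeholder that reuses the three-prediction example from earlier, and the names `data`, `x`, `y` and `log_likelihood_ratio` are illustrative. The sum is taken in log space because a product of many probabilities can underflow.

```python
import numpy as np

# Placeholder rows of [Guess, Outcome]; these are NOT Scott's real predictions.
data = np.array([
    [0.6, 1],
    [0.6, 0],
    [0.7, 1],
])
x = data[:, 0]  # Guess: the stated probability
y = data[:, 1]  # Outcome: 1 if the event occurred, else 0


def log_likelihood_ratio(guess, outcome):
    """Log of (likelihood of the guesses) / (likelihood of the 50% null)."""
    log_lik = np.sum(outcome * np.log(guess) + (1 - outcome) * np.log(1 - guess))
    log_null = len(outcome) * np.log(0.5)
    return log_lik - log_null


ratio = np.exp(log_likelihood_ratio(x, y))
print(f"{ratio:.2f} times more likely than the null predictor")
```

Guesses of exactly 0 or 1 would need special handling in this sketch, since the logarithm of zero is not finite.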

So 7 billion times more likely! There are two reasons why this number is so large: 1) Scott made a lot of predictions and 2) Scott is a very good predictor. It is easy to become a better predictor than Scott if you simply make a lot of predictions about things that are easy to predict. The hard part is being as well-calibrated as Scott.

## Prediction vs Calibration

Predictor:

A good predictor is a person who predicts better than random:

$$\prod p(y \mid \theta) \gg 0.5^{N}$$

A bad predictor is a person who predicts close to random:

$$\prod p(y \mid \theta) \approx 0.5^{N}$$

A terrible predictor is one who is worse than random:

$$\prod p(y \mid \theta) < 0.5^{N}$$

It may be hard to understand how you can be worse than random, and that of course takes skill, but if Scott had flipped all his guesses (predicting 1−θ wherever he predicted θ), his likelihood ratio would be at most 1/(7×10⁹), which is much less than 1.
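To see concretely how flipping hurts, here is a minimal sketch that reuses the three-prediction example from earlier rather than Scott's data. The helper name `ratio_vs_null` is illustrative, and flipping a guess is taken to mean replacing θ with 1 − θ.

```python
import math


def ratio_vs_null(thetas, outcomes):
    """Likelihood of the predictions divided by the 50% null likelihood."""
    lik = math.prod(t if y == 1 else 1.0 - t for t, y in zip(thetas, outcomes))
    return lik / 0.5 ** len(outcomes)


thetas = [0.6, 0.6, 0.7]
outcomes = [1, 0, 1]

original = ratio_vs_null(thetas, outcomes)
flipped = ratio_vs_null([1.0 - t for t in thetas], outcomes)
print(f"original={original:.3f}, flipped={flipped:.3f}")
```

In this small example the flipped predictor drops below 1, so it is worse than the null predictor, and its ratio is also smaller than the reciprocal of the original ratio.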

Now that we all agree that Scott is a good predictor, we can finally introduce what we want to talk about: How well-calibrated is Scott and how do we measure that?

Calibrated:

A well-calibrated predictor makes predictions that match the outcome frequency.

Example

Person A predicts 100 things with 60% confidence, and 61 of them turn out to occur. Because 61/100 ≈ 0.6, this person is very well-calibrated.

Person B predicts 100 things with 80% confidence, and 67 of them turn out to occur. Because 67/100 = 0.67 is well below 0.8, this person is not very well-calibrated.
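As a quick check of the two examples, one can compare the stated confidence with the observed frequency. This is a minimal sketch; `calibration_gap` is an illustrative helper, not a metric defined in the post.

```python
def calibration_gap(confidence: float, n_occurred: int, n_total: int) -> float:
    """Observed frequency minus stated confidence (0 means perfectly calibrated)."""
    return n_occurred / n_total - confidence


gap_a = calibration_gap(0.6, 61, 100)  # Person A
gap_b = calibration_gap(0.8, 67, 100)  # Person B
print(f"A: {gap_a:+.2f}, B: {gap_b:+.2f}")
```

A negative gap means the events occurred less often than the stated confidence, which is to say the predictor was overconfident.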

Because 67 > 61, is Person B the better predictor, even though they’re not as well-calibrated? Let’s evaluate the likelihood of their claims.

Person A’s prediction is equivalent to 61 ‘correct’ 60% predictions and 39 ‘correct’ 40% predictions, yielding the following likelihood:

$$0.6^{61} \times 0.4^{39} \approx 8.86 \times 10^{-30}$$

Person B’s prediction is equivalent to 67 ‘correct’ 80% predictions and 33 ‘correct’ 20% predictions, yielding the following likelihood:

$$0.8^{67} \times 0.2^{33} \approx 2.76 \times 10^{-30}$$

Because 8.86×10⁻³⁰ > 2.76×10⁻³⁰, Person A is also a slightly better predictor than Person B. To understand why, let’s consider Person C:

Person C predicts 100 things with 100% confidence and 99 of them turn out to occur. Thus, he will spend an eternity in Probability hell for assigning 0% probability to something that actually occurred. This is also reflected in the likelihood of his predictions, which is zero:

$$1^{99} \times 0^{1} = 0$$
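The three likelihoods above can be checked numerically. A minimal sketch, assuming each person makes all 100 predictions at a single confidence level; the helper name `grouped_likelihood` is illustrative.

```python
def grouped_likelihood(confidence: float, n_occurred: int, n_total: int) -> float:
    """Likelihood of n_total same-confidence predictions, n_occurred of which came true."""
    return confidence**n_occurred * (1.0 - confidence) ** (n_total - n_occurred)


lik_a = grouped_likelihood(0.6, 61, 100)  # Person A
lik_b = grouped_likelihood(0.8, 67, 100)  # Person B
lik_c = grouped_likelihood(1.0, 99, 100)  # Person C
print(f"A={lik_a:.2e}, B={lik_b:.2e}, C={lik_c}")
```

A single miss at 100% confidence contributes a factor of zero, which wipes out the other 99 correct predictions.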

As renato points out in the comments, the likelihood tracks a combination of how many you got right and how well calibrated you are. Thus, for your predictions to get more likely, you can either “git good” or “get calibrated”, where getting calibrated seems like the more achievable goal. In the next post we will make a model that tracks calibration independently of prediction. This post is a teaser to introduce the necessary concepts for non-statisticians.

Summary so far

We can improve the likelihood of our predictions by being both well-calibrated and very knowledgeable. The next post in this series will focus on measuring calibration.

How good a predictor you are can be evaluated by the product of the likelihoods of your predictions. Is there a better way to evaluate this? Yes, make a model!

We can also make a model to find out how well-calibrated we are. That is what we will explore in the next post.