Nice post! I agree that the conclusion is counterintuitive.
For Metaculus, the results are pretty astonishing: the correlation is negative for all four options, meaning that the higher the range of the question, the lower the Brier score (and therefore, the higher the accuracy)! And the correlation isn't extremely low, either: −0.2 is quite formidable.
I tried to replicate some of your analysis, but I got different results for Metaculus (I still got the negative correlation for PredictionBook, though). I think this might, to an extent, be an artifact of the way you group your forecasts:
In bash, add headers so that I can open the files and see how they look.
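Something along these lines (a sketch; the exact commands aren't shown here, and the header text and column order are assumptions based on the columns the R code below uses):
# Sketch only: column names taken from the R code below, their order is an assumption.
for f in met2.csv pb2.csv; do
  { echo "probability,result,range,id"; cat "$f"; } > tmp && mv tmp "$f"
done
In R: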
library(ggplot2)
## Metaculus
data <- read.csv("met2.csv")
data$brier = (data$result-data$probability)^2
summary(lm(data$brier ~ data$range)) ## Positive correlation.
ggplot(data=data, aes(x=range, y=brier))+
geom_point(size=0.1)
### Normalize the range and the Brier score to get comparable units
data$briernorm = (data$brier - mean(data$brier))/sd(data$brier)
data$rangenorm = (data$range - mean(data$range))/sd(data$range)
summary(lm(data$briernorm ~ data$rangenorm))
### I get a coefficient of ~0.02 on the standardized variables, i.e., a correlation of about 2%.
## Same thing for PredictionBook
data <- read.csv("pb2.csv")
data$brier = (data$result-data$probability)^2
summary(lm(data$brier ~ data$range)) ## Negative correlation.
ggplot(data=data, aes(x=range, y=brier))+
geom_point(size=0.2)
### Normalize the range and the Brier score to get comparable units
data$briernorm = (data$brier - mean(data$brier))/sd(data$brier)
data$rangenorm = (data$range - mean(data$range))/sd(data$range)
summary(lm(data$briernorm ~ data$rangenorm))
### I get a coefficient of ~-0.02 on the standardized variables, i.e., a correlation of about -2%.
Essentially, when you say
To compare the accuracy between forecasts, one can’t deal with individual forecasts, only with sets of forecasts and outcomes. Here, I organise the predictions into buckets according to range.
This doesn’t necessarily follow; you can still calculate a regression between the Brier score and the range (time until resolution) of individual forecasts.
Okay, I finally had some time to look at your feedback.
The problem is, as you said, my attempt to bucket predictions together by range. This removes information and makes my analysis much more complicated than it needs to be.
I thought that bucketing was a good idea because I was not sure how meaningful a Brier score computed from only a single forecast & outcome pair is (I didn’t have a very clear idea of why that should be the case, and didn’t question that intuition).
Let’s say I have my datasets $f_i$ (predictions), $o_i$ (outcomes) and $r_i$ (ranges), $i \in 1..n$.
Then your analysis calculates $\text{cor}((o_i - f_i)^2, r_i)$. I introduced a partition variable $p_j$ ($j \in 1..m$) and calculated $\text{cor}\big((\text{brier}(o_{p_j..p_{j+1}-1}, f_{p_j..p_{j+1}-1}) \mid \forall j \in 1..m),\ (\text{avg}(r_{p_j..p_{j+1}-1}) \mid \forall j \in 1..m)\big)$.
This throws away information: if one makes $p_1 = 1$ and $p_2 = n$, then one gets one Brier score (of all forecasts & outcomes) and the average of all ranges, which results in a correlation of 1 (I haven’t proven that partitioning more coarsely loses information monotonically, but it seems intuitively true to me).
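In R, the bucketed calculation corresponds to something like the following (a minimal sketch, not the code I originally used; f, o and r are the forecast, outcome and range vectors, p the start indices of the buckets):
## Sketch of the bucketed analysis: one Brier score and one average range per bucket.
bucketed_cor <- function(f, o, r, p) {
  m <- length(p)
  briers <- numeric(m)
  avgranges <- numeric(m)
  for (j in 1:m) {
    last <- if (j < m) p[j + 1] - 1 else length(f) ## bucket j covers indices p[j]..p[j+1]-1
    idx <- p[j]:last
    briers[j] <- mean((o[idx] - f[idx])^2)
    avgranges[j] <- mean(r[idx])
  }
  cor(briers, avgranges)
}
The coarser the partition p, the fewer points are left to correlate.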
If I repeat your analysis, I get the results you got.
Basically, I believe my text lacks internal validity, but still has construct validity.
Starting from here, I will probably rewrite large parts of the text (and the code, maybe even in a more understandable language) and apply your analysis by removing the bucketing of data.
Cool. Once you rewrite that, and if you do so before the end of the year, I’d encourage you to resubmit it to this contest.
In particular, the reason I’m excited about this kind of work is that it allows us to have at least some information about how accurate long-term predictions can be. Some previous work on this has been done, e.g., rating Kurzweil’s predictions from the 90s, but overall we have very little information about this kind of thing. And yet we are interested in seeing how good we can be at making predictions n years out, and potentially in making decisions based on that.
Another interesting thing you can do is to calculate the accuracy score (the Brier score minus the average of the Brier scores for that question), which adjusts for question difficulty. You gesture at this in your “Accuracy between questions” section.
If you do this, forecasts made further from the resolution time do worse, both in PredictionBook and in Metaculus (the correlation is statistically significant at p < 0.001, but very small). Code in R:
datapre <- read.csv("pb2.csv") ## or met2.csv
data <- datapre[datapre$range>0,]
data$brier = (data$result-data$probability)^2
accuracyscores = c() ## Lower is better, much like the Brier score.
ranges = c()
for(id in unique(data$id)){
  predictions4question = (data$id == id)
  briers4question = data$brier[predictions4question]
  accuracyscores4question = briers4question - mean(briers4question)
  ranges4question = data$range[predictions4question]
  accuracyscores = c(accuracyscores, accuracyscores4question)
  ranges = c(ranges, ranges4question)
}
summary(lm(accuracyscores ~ ranges))
Another interesting thing you can do with the data is to calculate the prior probability that a Metaculus or PB question will resolve positively:
For Metaculus this is 0.3160874; for PB, 0.3770311.
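One way to get these numbers (a sketch; whether to count one resolution per question id or to weight by individual predictions is a judgment call, so the exact values may differ):
data <- read.csv("met2.csv") ## or pb2.csv
resolutions <- tapply(data$result, data$id, function(x) x[1]) ## one resolution per question id
mean(resolutions) ## share of questions that resolved positively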