Nice post! I agree that the conclusion is counterintuitive.
For Metaculus, the results are pretty astonishing: the correlation is negative for all four options, meaning that the higher the range of the question, the lower the Brier score (and therefore, the higher the accuracy)! And the correlation isn't extremely low, either: −0.2 is quite formidable.
I tried to replicate some of your analysis, but I got different results for Metaculus (I still got the negative correlation for PredictionBook, though). I think this might, to an extent, be an artifact of the way you group your forecasts:
In bash, add headers so that I can open the files and see how they look.
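Something along these lines (a sketch; the exact commands aren't shown here, and the header text and column order are assumptions based on the columns the R code below uses):
# Sketch only: column names taken from the R code below, their order is an assumption.
for f in met2.csv pb2.csv; do
  { echo "probability,result,range,id"; cat "$f"; } > tmp && mv tmp "$f"
done
In R: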
library(ggplot2)
## Metaculus
data <- read.csv("met2.csv")
data$brier = (data$result-data$probability)^2
summary(lm(data$brier ~ data$range)) ## Positive correlation.
ggplot(data=data, aes(x=range, y=brier))+
geom_point(size=0.1)
### Normalize the range and the Brier score to get comparable units
data$briernorm = (data$brier - mean(data$brier))/sd(data$brier)
data$rangenorm = (data$range - mean(data$range))/sd(data$range)
summary(lm(data$briernorm ~ data$rangenorm))
### I get a coefficient of ~0.02 on the standardized variables, i.e., a correlation of about 2%.
## Same thing for PredictionBook
data <- read.csv("pb2.csv")
data$brier = (data$result-data$probability)^2
summary(lm(data$brier ~ data$range)) ## Negative correlation.
ggplot(data=data, aes(x=range, y=brier))+
geom_point(size=0.2)
### Normalize the range and the Brier score to get comparable units
data$briernorm = (data$brier - mean(data$brier))/sd(data$brier)
data$rangenorm = (data$range - mean(data$range))/sd(data$range)
summary(lm(data$briernorm ~ data$rangenorm))
### I get a coefficient of ~-0.02 on the standardized variables, i.e., a correlation of about -2%.
Essentially, when you say
To compare the accuracy between forecasts, one can’t deal with individual forecasts, only with sets of forecasts and outcomes. Here, I organise the predictions into buckets according to range.
This doesn’t necessarily follow; you can still calculate a regression between the Brier score and the range (time until resolution) of individual forecasts.
Okay, I finally had some time to look at your feedback.
The problem is, as you said, my attempt to bucket predictions together by range. This removes information and makes my analysis much more complicated than it needs to be.
I thought that bucketing was a good idea because I was not sure how meaningful a Brier score computed from only a single forecast & outcome pair is (I didn’t have a very clear idea of why that should be the case, and didn’t question that intuition).
Let’s say I have my datasets $f_i$ (predictions), $o_i$ (outcomes) and $r_i$ (ranges), $i \in 1..n$.
Then your analysis calculates $\text{cor}((o_i - f_i)^2, r_i)$. I introduced a partition variable $p_j$ ($j \in 1..m$) and calculated $\text{cor}\big((\text{brier}(o_{p_j..p_{j+1}-1}, f_{p_j..p_{j+1}-1}) \mid \forall j \in 1..m),\ (\text{avg}(r_{p_j..p_{j+1}-1}) \mid \forall j \in 1..m)\big)$.
This throws away information: if one makes $p_1 = 1$ and $p_2 = n$, then one gets one Brier score (of all forecasts & outcomes) and the average of all ranges, which results in a correlation of 1 (I haven’t proven that partitioning more coarsely loses information monotonically, but it seems intuitively true to me).
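In R, the bucketed calculation corresponds to something like the following (a minimal sketch, not the code I originally used; f, o and r are the forecast, outcome and range vectors, p the start indices of the buckets):
## Sketch of the bucketed analysis: one Brier score and one average range per bucket.
bucketed_cor <- function(f, o, r, p) {
  m <- length(p)
  briers <- numeric(m)
  avgranges <- numeric(m)
  for (j in 1:m) {
    last <- if (j < m) p[j + 1] - 1 else length(f) ## bucket j covers indices p[j]..p[j+1]-1
    idx <- p[j]:last
    briers[j] <- mean((o[idx] - f[idx])^2)
    avgranges[j] <- mean(r[idx])
  }
  cor(briers, avgranges)
}
The coarser the partition p, the fewer points are left to correlate.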
If I repeat your analysis, I get the results you got.
Basically, I believe my text lacks internal validity, but still has construct validity.
Starting from here, I will probably rewrite large parts of the text (and the code, maybe even in a more understandable language) and apply your analysis by removing the bucketing of data.
Cool. Once you rewrite that, and if you do so before the end of the year, I’d encourage you to resubmit it to this contest.
In particular, the reason I’m excited about this kind of work is that it allows us to have at least some information about how accurate long-term predictions can be. Some previous work on this has been done, e.g., rating Kurzweil’s predictions from the 90s, but overall we have very little information about this kind of thing. And yet we are interested in seeing how good we can be at making predictions n years out, and potentially in making decisions based on that.
Another interesting thing you can do is to calculate the accuracy score (the Brier score minus the average of the Brier scores for that question), which adjusts for question difficulty. You gesture at this in your “Accuracy between questions” section.
If you do this, forecasts made further from the resolution time do worse, both in PredictionBook and in Metaculus (the correlation is statistically significant at p < 0.001, but very small). Code in R:
datapre <- read.csv("pb2.csv") ## or met2.csv
data <- datapre[datapre$range>0,]
data$brier = (data$result-data$probability)^2
accuracyscores = c() ## Lower is better, much like the Brier score.
ranges = c()
for(id in unique(data$id)){
  predictions4question = (data$id == id)
  briers4question = data$brier[predictions4question]
  accuracyscores4question = briers4question - mean(briers4question)
  ranges4question = data$range[predictions4question]
  accuracyscores = c(accuracyscores, accuracyscores4question)
  ranges = c(ranges, ranges4question)
}
summary(lm(accuracyscores ~ ranges))
Another interesting thing you can do with the data is to calculate the prior probability that a Metaculus or PB question will resolve positively:
For Metaculus this is 0.3160874; for PB, 0.3770311.
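One way to get these numbers (a sketch; whether to count one resolution per question id or to weight by individual predictions is a judgment call, so the exact values may differ):
data <- read.csv("met2.csv") ## or pb2.csv
resolutions <- tapply(data$result, data$id, function(x) x[1]) ## one resolution per question id
mean(resolutions) ## share of questions that resolved positively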