Great questions, and thanks for the helpful comments!

underconfidence issues

We have not tried explicit extremizing. However, in the study where we average our system's predictions with the community crowd, we find results better than either alone (under Brier scores). This effectively performs the extremizing in those <10% cases.
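As a small sketch of what we mean by the averaging study (with made-up probabilities, not our actual data), here the system and crowd err in opposite directions on some questions, so their average scores better than either under the Brier score:

```python
# Toy illustration: averaging system and crowd probabilities, scored by Brier.
# All numbers below are hypothetical, purely to show the computation.

def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

system = [0.4, 0.1, 0.8, 0.30]   # hypothetical system probabilities
crowd  = [0.9, 0.6, 0.6, 0.05]   # hypothetical community crowd probabilities
labels = [1, 0, 1, 0]            # resolved outcomes

avg = [(s + c) / 2 for s, c in zip(system, crowd)]

print(brier(system, labels))  # 0.125
print(brier(crowd, labels))   # 0.133125
print(brier(avg, labels))     # ~0.0914, better than both
```

In this toy case the average beats both inputs because their errors partially cancel; the magnitude of the effect on real data is of course different.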

However, humans on INFER and Metaculus are scored according to the integral of their Brier score over time, i.e. their score gets weighted by how long a given forecast is up

We were not aware of this! We always take an unweighted average across the retrieval dates when evaluating our system. If we put more weight on the later retrieval dates, the gap between our system and humans should be a bit smaller, for the reason you mention.
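To make the difference concrete, here is a hedged sketch (with made-up per-date Brier scores and durations for one question) contrasting our unweighted average with a duration-weighted average in the spirit of the INFER/Metaculus integral scoring:

```python
# Hypothetical per-date Brier scores for one question (early -> late retrieval).
scores = [0.25, 0.20, 0.10, 0.04]

# Days each forecast "stays up" until the next retrieval date (or resolution).
# In this made-up example, later forecasts stay up longer, so they get more weight.
durations = [10, 20, 30, 30]

def weighted_avg(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

unweighted = sum(scores) / len(scores)            # what we currently report
time_weighted = weighted_avg(scores, durations)   # integral-over-time style

print(unweighted, time_weighted)
```

Under this weighting, dates where the (typically better) later forecasts are active count for more, which is why we expect the gap to shrink somewhat.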

Relatedly, have you tried other retrieval schedules and if so did they affect the results

No, we have not tried others. One alternative is to sample k random or uniformly spaced retrieval dates within [open, resolve]. Unfortunately, this is problematic because it leaks the resolution date, which, as we argued in the paper, correlates with the label.
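A quick sketch of the uniformly spaced variant (toy dates, hypothetical helper) makes the leak visible: every date in the schedule is a function of the resolution date.

```python
# Sketch of the alternative schedule we describe, and why we avoid it.
# Dates below are toy values; `uniform_schedule` is a hypothetical helper.
from datetime import date, timedelta

def uniform_schedule(open_date, resolve_date, k):
    """k retrieval dates evenly spaced in [open_date, resolve_date]."""
    span = (resolve_date - open_date).days
    return [open_date + timedelta(days=round(i * span / (k - 1)))
            for i in range(k)]

dates = uniform_schedule(date(2023, 1, 1), date(2023, 12, 31), 5)
print(dates)
# The schedule depends on resolve_date, so the retrieval dates themselves
# reveal when the question resolves -- which correlates with the label.
```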

figure 4c

This is on the validation set. Note that the figure caption begins with "Figure 4: System strengths. Evaluating on the validation set, we note"

log score

We will update the paper soon to include the log score.

standard error in time series

See here for some alternatives in time series modeling.

I don't know what the perfect choice for judgmental forecasting is, though. I am not sure whether it has been studied at all (probably something of an open question). Generally, the keywords to Google here are "autocorrelation standard error" and "standard error in time series".
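One concrete option from that literature is a Newey-West (HAC) standard error for the mean of a score series; the sketch below implements the Bartlett-kernel version on a synthetic autocorrelated series. This is just one standard choice, not a claim about what is right for judgmental forecasting.

```python
# Hedged sketch: Newey-West (HAC) standard error of a time-series mean,
# using Bartlett-kernel weights. Data below is synthetic AR(1) noise.
import numpy as np

def newey_west_se(x, max_lag):
    """HAC standard error of mean(x) with Bartlett weights up to max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    var = np.dot(d, d) / n                   # lag-0 autocovariance
    for lag in range(1, max_lag + 1):
        w = 1.0 - lag / (max_lag + 1)        # Bartlett weight
        cov = np.dot(d[lag:], d[:-lag]) / n  # lag-k autocovariance
        var += 2.0 * w * cov
    return np.sqrt(var / n)

rng = np.random.default_rng(0)
e = rng.normal(size=500)
x = np.empty(500)
x[0] = e[0]
for t in range(1, 500):                      # AR(1): positive autocorrelation
    x[t] = 0.7 * x[t - 1] + e[t]

naive = x.std(ddof=1) / np.sqrt(len(x))
print(naive, newey_west_se(x, max_lag=10))
```

With positively autocorrelated scores, the naive i.i.d. standard error understates uncertainty, which is exactly the worry with daily forecast scores.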
