A useful & readable discussion of various methodological problems (including the date-range search problems above) which render all forecasting backtesting dead on arrival (IMO) was recently compiled as “Pitfalls in Evaluating Language Model Forecasters”, Paleka et al 2025, and is worth reading if you are at all interested in the topic.
Update: Bots are still beaten by human forecasting teams/superforecasters/centaurs on truly heldout Metaculus problems as of early 2025: https://www.metaculus.com/notebooks/38673/q1-ai-benchmarking-results/