We Should Be Scaling RL on Forecasting
This is a crosspost of a post from my blog, Metal Ivy. The original is here: Reinforcement Learning on Forecasting Will Give Us a Superhuman Forecaster.
Why RL on forecasting?
When DeepSeek R1 came out in January 2025, I felt that the fact that RL on LLMs simply worked was incredible, but using it on coding and math wasn’t the right path.
Before RL we had pretraining, a scalable and general training methodology that worked extremely well to get the model to the human level, through learning by imitation over human data. Then RL came in and gave us a way to get even further, to the expert level and beyond, through sampling many trajectories from the LLM and using a reward function to select the best ones to reinforce. But it isn’t general anymore when only short term, self contained verifiable tasks such as coding or math make up the environment.
A strongly superhuman coder might change everything—if recursive self improvement happens like the labs hope (and doesn’t kill us). But it might not change that much at all by itself, beyond giving us more of the software abundance we in many ways already have. A strongly superhuman forecaster instantly gives people and organizations the ability to make superhuman decisions through forecasting of their outcomes, and would be a massive boost to the overall competence of our civilization.
You may ask why it should work, even in theory: math is deterministic and forecasting is not, so a forecasting reward may give bad weight updates. The analogy to keep in mind is next token prediction. The model predicts a distribution, and sometimes it’s punished for a great forecast because the next token was a typo. So you start with a high learning rate and lower it gradually during training, and the averaging of the updates at the end is enough to keep learning stable even when your accuracy is high and the randomness of the signal starts to dominate your errors. Which is to say, this is a solved problem.
So I decided my thesis at Oxford would be on RL for forecasting, though it took me another half a year after the end of the degree to get what I think is a truly scalable, working formula.
The main observation is that the most tempting and simplest way to do RL on forecasting, and what I did for the thesis, is to separate the LLM’s context gathering step from the forecasting and train only the second part. You gather a bunch of questions, generate a context summary for each question at dataset creation time, and then a few months later, when the questions have resolved, you train a model to reason over this context to produce a probability.
What I noticed is that there is a scaling effect across base model capability and across compute spent training on forecasting—but then a plateau is reached, and that plateau depends on the size of the pregenerated context summary. That is, performance is bottlenecked by the information available to the model. And it makes it look like you can reach—or mildly surpass—frontier LLMs with RL on smaller models, but not go any further. But this is only in this limited information environment.
Seeing how important the context is made me realize that you need to buckle up and figure out how to let the model do tool calls to find the information it needs inside the RL environment, as if it has access to the live internet, even though of course you cannot do that, because the questions have already resolved. As far as I’m aware, no one had done it yet, but it’s actually not as daunting as you would think.
Something to understand is that even when an AI model does live forecasting, it’s not best to give it a single “browser use” tool. You give it many tools: some APIs are gated by API keys and accounts that you don’t want the model to create itself for every question, a coding tool lets it run simulations to inform its forecasts if it wishes, and yes, you also give it a search tool and a web fetch tool.
So to create the time-masked RL environment (“cached internet”), you give it many tools, and from its perspective they are simply mildly more finicky than the tools it would ideally have. I’ve seen no evidence the model realizes it’s “in the past”, so to speak. It still has search, but the search runs over Wikipedia dumps, Wikipedia revisions, and AskNews. It still has a web fetch tool; it just (invisibly) goes through the Wayback Machine. And the APIs (Google Trends data, finance data) are the easiest to time-mask, and look identical to their counterparts in the live forecasting environment.
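To make the time-masking idea concrete, here is a minimal sketch of how a web fetch tool could be routed through the Wayback Machine. It is not the implementation used for the results below. It assumes the Internet Archive availability endpoint and its JSON shape (`archived_snapshots.closest`), and the function names and the UTC-aware `cutoff` argument are hypothetical. The important detail is that the "closest" snapshot can be newer than the question’s cutoff, so the capture time has to be checked explicitly.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone
from typing import Optional

WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"
STAMP = "%Y%m%d%H%M%S"  # timestamp format used by the Wayback Machine


def availability_url(page_url: str, cutoff: datetime) -> str:
    """Build a query asking for the snapshot closest to `cutoff` (UTC-aware)."""
    query = urllib.parse.urlencode(
        {"url": page_url, "timestamp": cutoff.strftime(STAMP)}
    )
    return f"{WAYBACK_AVAILABILITY}?{query}"


def snapshot_before_cutoff(response: dict, cutoff: datetime) -> Optional[str]:
    """Return the snapshot URL only if it was captured at or before `cutoff`."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    captured = datetime.strptime(closest["timestamp"], STAMP).replace(
        tzinfo=timezone.utc
    )
    # "Closest" may be later than the cutoff, which would leak the future.
    return closest["url"] if captured <= cutoff else None


def fetch_masked(page_url: str, cutoff: datetime) -> Optional[str]:
    """Hypothetical tool entry point: fetch the page as it looked before `cutoff`."""
    with urllib.request.urlopen(availability_url(page_url, cutoff), timeout=30) as resp:
        snapshot_url = snapshot_before_cutoff(json.load(resp), cutoff)
    if snapshot_url is None:
        return None  # the tool would report "page not found" to the model
    with urllib.request.urlopen(snapshot_url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

This sketch only checks the capture time. A real environment would presumably also need to strip the archive’s own banner and rewritten links so the tool looks like a normal fetch.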
Results
Once I had this environment working, improvement became a lot more drastic. First of all it became easier to get any improvement at all—I switched from training on Polymarket price prediction (which is easier to get stable training from since it has less randomness, at the cost of introducing limits and biases) to real world resolutions.[1] Second, the improvement went from taking small models to matching large models in the low information regime, to taking moderately sized open-weights models (DeepSeek V3.1) to crushing massive closed-source models in the full environment.
The results shown here are backtested on 100 random questions from those already resolved on the Metaculus Spring 2026 AI Benchmark. The y-axis is Brier score (lower is better, 0.25 is a coin flip, 0 is a perfect oracle) and the x-axis is how much it cost me to get to that training checkpoint. An example of a typical question is “Will any model evaluated by Epoch AI score at least 40.0% on FrontierMath Tier 4 before May 1, 2026?”.
The improvement getting steeper over time is partially due to real improvements to the training environment (e.g. at the start the only tools were Wikipedia search and Wikipedia revisions) and partially an artifact of the x-axis being cost, with me figuring out cost saving measures over time. In particular, at the ~$5K mark I realized that generalization from few tool calls during training (as few as three) to more during evaluation (10) worked incredibly well, and since training cost is quadratic in the trajectory length, this makes training far more efficient.
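For readers unfamiliar with the metric, the Brier score on binary questions is the mean squared difference between the forecast probability and the 0/1 outcome. A minimal sketch (the function name is mine, not from any particular library):

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 resolutions."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need one outcome per forecast, and at least one forecast")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Always answering 50% scores 0.25 regardless of how the questions resolve.
coin_flip = brier_score([0.5, 0.5, 0.5], [1, 0, 1])
# A perfect oracle scores 0.
oracle = brier_score([1.0, 0.0, 1.0], [1, 0, 1])
```

Note that a confident wrong answer is punished heavily: a 100% forecast on a question that resolves "no" contributes a full 1.0 to the average.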
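As a back-of-the-envelope illustration of why fewer tool calls in training matters so much, here is the quadratic cost claim worked through with made-up numbers. The token counts per tool call and per prompt are purely hypothetical, and the cost proxy (trajectory length squared) simply takes the quadratic scaling stated above at face value.

```python
def relative_training_cost(
    tool_calls: int, tokens_per_call: int = 2_000, prompt_tokens: int = 1_000
) -> int:
    """Cost proxy: square of the total trajectory length in tokens (hypothetical sizes)."""
    trajectory_tokens = prompt_tokens + tool_calls * tokens_per_call
    return trajectory_tokens ** 2


# 10 tool calls -> 21,000 tokens; 3 tool calls -> 7,000 tokens.
ratio = relative_training_cost(10) / relative_training_cost(3)
print(f"10-call trajectories cost about {ratio:.1f}x as much as 3-call ones")  # 9.0x
```

Under these assumptions, tripling the trajectory length costs nine times as much per trajectory, so training short and evaluating long is a large saving if the generalization holds.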
Nevertheless, it is clear results are not plateauing. I varied the exact training dataset over time, but all checkpoints were trained on questions that were resolved before the questions in Metaculus Spring 2026 opened.[2] The last two checkpoints are a result of maximalist training on a wide variety of questions from Metaculus Minibenches, Metaculus Fall 2025, as well as personal decision questions (“If I do X, will I regret it?”) and seem to have generalized best to Spring 2026. The base model was released before the first question opened, so we can be sure that the pretraining data did not contain any leakage.
During March 2026 I let one of the RL’d DeepSeek model checkpoints compete on the Metaculus competitions as-is, getting a few hundred dollars in winnings, which was enough to indicate that there is also generalization from improvement on the “cached internet”/temporally masked training tools to the live tools. In addition, the model’s familiarity with the training tools isn’t enough to outweigh the advantage of the full internet browsing that live forecasting permits: a version of the model running live with simple live alternatives to the training tools beats a version using the backtesting/training compatible tools.
But this isn’t yet enough to beat the current AI forecasting state of the art, which seems to be heavy scaffolding on the information gathering side and then ensembling outputs from different models on the prediction itself, with possibly the simpler RL methodology I described above being applied as well.
My thoughts on this
I think that this path is clear enough, compute efficient enough, and the use cases valuable enough that this is a capability that we will be seeing in AI models more over the coming years. This is clearly not happening yet though: the improvement in forecasting performance between SOTA forecasting models two years apart (GPT-4o and Gemini 3.1 Pro) is less than what a few thousand dollars of purposeful forecasting RL gets you, which is many orders of magnitude less than the amount the labs have spent between those models.
In the short term I’m personally especially interested in forecasting as it impacts better planning and decision making. People regularly use LLMs for advice on a whole plethora of small and large topics, and an LLM trained on forecasting—and with a better long term world model as a result—would be a lot better positioned to give good advice.
I also think this direction is a lot safer. Early LLMs were used to answer questions, but instead of getting much better at this we are now redirecting them towards being more and more agentic, and as a result riskier from an alignment perspective.
Part of the reason for this is that there isn’t a real vision for how LLMs can become better at question answering than “give the current expert consensus answer” or “give the answer the user will like” (RLHF), since we don’t have a reward signal for what is “actually true” otherwise. I think forecasting gives us this signal since we can instead have the goal be: give the future consensus answer, or predict what the future user would have been happy with retrospectively, after having followed the proposed advice.
The definition of an ASI has changed rapidly over the past few years, but I think that if we simply had a strongly superhuman forecaster that gives us a probability in response to questions, that would already be enough to solve most of our problems. Maybe not as fast as a more active ASI, but fast enough, and it would be less likely to kill us.[3]
Some technical details that are important:
You don’t want the group size to be too big—at the extremes, the model will always have a trajectory predicting 0% and one predicting 100%. In practice this is a big problem for the smaller models and becomes less so for larger models that generally give reasonable forecasts in all of their trajectories.
One way to alleviate the above problem is, even if your evaluation is binary, to train on numeric forecasts—for instance, instead of answering “Will X happen by Y”, predict a timestamp—when will X happen. A practical way to have the model output these full distributions is by having it output code that defines the distribution.
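As a rough illustration of how a timestamp forecast relates back to a binary question, here is a sketch in which the distribution is a log-normal over "days until X happens". The choice of log-normal and the parameter names are assumptions for the example, not necessarily what the model emits in practice. The binary answer to "Will X happen by Y?" is then just the CDF evaluated at the deadline.

```python
import math
from statistics import NormalDist


def lognormal_cdf(days: float, median_days: float, sigma: float) -> float:
    """P(T <= days) when log(T) is normal with mean log(median_days) and std sigma."""
    if days <= 0:
        return 0.0
    return NormalDist(mu=math.log(median_days), sigma=sigma).cdf(math.log(days))


def probability_by_deadline(
    days_until_deadline: float, median_days: float, sigma: float
) -> float:
    """Binary forecast implied by the timing distribution."""
    return lognormal_cdf(days_until_deadline, median_days, sigma)


# Hypothetical model output: median of 120 days, fairly wide uncertainty.
p_yes = probability_by_deadline(days_until_deadline=90, median_days=120, sigma=0.8)
```

With a deadline earlier than the median, the implied probability is below 50%. One limitation of this particular sketch is that a plain log-normal puts no mass on "never happens", which a real distribution format would need to allow.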
As shown by Bereket et al., proper scoring rules stop being proper when you divide by the group standard deviation, so you need to remove that part of the reward function.
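To illustrate the last point, here is a minimal sketch of a group-relative advantage that subtracts the group mean but skips the division by the group standard deviation that GRPO-style training normally applies. The negative Brier score as the reward and the function names are assumptions for the example, not a description of the exact reward used here.

```python
from statistics import fmean
from typing import List, Sequence


def brier_reward(prob: float, outcome: int) -> float:
    """Negative squared error, so that higher is better."""
    return -((prob - outcome) ** 2)


def mean_centered_advantages(probs: Sequence[float], outcome: int) -> List[float]:
    """Advantage of each trajectory relative to its group, with no std division."""
    rewards = [brier_reward(p, outcome) for p in probs]
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


# One question that resolved "yes", four sampled trajectories.
advantages = mean_centered_advantages([0.0, 0.3, 0.6, 1.0], outcome=1)
```

In this group the trajectory that said 100% gets the largest positive advantage, which is one way to read the group size concern above: sample enough trajectories and some extreme forecast will always be there to get reinforced when it happens to be right.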
[1] Training on high-volume prediction market prices works great for a fast improvement on those questions, but it generalizes poorly to other types of questions in practice. My hypothesis is that these questions by definition have a lot of commentary about them online, and so the model learns to find a good forecast online instead of reasoning independently. This doesn’t work for questions that have attracted less interest.
[2] Excluding the one labeled “polymarket numeric”, which was on price prediction, and a few Minibench questions that resolved in January 2026.
[3] The LLM doing the forecast itself would still need to be air-gapped, but the idea is that the actual final output would be aligned, since the model was consistently rewarded during training against a very simple, objective target: what will happen. And the model can be entirely “read-only” at every point except when it provides that final output.