Scaffolding vs Reinforcement Finetuning for AI Forecasting
Epistemic status: low-to-medium confidence in the results; this is work I did last year with a small sample size. However, I think the takeaways are still accurate.
I built a forecasting bot using OpenAI's Reinforcement Finetuning and a multi-agent architecture, then tested it against simpler baselines in a Metaculus tournament. The aggregate scores favored the baseline, but when I broke down results by question type, the finetuned model outperformed on numeric questions (average +14.59 vs +9.25, using Metaculus Baseline Scoring, where 0 means random guessing) while underperforming on binary ones (−0.70 vs +2.40).
The tournament was minibench-2025-09-29, running from September 29 to October 25, 2025.
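For reference, the binary case of Metaculus Baseline Scoring rewards log-score improvement over a uniform 50% guess. The exact formula below is my understanding of their published scoring, not something verified against their implementation:

```python
import math

def binary_baseline_score(p_outcome: float) -> float:
    """Baseline score for a binary question, assuming the formula
    100 * log2(2 * p), where p is the probability assigned to the
    outcome that actually resolved. A 50% prediction scores 0
    ("random guessing"); a certain correct prediction scores 100."""
    return 100 * math.log2(2 * p_outcome)

print(binary_baseline_score(0.5))  # 0.0
print(binary_baseline_score(0.9))  # ~84.8
```

Note the asymmetry: confident wrong predictions are punished much harder (the score goes to negative infinity as p approaches 0) than confident right ones are rewarded, which matters for the overconfidence failures discussed below.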
Setup
I used o4-mini finetuned via OpenAI RFT as my model and compared it against two baselines: a standard o4-mini metac bot and a high-effort o4-mini metac bot.
For my scaffold, I built a multi-agent system with 3 parallel forecaster teams (each team includes a researcher to find info from the web and a forecaster to predict the result) then an aggregator. The aggregator finds a middle ground between the teams and chooses when to stop—specifically, when predictions converge to within 2% over two rounds.
Then I used RFT to improve forecasting ability. I used a dual grading setup: 60% forecast accuracy (using baseline scores) and 40% reasoning quality (to ensure the model learns to reason from events instead of memorizing outcomes—a pitfall highlighted in the paper Pitfalls in Evaluating Language Model Forecasters).
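The dual grader reduces to a weighted combination. A minimal sketch, assuming both component scores have already been normalized to [0, 1] (the normalization step is my assumption, not a detail from the actual grader):

```python
def combined_reward(accuracy_score: float, reasoning_score: float) -> float:
    """RFT reward: 60% forecast accuracy, 40% reasoning quality.
    Both inputs are assumed normalized to [0, 1] -- e.g. a rescaled
    baseline score and a rubric-graded reasoning score."""
    return 0.6 * accuracy_score + 0.4 * reasoning_score

print(combined_reward(0.9, 0.5))  # 0.74
```

The 40% reasoning term is what keeps a model from being rewarded for a lucky but poorly-argued forecast, though as Lesson 2 below suggests, it does not grade the research itself.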
One limitation: OpenAI’s data policies required me to compress the web research content in my training samples. This meant the model learned how to use research results for forecasting, but not how to find good research in the first place.
The training dataset had 979 samples from 344 unique forecasting questions. I created it by running the system on these questions and compressing the research at different steps of each trajectory to turn them into training samples. Training cost $1,670 and took about 12 hours 42 minutes.
The training data broke down by question type as: 56.5% binary, 21.5% multiple choice, 21.1% numeric, 0.9% discrete. By topic: roughly 54% miscellaneous (flight destinations, reservoir levels, etc.), 16% politics/government, 9% economics/finance, 9% AI/technology, 5% geopolitics, 4% business/stocks, 3% sports, and under 1% weather.
Meanwhile, the baseline metac bots used a simpler pipeline: research → 5 parallel forecasters → take the median.
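The baseline's aggregation step is just a median over the five forecasts, which can be sketched in a few lines:

```python
import statistics

def aggregate_baseline(forecasts: list[float]) -> float:
    """Baseline pipeline's aggregation: the median of the five parallel
    forecasters' probabilities. The median is robust to a single
    forecaster going badly wrong, unlike a mean."""
    return statistics.median(forecasts)

# One outlier at 0.95 barely moves the result.
print(aggregate_baseline([0.30, 0.32, 0.35, 0.38, 0.95]))  # 0.35
```

This robustness to individual outliers is one plausible reason the simpler pipeline held up well on binary questions.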
Results
On 35 questions where all three bots competed, my finetuned model won 12 (34.3%), the o4-mini baseline won 13 (37.1%), and o4-mini-high won 10 (28.6%). Average scores: finetuned 3.23, baseline 4.16, high-effort 1.82.
When I looked at performance by question type, the picture changed. The tournament had 9 numeric questions and 26 binary questions. On numeric, my finetuned model won 5 of 9 (55.6%) with average score +14.59, versus +9.25 for baseline. On binary, my finetuned model won only 7 of 26 (26.9%) with average score −0.70, versus +2.40 for baseline.
The finetuned model's aggregate underperformance was driven entirely by binary questions; on numeric questions it outperformed the baseline.
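As a sanity check, the per-type scores are consistent with the reported aggregate: the question-weighted average of the numeric and binary scores recovers the finetuned model's overall 3.23.

```python
# 9 numeric questions at +14.59, 26 binary questions at -0.70.
numeric_n, numeric_avg = 9, 14.59
binary_n, binary_avg = 26, -0.70

aggregate = (numeric_n * numeric_avg + binary_n * binary_avg) / (numeric_n + binary_n)
print(round(aggregate, 2))  # 3.23, matching the reported average
```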
Biggest wins came on financial questions while biggest losses came on political/legal questions.
Key Lesson 1: The model learned to trust authoritative sources
Here are two questions that show this pattern: one where the heuristic paid off, and one where it backfired.
Case 1: Inditex market cap (+77 points) The question asked whether the Metaculus community prediction would stay above 40% for whether Inditex would remain in the EU’s top 10 companies by market cap.
My finetuned bot predicted 58%: “CompaniesMarketCap consistently lists Inditex at #10 among EU companies with a 5-6% cushion over #11 Deutsche Telekom. This directly supports the underlying event and should keep the Metaculus community prediction above 40%...”
The baseline predicted 20%: “With only nine days to go and no major positive catalyst for Inditex—recent news shows underperformance relative to peers—forecasters will likely drift lower.”
Resolution: Yes. My bot found an authoritative data source that directly measured the outcome and trusted it.
Case 2: IRA tax credits (-62 points) The question asked whether the community prediction would rise above 5.7% for whether IRA manufacturing tax credits would start requiring domestic materials.
My finetuned bot predicted 74%: “Multiple independent and reputable legal analyses (Miller & Chevalier, Grant Thornton, Sidley, Arnold & Porter, Baker Botts, Gibson Dunn, Davis Polk, Latham, RSM) converge to confirm that the July 2025 reconciliation law enacted a specific domestic-materials requirement… it is highly probable that a well-sourced comment will surface and gain traction before October 18.”
The baseline predicted 15%: “With Republicans holding Congress and no public push for new domestic-content rules in the next two weeks, it’s unlikely the community forecast will climb meaningfully.”
Resolution: No. My bot found authoritative legal analysis and trusted it. But the question wasn’t about the legal facts. It was about whether forecasters would notice and update their predictions in the next two weeks.
My takeaway: In both cases, my model found credible sources and made confident predictions. It would have done better if it had learned when to distrust sources, and how to treat source evidence as only one input on meta-level questions about what forecasters will do.
Key Lesson 2: Training data composition shapes failure modes non-obviously
My training data had more political questions (~16%) than finance (~9%). Yet the model performed better on financial questions. If topic exposure were the issue, I’d expect the opposite.
I think the issue is in how the model reasoned. The graders for RL evaluated forecast accuracy and reasoning quality, but not research quality or source selection. The model didn’t seem to learn when to apply heuristics like “nothing ever happens,” or how to model the gap between “evidence exists” and “forecasters will notice.”
Key Lesson 3: The ROI of iteration beats the ROI of finetuning
Finetuning cost $1,670 plus roughly 35 hours of engineering. Yet on these 35 questions, the finetuned model performed worse than the Metaculus baseline.
What I’d do differently:
Backtesting infrastructure. Services like AskNews offer historical APIs for testing scaffolding against resolved questions. I could have spent more time iterating on scaffolding—testing research strategies, aggregation methods, confidence calibration—against known outcomes.
Experience databases. Zhao et al. 2023 showed storing historical forecasts and outcomes in a retrievable database improves predictions without training. This seems to simulate some of the benefits of finetuning by improving performance with more samples.
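The experience-database idea can be sketched minimally: retrieve the most similar past (question, forecast, outcome) records and append them to the forecaster's context. Everything here—function names, record schema, the use of string similarity instead of embeddings—is illustrative, not Zhao et al.'s actual method:

```python
from difflib import SequenceMatcher

def retrieve_similar(question: str, experience: list[dict], k: int = 3) -> list[dict]:
    """Rank stored (question, forecast, outcome) records by textual
    similarity to the new question and return the top k. A real system
    would use embeddings; SequenceMatcher stands in for simplicity."""
    scored = sorted(
        experience,
        key=lambda rec: SequenceMatcher(None, question, rec["question"]).ratio(),
        reverse=True,
    )
    return scored[:k]

db = [
    {"question": "Will company X stay in the top 10 by market cap?", "forecast": 0.6, "outcome": 1},
    {"question": "Will a new tariff rule pass before October?", "forecast": 0.3, "outcome": 0},
]
print(retrieve_similar("Will company Y remain in the top 10 by market cap?", db, k=1))
```

Unlike finetuning, the "training data" here stays inspectable and can be edited after deployment, which sidesteps the bias-baking problem discussed next.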
The core issue: if the training data contains biases, the finetuned model inherits them.
Caveats
Small sample size. 35 questions (9 numeric, 26 binary) means high variance. The pattern could partially reflect noise.
Confounded comparison. I tested finetuned + complex scaffolding against unfinetuned + simple scaffolding. Can’t isolate whether underperformance came from finetuning, scaffolding, or their interaction.
Compressed research. The model learned to reason about research but not find good research.