o3-mini got 32% on Frontier Math (!) when given access to a Python tool. In an AMA, @kevinweil / @snsf (OAI) both referenced tool use with reasoning models, including retrieval (!), as a future rollout.
METR RE-bench
Models are tested with agent scaffolds
AIDE and Modular refer to different agent scaffolds: Modular is a very simple baseline scaffold that just lets the model repeatedly run code and see the results, while AIDE is a more sophisticated scaffold that implements a tree-search procedure.
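To make the Modular-style setup concrete, here is a minimal sketch of a scaffold loop of that shape. This is not METR's actual harness: `query_model` stands in for whatever API call produces the model's reply, and the prompt/transcript conventions (including the `<python>...</python>` tags) are invented for the example.

```python
import re
import subprocess
import sys
from typing import Callable, Optional

def extract_code(reply: str) -> Optional[str]:
    """Pull code out of the model's reply; the (invented) prompt convention
    here is that runnable code is wrapped in <python>...</python> tags."""
    match = re.search(r"<python>(.*?)</python>", reply, re.DOTALL)
    return match.group(1) if match else None

def run_python(code: str, timeout: int = 60) -> str:
    """Execute the snippet in a fresh subprocess and return stdout + stderr."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout + proc.stderr

def modular_scaffold(task: str, query_model: Callable[[str], str],
                     max_turns: int = 10) -> str:
    """Repeatedly ask the model for code, run it, and show it the output."""
    transcript = (
        f"Task: {task}\n"
        "Write code inside <python>...</python> tags; "
        "you will see its output each turn.\n"
    )
    for _ in range(max_turns):
        reply = query_model(transcript)
        transcript += f"\nModel: {reply}\n"
        code = extract_code(reply)
        if code is None:  # no code this turn -> treat the reply as the final answer
            return reply
        transcript += f"Tool output:\n{run_python(code)}\n"
    return transcript
```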
I think the correct way to address this is by also testing the other models with agent scaffolds that supply web search and a Python interpreter.
I think it’s wrong to jump to the conclusion that non-agent-finetuned models can’t benefit from tools.
See for example:
Frontier Math result
https://x.com/Justin_Halford_/status/1885547672108511281
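To sketch what the suggested comparison could look like, here are the two tool definitions one might hand to a non-agent-finetuned chat model, written in OpenAI-style function-calling JSON schema. The exact wire format varies by provider, and `run_python` / `search_web` are hypothetical implementations the harness would have to supply (e.g. the `run_python` from the sketch above).

```python
# Tool specs only describe the tools to the model; the harness executes
# whichever tool the model calls and feeds the result back into the chat.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute a Python snippet and return its stdout and stderr.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web and return the top result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]
```

Wiring these into a loop like the one above would give the other models roughly the same affordances the scaffolded models get, which is the comparison that would actually test whether agent finetuning is doing the work.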