But does that necessarily matter? Many of those models can’t use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (e.g., it might do worse than the original o3 model it was trained from, when deprived of the tools it is supposed to use).
I assume that what’s going on here is something like, “This was low-hanging fruit; it was just a matter of time until someone did the corresponding test.”
This would imply that OpenAI’s work here isn’t impressive, and also that previous LLMs might have essentially been underestimated. There’s basically a cheap latent capabilities gap.
I imagine a lot of software engineers / entrepreneurs aren’t too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain.
So some people could look at this and say, “sure, this result is to be expected,” while others would be impressed by what LLMs + simple tools are capable of.
I think the correct way to address this is by also testing the other models with agent scaffolds that supply web search and a Python interpreter.
I think it’s wrong to jump to the conclusion that non-agent-finetuned models can’t benefit from tools.
See for example:
Frontier Math result
https://x.com/Justin_Halford_/status/1885547672108511281
o3-mini got 32% on Frontier Math (!) when given access to a Python tool. In an AMA, @kevinweil / @snsf (OAI) both referenced tool use with reasoning models, including retrieval (!), as a future rollout.
METR RE-bench
Models are tested with agent scaffolds: AIDE and Modular refer to different agent scaffolds. Modular is a very simple baseline scaffold that just lets the model repeatedly run code and see the results; AIDE is a more sophisticated scaffold that implements a tree search procedure.
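For concreteness, here is a minimal sketch of a Modular-style scaffold of the kind described above: a loop that lets a chat model repeatedly write Python, executes it, and feeds the output back. The `ask_model` function is a hypothetical stand-in for whatever chat-completion API is being tested; this is illustrative only, not METR’s or AIDE’s actual code.

```python
# Minimal sketch of a "Modular"-style agent scaffold: the model is repeatedly
# allowed to write Python, the scaffold runs it, and the output is fed back.
# ask_model() is a hypothetical stand-in for any chat-completion API; this is
# illustrative only, not METR's actual harness.
import re
import subprocess

def ask_model(messages: list[dict]) -> str:
    """Call the chat model under test (OpenAI, Anthropic, local, ...)."""
    raise NotImplementedError  # wire up your provider of choice here

def run_python(code: str, timeout: int = 60) -> str:
    """Execute a snippet in a subprocess and capture stdout/stderr."""
    proc = subprocess.run(
        ["python", "-c", code], capture_output=True, text=True, timeout=timeout
    )
    return (proc.stdout + proc.stderr)[-4000:]  # truncate very long logs

def solve(task: str, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": (
            "You may write Python between <python> and </python> tags; each "
            "block is executed and its output is returned to you. When you are "
            "confident, reply with FINAL ANSWER: <answer>."
        )},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = ask_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "FINAL ANSWER:" in reply:
            return reply.split("FINAL ANSWER:", 1)[1].strip()
        blocks = re.findall(r"<python>(.*?)</python>", reply, re.DOTALL)
        feedback = run_python(blocks[-1]) if blocks else "(no code block found)"
        messages.append({"role": "user", "content": "Execution output:\n" + feedback})
    return "(no answer within step budget)"
```

A web-search tool could be exposed to the model in the same way as the Python tool. An AIDE-style scaffold would replace this linear loop with a tree search: generate several candidate solutions, score them, and iteratively expand or revise the most promising nodes.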