Note that for HLE, most of the difference in performance might be explained by Deep Research having access to tools while other models are forced to reply instantly with no tool use.
But does that necessarily matter? Many of those models can't use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (e.g. it might do worse than the original o3 model it is trained from, when deprived of the tools it is supposed to use).
I assume that what's going on here is something like, "This was low-hanging fruit; it was just a matter of time until someone ran the corresponding test."
This would imply that OpenAI's work here isn't impressive, and also that previous LLMs might have essentially been underestimated. There's basically a cheap latent-capabilities gap.
I imagine a lot of software engineers / entrepreneurs aren’t too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain.
So some people could look at this and say, “sure, this test is to be expected”, and others would be impressed by what LLMs + simple tools are capable of.
I think the correct way to address this is by also testing the other models with agent scaffolds that supply web search and a Python interpreter.
I think it’s wrong to jump to the conclusion that non-agent-finetuned models can’t benefit from tools.
See for example:
Frontier Math result: https://x.com/Justin_Halford_/status/1885547672108511281
o3-mini got 32% on Frontier Math (!) when given access to a Python tool. In an AMA, @kevinweil / @snsf (OAI) both referenced tool use with reasoning models, including retrieval (!), as a future rollout.
METR RE-bench: models are tested with agent scaffolds.
AIDE and Modular refer to different agent scaffolds: Modular is a very simple baseline scaffold that just lets the model repeatedly run code and see the results, while AIDE is a more sophisticated scaffold that implements a tree-search procedure.
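To make the "Modular"-style setup concrete, here is a minimal sketch of that kind of scaffold: prompt the model, execute any code it writes, show it the output, and repeat. The query_model helper, the <python> tag convention, and the loop details are placeholders of my own, not METR's actual harness.

```python
# Minimal sketch of a "Modular"-style scaffold as described above: the model is
# repeatedly prompted, any code it emits is executed, and the output is fed back
# so it can iterate. Illustrative only: query_model() is a placeholder for
# whichever chat API you use, and the <python>...</python> convention is an
# assumption of this sketch, not METR's actual harness.
import re
import subprocess


def query_model(messages: list[dict]) -> str:
    """Placeholder: send the conversation to an LLM and return its reply text."""
    raise NotImplementedError


def extract_code(reply: str) -> str | None:
    """Pull the first <python>...</python> block out of the reply, if any."""
    match = re.search(r"<python>(.*?)</python>", reply, re.DOTALL)
    return match.group(1) if match else None


def modular_loop(task: str, max_steps: int = 10) -> list[dict]:
    """Let the model repeatedly run code and see the results, up to max_steps times."""
    messages = [{
        "role": "user",
        "content": task + "\n\nPut any code you want to run inside <python>...</python> "
                          "tags; you will be shown its output and can then continue.",
    }]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        code = extract_code(reply)
        if code is None:  # no code emitted: treat the reply as the final answer
            break
        # Execute the code in a subprocess and feed stdout/stderr back to the model.
        result = subprocess.run(
            ["python", "-c", code], capture_output=True, text=True, timeout=120
        )
        messages.append({
            "role": "user",
            "content": f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}",
        })
    return messages
```

An AIDE-style scaffold would, roughly, replace this single linear loop with a tree search over candidate solutions, expanding the more promising branches instead of always continuing from the latest attempt.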
Yes, they do highlight this difference. I wonder how full o3 scores? It would be interesting to know how much of the improvement comes from o3's improved reasoning and how much from the sequential research procedure.
And how much the improved reasoning is from using a different base model vs. different post-training. It’s possible R1-like training didn’t work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.
How o3-mini scores: https://x.com/DanHendrycks/status/1886213523900109011
10.5-13% on the text-only part of HLE (text-only questions are 90% of the benchmark)
[corrected the above to read “o3-mini”, thanks.]
This is for o3-mini, while the ~25% figure for o3 from the tweet you linked is simply restating the Deep Research evals.