o3-mini got 32% on Frontier Math (!) when given access to a Python tool. In an AMA, @kevinweil / @snsf (OAI) both referenced tool use with reasoning models, including retrieval (!), as a future rollout.
METR RE-bench
Models are tested with agent scaffolds
AIDE and Modular refer to different agent scaffolds: Modular is a very simple baseline scaffold that just lets the model repeatedly run code and see the results, while AIDE is a more sophisticated scaffold that implements a tree-search procedure.
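To make the Modular-style setup concrete, here is a minimal sketch of a scaffold loop of that shape. This is not METR's actual harness: `query_model` stands in for whatever API call produces the model's reply, and the prompt/transcript conventions (including the `<python>...</python>` tags) are invented for the example.

```python
import re
import subprocess
import sys
from typing import Callable, Optional

def extract_code(reply: str) -> Optional[str]:
    """Pull code out of the model's reply; the (invented) prompt convention
    here is that runnable code is wrapped in <python>...</python> tags."""
    match = re.search(r"<python>(.*?)</python>", reply, re.DOTALL)
    return match.group(1) if match else None

def run_python(code: str, timeout: int = 60) -> str:
    """Execute the snippet in a fresh subprocess and return stdout + stderr."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout + proc.stderr

def modular_scaffold(task: str, query_model: Callable[[str], str],
                     max_turns: int = 10) -> str:
    """Repeatedly ask the model for code, run it, and show it the output."""
    transcript = (
        f"Task: {task}\n"
        "Write code inside <python>...</python> tags; "
        "you will see its output each turn.\n"
    )
    for _ in range(max_turns):
        reply = query_model(transcript)
        transcript += f"\nModel: {reply}\n"
        code = extract_code(reply)
        if code is None:  # no code this turn -> treat the reply as the final answer
            return reply
        transcript += f"Tool output:\n{run_python(code)}\n"
    return transcript
```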
I think the correct way to address this is by also testing the other models with agent scaffolds that supply web search and a Python interpreter.
I think it’s wrong to jump to the conclusion that non-agent-finetuned models can’t benefit from tools.
See for example:
Frontier Math result
https://x.com/Justin_Halford_/status/1885547672108511281
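To sketch what the suggested comparison could look like, here are the two tool definitions one might hand to a non-agent-finetuned chat model, written in OpenAI-style function-calling JSON schema. The exact wire format varies by provider, and `run_python` / `search_web` are hypothetical implementations the harness would have to supply (e.g. the `run_python` from the sketch above).

```python
# Tool specs only describe the tools to the model; the harness executes
# whichever tool the model calls and feeds the result back into the chat.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute a Python snippet and return its stdout and stderr.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web and return the top result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]
```

Wiring these into a loop like the one above would give the other models roughly the same affordances the scaffolded models get, which is the comparison that would actually test whether agent finetuning is doing the work.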