But does that necessarily matter? Many of those models can’t use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (e.g., it might do worse than the original o3 model it was trained from, when deprived of the tools it is supposed to use).
I assume that what’s going on here is something like, “This was low-hanging fruit; it was just a matter of time until someone did the corresponding test.”
This would imply that OpenAI’s work here isn’t impressive, and also that previous LLMs might have essentially been underestimated. There’s basically a cheap latent capabilities gap.
I imagine a lot of software engineers / entrepreneurs aren’t too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain.
So some people could look at this and say, “sure, this result is to be expected,” while others would be impressed by what LLMs + simple tools are capable of.
I think the correct way to address this is by also testing the other models with agent scaffolds that supply web search and a Python interpreter.
I think it’s wrong to jump to the conclusion that non-agent-finetuned models can’t benefit from tools.
See for example:
Frontier Math result
https://x.com/Justin_Halford_/status/1885547672108511281
o3-mini got 32% on Frontier Math (!) when given access to a Python tool. In an AMA, @kevinweil / @snsf (OAI) both referenced tool use with reasoning models, including retrieval (!), as a future rollout.
METR RE-bench
Models are tested with agent scaffolds: AIDE and Modular refer to different agent scaffolds. Modular is a very simple baseline scaffold that just lets the model repeatedly run code and see the results; AIDE is a more sophisticated scaffold that implements a tree search procedure.
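For concreteness, here is a minimal sketch of a Modular-style scaffold of the kind described above: a loop that lets a chat model repeatedly write Python, executes it, and feeds the output back. The `ask_model` function is a hypothetical stand-in for whatever chat-completion API is being tested; this is illustrative only, not METR’s or AIDE’s actual code.

```python
# Minimal sketch of a "Modular"-style agent scaffold: the model is repeatedly
# allowed to write Python, the scaffold runs it, and the output is fed back.
# ask_model() is a hypothetical stand-in for any chat-completion API; this is
# illustrative only, not METR's actual harness.
import re
import subprocess

def ask_model(messages: list[dict]) -> str:
    """Call the chat model under test (OpenAI, Anthropic, local, ...)."""
    raise NotImplementedError  # wire up your provider of choice here

def run_python(code: str, timeout: int = 60) -> str:
    """Execute a snippet in a subprocess and capture stdout/stderr."""
    proc = subprocess.run(
        ["python", "-c", code], capture_output=True, text=True, timeout=timeout
    )
    return (proc.stdout + proc.stderr)[-4000:]  # truncate very long logs

def solve(task: str, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": (
            "You may write Python between <python> and </python> tags; each "
            "block is executed and its output is returned to you. When you are "
            "confident, reply with FINAL ANSWER: <answer>."
        )},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = ask_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "FINAL ANSWER:" in reply:
            return reply.split("FINAL ANSWER:", 1)[1].strip()
        blocks = re.findall(r"<python>(.*?)</python>", reply, re.DOTALL)
        feedback = run_python(blocks[-1]) if blocks else "(no code block found)"
        messages.append({"role": "user", "content": "Execution output:\n" + feedback})
    return "(no answer within step budget)"
```

A web-search tool could be exposed to the model in the same way as the Python tool. An AIDE-style scaffold would replace this linear loop with a tree search: generate several candidate solutions, score them, and iteratively expand or revise the most promising nodes.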