Note that for HLE, most of the difference in performance might be explained by Deep Research having access to tools while other models are forced to reply instantly with no tool use.
But does that necessarily matter? Many of those models can't use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (e.g. it might do worse than the original o3 model it is trained from, when deprived of the tools it is supposed to use).
I assume that what's going on here is something like, "This was low-hanging fruit; it was just a matter of time until someone ran the corresponding test."
This would imply that OpenAI's work here isn't impressive, and also that previous LLMs might have essentially been underestimated. There's basically a cheap latent-capabilities gap.
I imagine a lot of software engineers / entrepreneurs aren’t too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain.
So some people could look at this and say, “sure, this test is to be expected”, and others would be impressed by what LLMs + simple tools are capable of.
I think the correct way to address this is by also testing the other models with agent scaffolds that supply web search and a Python interpreter.
I think it’s wrong to jump to the conclusion that non-agent-finetuned models can’t benefit from tools.
See for example:
Frontier Math result: https://x.com/Justin_Halford_/status/1885547672108511281
o3-mini got 32% on Frontier Math (!) when given access to a Python tool. In an AMA, @kevinweil / @snsf (OAI) both referenced tool use with reasoning models, including retrieval (!), as a future rollout.
METR RE-bench: models are tested with agent scaffolds.
AIDE and Modular refer to different agent scaffolds: Modular is a very simple baseline scaffold that just lets the model repeatedly run code and see the results, while AIDE is a more sophisticated scaffold that implements a tree-search procedure.
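To make the "Modular"-style setup concrete, here is a minimal sketch of that kind of scaffold: prompt the model, execute any code it writes, show it the output, and repeat. The query_model helper, the <python> tag convention, and the loop details are placeholders of my own, not METR's actual harness.

```python
# Minimal sketch of a "Modular"-style scaffold as described above: the model is
# repeatedly prompted, any code it emits is executed, and the output is fed back
# so it can iterate. Illustrative only: query_model() is a placeholder for
# whichever chat API you use, and the <python>...</python> convention is an
# assumption of this sketch, not METR's actual harness.
import re
import subprocess


def query_model(messages: list[dict]) -> str:
    """Placeholder: send the conversation to an LLM and return its reply text."""
    raise NotImplementedError


def extract_code(reply: str) -> str | None:
    """Pull the first <python>...</python> block out of the reply, if any."""
    match = re.search(r"<python>(.*?)</python>", reply, re.DOTALL)
    return match.group(1) if match else None


def modular_loop(task: str, max_steps: int = 10) -> list[dict]:
    """Let the model repeatedly run code and see the results, up to max_steps times."""
    messages = [{
        "role": "user",
        "content": task + "\n\nPut any code you want to run inside <python>...</python> "
                          "tags; you will be shown its output and can then continue.",
    }]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        code = extract_code(reply)
        if code is None:  # no code emitted: treat the reply as the final answer
            break
        # Execute the code in a subprocess and feed stdout/stderr back to the model.
        result = subprocess.run(
            ["python", "-c", code], capture_output=True, text=True, timeout=120
        )
        messages.append({
            "role": "user",
            "content": f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}",
        })
    return messages
```

An AIDE-style scaffold would, roughly, replace this single linear loop with a tree search over candidate solutions, expanding the more promising branches instead of always continuing from the latest attempt.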
Yes, they do highlight this difference. I wonder how full o3 scores? It would be interesting to know how much of the improvement comes from o3's improved reasoning and how much from the sequential research procedure.
And how much the improved reasoning is from using a different base model vs. different post-training. It’s possible R1-like training didn’t work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.
How o3-mini scores: https://x.com/DanHendrycks/status/1886213523900109011
10.5-13% on the text-only part of HLE (text-only questions are 90% of the benchmark)
[corrected the above to read “o3-mini”, thanks.]
This is for o3-mini, while the ~25% figure for o3 from the tweet you linked is simply restating the Deep Research evals.