Yes, they do highlight this difference. I wonder how full o3 scores? It would be interesting to know how much improvement is based on o3's improved reasoning and how much is the sequential research procedure.
And how much the improved reasoning is from using a different base model vs. different post-training. It’s possible R1-like training didn’t work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.
How o3-mini scores: https://x.com/DanHendrycks/status/1886213523900109011
10.5-13% on the text-only part of HLE (text-only questions make up 90% of the benchmark)
[corrected the above to read “o3-mini”, thanks.]
This is for o3-mini, while the ~25% figure for o3 from the tweet you linked simply restates the deep research evals.