Yes, they do highlight this difference. I wonder how full o3 scores? It would be interesting to know how much improvement is based on o3's improved reasoning and how much is the sequential research procedure.
And how much the improved reasoning is from using a different base model vs. different post-training. It’s possible R1-like training didn’t work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.
How o3-mini scores: https://x.com/DanHendrycks/status/1886213523900109011
10.5-13% on the text-only part of HLE (text-only questions make up 90% of the benchmark)
[corrected the above to read “o3-mini”, thanks.]
This is for o3-mini, while the ~25% figure for o3 from the tweet you linked simply restates the deep research evals.