How o3-mini scores: https://x.com/DanHendrycks/status/1886213523900109011
10.5-13% on the text-only part of HLE (text-only questions are 90% of the total)
[corrected the above to read “o3-mini”, thanks.]
10.5-13% on text only part of HLE
This is for o3-mini, while the ~25% figure for o3 in the tweet you linked simply restates the deep research evals.