Josh You comments on Why does METR score o3 as effective for such a long time duration despite overall poor scores?

Josh You 3 May 2025 0:37 UTC
5 points
The RE-bench result covers just five tasks; the second graph covers a broader suite of almost 200 tasks. I wouldn’t read much into o3 doing worse than other models on RE-bench, because the sample is so small.
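
To put a rough number on the small-sample point, here is a minimal sketch. It assumes tasks are independent and share the same per-task spread, which real benchmark tasks probably don't. The `PER_TASK_SD` value is a hypothetical placeholder rather than a METR figure, and 200 stands in for "almost 200".

```python
import math


def standard_error(per_task_sd: float, n_tasks: int) -> float:
    """Standard error of a mean score over n_tasks independent tasks."""
    return per_task_sd / math.sqrt(n_tasks)


# Hypothetical per-task standard deviation; not taken from METR's data.
PER_TASK_SD = 1.0

se_small = standard_error(PER_TASK_SD, 5)    # RE-bench: five tasks
se_large = standard_error(PER_TASK_SD, 200)  # broader suite: "almost 200", rounded

# The placeholder spread cancels out, so the ratio depends only on the task counts.
ratio = se_small / se_large
print(f"{ratio:.2f}")
```

Under those assumptions the five-task average is roughly six times noisier than the 200-task one, so a gap between models that looks large on RE-bench could easily be noise.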