The RE-bench result covers just five tasks, while the second graph is for a broader suite of almost 200 tasks. I wouldn’t read much into o3 doing worse than other models on RE-bench, given how small that sample is.
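To put a rough number on the small-sample point, here is a minimal sketch. It assumes tasks are independent and share the same per-task score spread, which real benchmarks won't satisfy exactly. `ASSUMED_SD` is a made-up value for illustration, not a figure from either evaluation. Under those assumptions the standard error of a mean score shrinks like 1/sqrt(n), so the ratio between the two suites doesn't depend on the spread you pick.

```python
import math


def standard_error(per_task_sd: float, n_tasks: int) -> float:
    """Standard error of a mean score over n_tasks independent tasks."""
    return per_task_sd / math.sqrt(n_tasks)


# Hypothetical per-task spread; not taken from either evaluation.
ASSUMED_SD = 0.3

se_small = standard_error(ASSUMED_SD, 5)    # five-task suite
se_large = standard_error(ASSUMED_SD, 200)  # suite of roughly 200 tasks

# The ratio depends only on the task counts: sqrt(200 / 5).
ratio = se_small / se_large

print(f"SE with 5 tasks:   {se_small:.3f}")
print(f"SE with 200 tasks: {se_large:.3f}")
print(f"ratio: {ratio:.1f}")
```

The ratio is sqrt(40), about 6.3, so under these assumptions an average over five tasks is roughly six times noisier than one over 200. That makes a gap between models on the small suite much easier to produce by chance.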