Why is it so surprising? The metric has many issues, but the METR 80% time horizon for Claude Opus 4.5 is 27 mins, with a 95% CI from 7 mins to 86 mins.
Couple reasons:
The METR time horizon is for fully autonomous execution of tasks. I’d expect that giving the model hints when it gets stuck would help substantially, and on other tasks I do observe that this approach seems to work. But the one time I tried to actually measure and quantify it, this happened.
The part Claude actually got stuck on was the part that looked like a LeetCode medium problem with a slight twist, not the part that requires actually understanding the application-specific logic. If it had gotten stuck on “write regression tests” (as in fact it did once the initial hurdle was cleared), that would not have been surprising.
Like, it does make sense that “a 50% success rate at 4 hour tasks” looks like “approximately 100% success rate at most constituent 30 minute subtasks, combined with an occasional ~0% success rate at rare subtasks that usually don’t come up in a 4 hour task” rather than “a uniform 92% success rate at each 30 minute subtask”, but it still feels a little jarring to experience.
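For what it's worth, here is where the 92% comes from, as a rough sketch. It assumes a 4 hour task splits into 8 independent 30 minute subtasks (my simplification), and the function names and the 50% "blocker shows up" rate are made up for illustration, not anything METR measures.

```python
SUBTASKS = 8  # assumption: a 4 hour task is 8 independent 30 minute subtasks


def uniform_success(p_subtask: float, n: int = SUBTASKS) -> float:
    """Chance of finishing the whole task if every subtask passes with probability p_subtask."""
    return p_subtask ** n


def blocker_success(p_blocker_present: float) -> float:
    """Chance of finishing if ordinary subtasks always pass and a blocker subtask, when present, always fails."""
    return 1.0 - p_blocker_present


# Both pictures give roughly a 50% success rate on 4 hour tasks overall.
print(round(uniform_success(0.92), 3))  # 0.513
print(blocker_success(0.5))             # 0.5
```

The two pictures agree on the headline number but feel very different in practice: in the uniform picture a retry has the same roughly even odds every time, while in the blocker picture retries on a task that contains the blocker keep failing at the same spot.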