I personally think the stronger argument here is that Claude models' capabilities are not growing in a way consistent with the assumption that longer tasks are harder, as you can see if you look at the histograms. (Grok 4 was similar.)
Both Sonnet 4.5 and Opus 4.5 performed better in the 8 to 16 hour bracket than in the 2 to 4 hour bracket, which is highly inconsistent with the task-length difficulty model. The model appears to have been broken at least since 3.5 Sonnet, given how flat the results are across the 2 to 16 hour tasks.
You end up in a case where the Sonnet 4.5 curve has a higher % of the solved tasks under it than the Opus 4.5 curve (note how Opus 4.5 gets 0 tasks right in the 16 to 32 hour window, even though the distribution implies it should be more like 25%). That is, the “gain” this implies is dramatically overstated. [1]
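The bucket-level check described above can be sketched in a few lines. This is only an illustration of the reasoning, not METR's method: the function name, the bucket labels, and especially the success rates are made-up placeholders, not numbers read off the actual histograms. The assumption being tested is that, if longer tasks are harder, success rate should not go up as the bucket gets longer.

```python
from typing import List, Sequence, Tuple


def monotonicity_violations(
    rates: Sequence[float], tol: float = 0.0
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with i < j where the longer bucket j
    has a higher success rate than the shorter bucket i by more than tol.

    Buckets are assumed to be ordered from shortest to longest task length.
    An empty result means the rates are consistent with "longer = harder".
    """
    return [
        (i, j)
        for i in range(len(rates))
        for j in range(i + 1, len(rates))
        if rates[j] > rates[i] + tol
    ]


# Hypothetical placeholder data, NOT the real per-bucket results.
buckets = ["<2h", "2-4h", "4-8h", "8-16h", "16-32h"]
rates = [0.80, 0.30, 0.35, 0.45, 0.00]

for i, j in monotonicity_violations(rates):
    print(f"{buckets[j]} ({rates[j]:.2f}) beats {buckets[i]} ({rates[i]:.2f})")
```

With real per-bucket counts you would probably want a nonzero `tol`, or a proper test that accounts for how few tasks sit in the long buckets, before calling any single inversion meaningful.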
The unfortunate consequence is largely shash42's point: it’s not clear that modeling “task length horizon” is a valid way to view this data. Raw accuracy seems better correlated with time.
[1] An alternative interpretation is that Sonnet 4.5 was much better than the METR curve implied at the time.