I’m confused by the change in the METR trend

Measuring AI Ability to Complete Long Tasks—METR

In their original 2025 paper, METR noticed that the slope of the trendline (equivalently, the task-horizon doubling time) for models released in 2024 and later differs from the slope for pre-2024 models.
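For reference, a slope on a log-linear plot converts directly into a doubling time: if log10(horizon) grows by s per year, the horizon doubles every log10(2)/s years. A minimal sketch with hypothetical slope values (illustrative only, not METR's fitted numbers):

```python
import math

def doubling_time_months(slope_per_year):
    # slope is in log10-minutes per year; the horizon doubles
    # every log10(2) / slope years.
    return 12 * math.log10(2) / slope_per_year

# Hypothetical slopes, for illustration only:
print(f"{doubling_time_months(0.5):.1f} months")  # a slower, pre-2024-style trend
print(f"{doubling_time_months(1.0):.1f} months")  # a faster, post-2024-style trend
```

A steeper slope means a shorter doubling time, so "the slope changed" and "the doubling time changed" are the same claim.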

First, I decided to check whether a piecewise linear function fits the data better than a simple linear function. If it doesn't, then the apparent change in trend is a random fluke and there is nothing worth talking about.

Here is the data so far (SOTA models only):

Note: the Y axis has human-friendly labels, but the data used in all further calculations is log10(raw value in minutes).

The piecewise linear function clearly provides a better fit, based on the Bayesian information criterion (BIC; lower is better) and on a qualitative "bro just look at it" assessment. I added RMSE, MAE and R² as extra information. Keep in mind that since a piecewise linear function has more parameters, it is expected to fit the data better than a single line. BIC penalizes model complexity (the number of degrees of freedom), so it's more relevant here than RMSE/MAE/R².
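A sketch of this comparison, on synthetic stand-in data (the real METR numbers are not reproduced here): fit both models by least squares, grid-searching the breakpoint for the piecewise model, then compare Gaussian BICs.

```python
import numpy as np

def bic(rss, n, k):
    # Gaussian BIC up to a constant: n*ln(RSS/n) + k*ln(n); lower is better.
    return n * np.log(rss / n) + k * np.log(n)

def fit_linear(t, y):
    X = np.column_stack([np.ones_like(t), t])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((X @ coef - y) ** 2)

def fit_piecewise(t, y):
    # Continuous two-segment line: grid-search the breakpoint t0; for each
    # fixed t0 the model is linear in its remaining parameters.
    best_rss, best_t0 = np.inf, None
    for t0 in np.linspace(t[1], t[-2], 200):
        X = np.column_stack([np.ones_like(t), t, np.maximum(0.0, t - t0)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((X @ coef - y) ** 2)
        if rss < best_rss:
            best_rss, best_t0 = rss, t0
    return best_rss, best_t0

# Synthetic stand-in data: years since 2019 vs. log10(horizon in minutes),
# with a kink at t = 5.1 (illustrative values, not METR's).
rng = np.random.default_rng(0)
t = np.linspace(0.5, 6.5, 14)
y = -1.0 + 0.4 * t + 0.5 * np.maximum(0.0, t - 5.1) + rng.normal(0, 0.1, t.size)

n = t.size
rss_lin = fit_linear(t, y)
rss_pw, t0 = fit_piecewise(t, y)
bic_lin = bic(rss_lin, n, k=3)  # intercept, slope, noise variance
bic_pw = bic(rss_pw, n, k=5)    # plus second slope and breakpoint
print(f"BIC linear: {bic_lin:.1f}  BIC piecewise: {bic_pw:.1f}")
```

On data generated with a genuine kink, the piecewise model wins on BIC despite its two extra parameters; on data generated from a single line, the ~2·ln(n) complexity penalty usually flips the comparison.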

However, I’m not satisfied. Let’s randomly remove 20% of datapoints from this graph, do it a few thousand times, and see how frequently the piecewise linear function provides a better fit according to BIC.
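The subsampling check described above can be sketched like this, again on synthetic stand-in data rather than the real METR points: drop 20% of the datapoints at random, refit both models, and count how often piecewise wins on BIC.

```python
import numpy as np

def bic(rss, n, k):
    # Gaussian BIC up to a constant: n*ln(RSS/n) + k*ln(n); lower is better.
    return n * np.log(rss / n) + k * np.log(n)

def best_rss(t, y, breakpoint=False):
    # Least-squares RSS; with breakpoint=True, grid-search a continuous
    # two-segment line (the model is linear once the breakpoint is fixed).
    def rss_for(X):
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((X @ coef - y) ** 2)
    base = np.column_stack([np.ones_like(t), t])
    if not breakpoint:
        return rss_for(base)
    return min(rss_for(np.column_stack([base, np.maximum(0.0, t - t0)]))
               for t0 in np.linspace(t.min() + 0.5, t.max() - 0.5, 60))

# Synthetic stand-in data: years since 2019 vs. log10(horizon in minutes),
# with a kink at t = 5.1 (illustrative values, not METR's).
rng = np.random.default_rng(0)
t_all = np.linspace(0.5, 6.5, 14)
y_all = (-1.0 + 0.4 * t_all + 0.5 * np.maximum(0.0, t_all - 5.1)
         + rng.normal(0, 0.1, t_all.size))

trials, wins = 500, 0
for _ in range(trials):
    keep = rng.choice(t_all.size, size=int(0.8 * t_all.size), replace=False)
    t, y = t_all[keep], y_all[keep]
    wins += bic(best_rss(t, y, breakpoint=True), t.size, k=5) \
            < bic(best_rss(t, y), t.size, k=3)
print(f"piecewise wins in {wins / trials:.0%} of 80% subsamples")
```

If the piecewise preference survives across most subsamples, it is not driven by a handful of influential points.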

It’s pretty clear that piecewise linear fits the METR data better. But it’s possible that this is an artifact of METR’s methodology. Is there any other benchmark where a similar change could show up? ECI, the Epoch Capabilities Index, doesn’t go back far enough in time to be useful here: the oldest SOTA model on ECI is GPT-4, released in 2023. If anyone knows a benchmark that includes models from the oldest (GPT-2 and GPT-3) to the newest, let me know.

Ok, let’s say that the change in the trend is real—it’s not just the line on the graph that has changed, the underlying reality has changed. What could be the cause?

  1. RLHF. Doesn’t fit: the earliest RLHFed model was InstructGPT, released in 2022, well before the change in trend.

  2. Chain-of-Thought. This fits better, but not perfectly. The piecewise linear fit puts the start of the faster trend around February or March 2024. o1-preview, the first model to use CoT natively, was released in September 2024, many months after the change in trend, and several non-CoT models sit on the faster trend. Even if the estimated date of the change is off by a month or two, the trend would still have changed before o1-preview was released, and several non-CoT models would still be on the faster trend. I’m leaning towards this explanation, as explanations 1 and 3 seem much less likely.

  3. Some secret sauce that the labs are very good at hiding. This seems unlikely: if it were important enough to change the trend permanently, as opposed to offering a one-time improvement, it would’ve been common knowledge by now.

  4. EDIT: RLVR (reinforcement learning with verifiable rewards) seems like a plausible candidate, but I don’t know exactly when the first heavily-RLVRed model was released.

EDIT: here is a graph with two linear fits for CoT and non-CoT models. Unlike the previous graph, this one also includes models that weren’t SOTA at the time of their release.

This seems like good evidence that CoT is responsible for the change in the trend. Though, it’s possible that labs started allocating a lot of compute to RLVR at the same time as CoT became a thing.

In conclusion:

  1. The change in the METR trend is extremely unlikely to be random noise.

  2. It’s possible that the change is a systematic error caused by some methodological bias, rather than a real shift in capabilities.

  3. To the best of my knowledge, no other benchmark includes all SOTA models from GPT-2 to the present day, so there is no way to sanity-check by looking for a similar change on a different benchmark.

  4. Assuming this is not a methodological artifact, the change can probably be attributed to CoT, but it’s not certain: it could be due to RLVR, if heavy RLVR investment coincided with the adoption of CoT.