I don’t know how problematic it is to assume a logistic function for this data.
The logistic is just one of many functions that are a reasonable fit for the P(success) vs length data. You can use lots of different curves and still get the exponential horizon trend; it's not specific to the log-logistic.
E.g. Here’s a silly log-linear fit that I quickly threw together:
What it looks like on some histograms
Still exponential (the error bars get huge because some of the fits are very bad)
Here’s all the histograms if you want to take a look
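For anyone who wants to poke at this themselves, here's a minimal sketch of what a log-linear fit like this could look like (the exact parameterization and clipping below are my guesses, not METR's actual code):

```python
import numpy as np
from scipy.optimize import curve_fit

def loglinear(log2_len, a, b):
    # P(success) modeled as a linear function of log2(task length),
    # clipped to [0, 1]; a "silly" alternative to the logistic.
    return np.clip(a + b * log2_len, 0.0, 1.0)

def fifty_percent_horizon(lengths_minutes, successes):
    x = np.log2(lengths_minutes)
    (a, b), _ = curve_fit(loglinear, x, successes, p0=[1.0, -0.1])
    # 50% horizon: the length (in minutes) where the fitted line crosses 0.5
    return 2.0 ** ((0.5 - a) / b)
```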
It’s not the function being used to fit per-model P(success) vs length that causes the exponential horizon trend on our task suite. I think it’s more that (see the toy sketch after this list):
our tasks are ~log-distributed in length (as measured by human time-to-complete)
models’ ‘raw scores’ on these log-distributed tasks are getting better at a ~linear rate
(and I guess also that models have a pretty consistent pattern of finding our longer tasks harder, such that it’s reasonable to summarise their performance with a single number like 50% horizon)
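To see why those two ingredients are enough on their own, here's a toy simulation (my own construction, not METR's code; the logistic below is just a convenient ground-truth curve, and the argument only needs some curve that shifts right in log-length at a constant rate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task suite: lengths log-uniform between 1 minute and ~1 week.
lengths = np.exp(rng.uniform(np.log(1.0), np.log(10_000.0), size=5_000))

def p_success(lengths, skill):
    # Toy ground truth: success probability falls off with log-length,
    # and raising 'skill' shifts the curve right by a constant amount
    # in log-length.
    return 1.0 / (1.0 + np.exp(np.log(lengths) - skill))

for step, skill in enumerate(np.linspace(2.0, 7.0, 6)):
    raw_score = p_success(lengths, skill).mean()  # grows ~linearly in skill
    horizon = np.exp(skill)                       # grows exponentially in skill
    print(f"step {step}: raw score {raw_score:.2f}, "
          f"50% horizon ~{horizon:.0f} min")
```

With log-uniform lengths, the mean score rises roughly linearly as the curve shifts, while the length where P(success) = 0.5 grows exponentially, regardless of which function you later use to fit the curve.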
(I work at METR but I’m speaking for myself here. Also thanks for the post, appreciate people digging into the data.)
Thanks for checking this. Log-linear isn’t that different from logistic in how it would affect the downstream prediction. Could you (someone at METR) update the public all-results file on GitHub so we can play around with this data?
I am particularly curious to know what would happen if we took the 50% horizon to be the start of the first bar where the model drops below 50% accuracy. This increases uncertainty, but it would be interesting to see what trend comes out, and how model rankings change (is Opus 4.5 a big update?).
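For concreteness, the estimator I have in mind is something like this (the bin layout is my own assumption):

```python
import numpy as np

def first_bar_below_50(lengths, successes, n_bins=12):
    # Log-spaced bins over the observed task lengths (binning is a guess)
    edges = np.geomspace(lengths.min(), lengths.max(), n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (lengths >= lo) & (lengths < hi)
        if in_bin.any() and successes[in_bin].mean() < 0.5:
            return lo  # start of the first bar below 50% accuracy
    return edges[-1]   # model never drops below 50% on this suite
```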
I do expect it would still be an exponential trend, and agree with you that the underlying data distribution (specifically the topics aligning exactly with frontier lab priorities) is the riskier confounder. One could argue for building the suite this way, but it reduces the chances of the horizon length being relevant outside the model’s strongest areas.
Seconding that updating the data on GitHub would be useful.
A few months ago, I had a similar concern, so I redid the METR plot but estimated the time horizon with the following algorithm.
Loop through a fine grid of times (1 min, 2 min, 3 min, etc)
For each time, use a local linear regression to estimate average probability of success at this time (any differentiable function is locally linear, so local linear is a good nonparametric estimator)
If you cannot reject that the probability at this time is 50%, then add this time to the model’s time-horizon confidence set.
For each model, this gives you a 95% confidence set for the 50% time horizon. This approach doesn’t naturally come with a point estimate though, so for the following graph I just take the median of the set.
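A minimal sketch of how that loop could be implemented, assuming a Gaussian kernel and a simple normal-approximation test (those details are mine, not necessarily what was actually used):

```python
import numpy as np

def horizon_confidence_set(times, successes, grid_minutes, bandwidth=0.5):
    """95% confidence set for the 50% time horizon via local linear
    regression on log-time."""
    log_t = np.log(times)
    y = np.asarray(successes, dtype=float)
    in_set = []
    for t0 in grid_minutes:
        # Gaussian kernel weights centered at log(t0)
        w = np.exp(-0.5 * ((log_t - np.log(t0)) / bandwidth) ** 2)
        # Weighted least squares: the intercept estimates P(success) at t0
        X = np.column_stack([np.ones_like(log_t), log_t - np.log(t0)])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        p_hat = np.clip(beta[0], 1e-6, 1 - 1e-6)
        n_eff = w.sum() ** 2 / (w ** 2).sum()  # effective sample size
        se = np.sqrt(p_hat * (1 - p_hat) / n_eff)
        if abs(p_hat - 0.5) <= 1.96 * se:      # cannot reject p(t0) = 0.5
            in_set.append(t0)
    return np.array(in_set)

# Point estimate as described above: the median of the set, e.g.
# horizon = np.median(horizon_confidence_set(times, successes,
#                                            np.arange(1, 16 * 60)))
```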
In general, confidence intervals are a bit wider, but you get a similar overall trend. The wider CIs are because the original METR graph gets statistical power from the logistic parametric assumption (e.g. it lets you use far-away datapoints to help estimate where the 50% time-horizon cutoff is, while a nonparametric approach only uses nearby data).
Because the graph is in log space, even changing the estimate by 75-150% does not impact the overall trend.
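(Concretely, a 150% change only multiplies the estimate by 2.5, i.e. a shift of log10(2.5) ≈ 0.4 on a log10 axis, small against a trend spanning several orders of magnitude.)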
This is cool! I think I’m updating toward the logistic fit not mattering. The question I have now is: what would it have taken, on this underlying data, for the log-linear trend not to hold? My guess is models not making progress for months, staying at similar aggregate accuracy (with success rates staying roughly inversely correlated with task length).
Thanks for the histograms. Is the raw data available somewhere?
Just eyeballing it:
Accuracy growth rate for Opus 4.5 in the 4-8 hour range is what you’d expect given the trendline from Sonnet to Sonnet 4.5.
The 8-16 hour growth came in ~3 months ahead of target.
The 2-4 hour growth is a month or so ahead of target.
Aligns with my sense that the model is a month, maybe two, ahead of what is expected, and that a lot of this jump (4.5 months ahead of expected) comes from artifacts of the curve fitting.