Thanks for checking this. Log-linear isn’t that different from logistic in how it would affect the downstream prediction. Could you (someone at METR) update the public all-results file on GitHub so we can play around with this data?
I am particularly curious to know what would happen if we took the 50% horizon to be the start point of the first bar where the model drops below 50% accuracy. This increases uncertainty, but it would be interesting to see what trend comes out, and how model rankings change (is Opus 4.5 a big update?).
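To make the proposal concrete, here is a minimal sketch of that alternative definition. The function name, the bin edges, and the toy data are all hypothetical; the real bars would come from METR's task-length histogram.

```python
import numpy as np

def first_bar_below_half(task_minutes, success, bin_edges):
    """Return the left edge of the first task-length bar whose mean
    success rate is strictly below 50%, or None if no bar qualifies.
    bin_edges are assumed to be sorted in increasing order."""
    task_minutes = np.asarray(task_minutes, dtype=float)
    success = np.asarray(success, dtype=float)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bar = (task_minutes >= lo) & (task_minutes < hi)
        # Skip empty bars: they carry no evidence either way.
        if in_bar.any() and success[in_bar].mean() < 0.5:
            return lo
    return None

# Toy data (made up for illustration only).
edges = [1, 4, 16, 64, 256]
minutes = [2, 3, 8, 10, 12, 30, 40, 50, 100, 200]
solved = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
horizon = first_bar_below_half(minutes, solved, edges)
print(horizon)
```

Note that the result can only ever be one of the bar edges, which is one reason this definition is noisier than a fitted curve.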
I do expect it would still be an exponential trend, and I agree with you that the underlying data distribution (specifically, the topics aligning exactly with frontier lab priorities) is the riskier confounder. One could argue for choosing the tasks this way, but it reduces the chances that the horizon length is relevant outside the models' strongest areas.
Seconding that updating the data on GitHub would be useful.
A few months ago, I had a similar concern, so I redid the METR plot but estimated the time horizon using the following algorithm:
Loop through a fine grid of times (1 min, 2 min, 3 min, etc.).
For each time, use a local linear regression to estimate the average probability of success at that time (any differentiable function is locally linear, so local linear regression is a good nonparametric estimator).
If you cannot reject the hypothesis that this probability is 50%, add that time to the model's time-horizon confidence set.
For each model, this gives you a 95% confidence set for the 50% time horizon. This approach doesn’t naturally come with a point estimate though, so for the following graph I just take the median of the set.
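The steps above can be sketched roughly as follows. This is an illustration under my own assumptions, not the exact code behind the graph: the Gaussian kernel, the bandwidth, regressing on log2 of task length, the robust standard error, the normal-approximation test, and the synthetic data are all choices made for the example, and the function names are hypothetical.

```python
import numpy as np

def local_linear_fit(x, y, x0, bandwidth):
    """Kernel-weighted least squares of y on (1, x - x0).
    Returns the fitted value at x0 and a heteroskedasticity-robust SE."""
    d = x - x0
    w = np.exp(-0.5 * (d / bandwidth) ** 2)    # Gaussian kernel (assumed)
    X = np.column_stack([np.ones_like(d), d])
    XtW = X.T * w                              # shape (2, n)
    A_inv = np.linalg.pinv(XtW @ X)
    beta = A_inv @ (XtW @ y)
    resid = y - X @ beta
    scores = XtW * resid                       # column i is w_i * e_i * x_i
    cov = A_inv @ (scores @ scores.T) @ A_inv  # sandwich variance
    return beta[0], float(np.sqrt(max(cov[0, 0], 0.0)))

def horizon_confidence_set(task_minutes, success, grid_minutes,
                           bandwidth=1.0, z=1.96):
    """Grid times at which P(success) = 0.5 is not rejected at the 5% level."""
    x = np.log2(np.asarray(task_minutes, dtype=float))
    y = np.asarray(success, dtype=float)
    kept = []
    for t in grid_minutes:
        p_hat, se = local_linear_fit(x, y, np.log2(t), bandwidth)
        if abs(p_hat - 0.5) <= z * se:
            kept.append(t)
    return np.array(kept)

# Synthetic data (made up): true 50% horizon at 30 minutes.
rng = np.random.default_rng(0)
task_minutes = 2.0 ** rng.uniform(0, 8, size=400)
true_p = 1.0 / (1.0 + np.exp(np.log2(task_minutes) - np.log2(30.0)))
success = rng.random(400) < true_p

grid = np.arange(1, 257)                       # 1 min, 2 min, ..., 256 min
conf_set = horizon_confidence_set(task_minutes, success, grid)
point_estimate = float(np.median(conf_set))    # median of the set, as above
print(conf_set.min(), conf_set.max(), point_estimate)
```

The bandwidth controls the trade-off described here: a smaller bandwidth uses only nearby tasks and widens the set, while a larger one borrows strength from distant tasks and moves closer to a parametric fit.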
In general, the confidence intervals are a bit wider, but you get a similar overall trend. The CIs are wider because the original METR graph gets statistical power from the logistic parametric assumption (e.g. it lets you use far-away data points to help estimate where the 50% time horizon cutoff is, while a nonparametric approach only uses nearby data).
Because the graph is in log space, even changing the estimate by 75-150% does not impact the overall trend.
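As a rough back-of-the-envelope check, a multiplicative change in an estimate becomes a fixed vertical shift on a log axis. The sketch below assumes that "changing by 75-150%" means multiplying the estimate by 1.75 to 2.5, and it measures the shift in doublings (log base 2); both are my reading, not something stated above.

```python
import math

def shift_in_doublings(relative_change):
    """Vertical shift on a log2 axis when an estimate is multiplied
    by (1 + relative_change)."""
    return math.log2(1.0 + relative_change)

for change in (0.75, 1.50):
    print(f"+{change:.0%} -> {shift_in_doublings(change):.2f} doublings")
```

Under that reading the shifts are about 0.81 and 1.32 doublings, which is small if the trend spans many doublings across models.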
This is cool! I think I’m updating toward the logistic fit not mattering. The question I have now is: what would it have taken, on this underlying data, for the log-linear trend not to hold? My guess is models not making progress for months and staying at similar aggregate accuracy (with success rates staying roughly inversely correlated with task length).