Seconding that updating the data on GitHub would be useful.
A few months ago, I had a similar concern, and redid the METR plot but estimated the time horizon with the following algorithm:
Loop through a fine grid of times (1 min, 2 min, 3 min, etc.).
For each time, use a local linear regression to estimate the average probability of success at that time (any differentiable function is locally linear, so local linear regression is a good nonparametric estimator).
If you cannot reject the hypothesis that this probability is 50%, add that time to the model's confidence set for the time horizon.
For each model, this gives you a 95% confidence set for the 50% time horizon. This approach doesn’t naturally come with a point estimate though, so for the following graph I just take the median of the set.
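For concreteness, here is a minimal sketch of how the steps above could look in code. This is not the exact code behind the graph. The Gaussian kernel, the bandwidth of 1.0, running the regression in log task length, the heteroskedasticity-robust standard error, the pointwise Wald test, and all function names are assumptions made for illustration. The data is synthetic, generated with a true 50% horizon of 30 minutes.

```python
import numpy as np

def local_linear_fit(log_t, y, x0, bandwidth):
    """Kernel-weighted linear fit centered at x0.

    Returns the fitted success probability at x0 and a sandwich
    (heteroskedasticity-robust) standard error for it.
    """
    u = (log_t - x0) / bandwidth
    w = np.exp(-0.5 * u**2)                     # Gaussian kernel weights (assumed)
    X = np.column_stack([np.ones_like(log_t), log_t - x0])
    XtW = X.T * w                               # shape (2, n)
    bread = np.linalg.inv(XtW @ X)
    beta = bread @ (XtW @ y)                    # beta[0] is the estimate at x0
    resid = y - X @ beta
    meat = (X.T * (w * resid) ** 2) @ X
    cov = bread @ meat @ bread
    return beta[0], np.sqrt(cov[0, 0])

def horizon_confidence_set(task_minutes, success, grid_minutes,
                           bandwidth=1.0, z=1.96):
    """Keep every grid time where P(success) = 50% is not rejected."""
    log_t = np.log(task_minutes)
    kept = []
    for t in grid_minutes:
        p_hat, se = local_linear_fit(log_t, success, np.log(t), bandwidth)
        if abs(p_hat - 0.5) <= z * se:          # pointwise 95% Wald test
            kept.append(t)
    return np.array(kept)

# Synthetic data for one hypothetical model (not METR data).
rng = np.random.default_rng(0)
n_tasks = 400
task_minutes = np.exp(rng.uniform(np.log(1.0), np.log(480.0), n_tasks))
p_true = 1.0 / (1.0 + np.exp(np.log(task_minutes) - np.log(30.0)))
success = (rng.random(n_tasks) < p_true).astype(float)

grid_minutes = np.arange(1, 481)                # 1 min, 2 min, 3 min, ...
conf_set = horizon_confidence_set(task_minutes, success, grid_minutes)
point_estimate = np.median(conf_set)            # median of the set, as in the text
print(conf_set.min(), point_estimate, conf_set.max())
```

The bandwidth is the main tuning choice here: a smaller one uses only tasks very close to each grid time, which widens the set, while a larger one borrows from more distant tasks and moves the estimate closer to a global fit.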
In general, the confidence intervals are a bit wider, but you get a similar overall trend. The CIs are wider because the original METR graph gets statistical power from the logistic parametric assumption (e.g. it lets you use far-away data points to help estimate where the 50% time horizon cutoff is, while a nonparametric approach only uses nearby data).
Because the graph is on a log scale, even changing the estimates by 75-150% does not change the overall trend.
This is cool! I think I’m updating toward the logistic fit not mattering. The question I have now is: what would it have taken on this underlying data for the log-linear trend not to hold? My guess is models not making progress for months, and staying at similar aggregate accuracy (with success rates staying roughly inversely correlated with task length).