Notice how the log-linear fit here only looks good for the SWAA data, in the 1 second to 1 minute range. Something completely different is going on for tasks longer than 1 minute, clearly not explained by the log-linear fit. If you tried to fit a best-fit line to the blue points (the lengths of tasks we care about after 2024), you’d get a very different, almost vertical line, with a very low R^2.
I don’t think this is true. I got Claude to clone the repo and reproduce it without the SWAA data points. The slope is ~identical (−0.076 rather than the original −0.072) and the correlation is still pretty good (R^2 = 0.51).
Edit: That was with HCAST and RE-bench. Just HCAST gives slope = −0.077 and R^2 = 0.48. I think it makes more sense to include RE-bench.
Edit 2: Updated the slopes. Now the slope is per doubling, as in the paper (so the first slope matches the one in the paper). I think the previous slopes were measured per factor of e instead.
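For concreteness, the conversion between the two slope conventions is just a factor of ln 2. A minimal sketch (the −0.05 value is made up for illustration, not one of the fitted slopes):

```python
import math

# A log-linear fit score = a + b * log(task_minutes) means different things
# depending on the log base: natural log gives a slope per factor of e,
# log base 2 gives a slope per doubling. Since log2(t) = ln(t) / ln(2),
# converting a per-e slope to a per-doubling slope multiplies by ln(2).
slope_per_e = -0.05  # hypothetical slope measured per factor of e
slope_per_doubling = slope_per_e * math.log(2)

print(round(slope_per_doubling, 4))  # -0.0347
```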
Thanks, I should’ve done that myself instead of lazily mentioning what it “looked like”. R^2 = 0.51 is still a lot lower than the initial 0.83. Though, same as before, I’m not fully sure what this implies for the logistic model chosen, and for downstream conclusions.
Incidentally, your intuition might’ve been misled by one or both of:
- underestimating the number of data points clustered tightly together in the bottom right (they look less significant because they take up less space);
- trying to imagine a line that minimizes the perpendicular distance to all data points, whereas linear regression works by minimizing the vertical distance to all data points.
As an illustration of the last point: here’s a bonus plot where the green line minimizes the horizontal squared distance instead, i.e. predicting human minutes from average model score. I wouldn’t quite say it’s almost vertical, but it’s much steeper.
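To make the vertical-vs-horizontal point concrete, here’s a minimal sketch with synthetic data (the numbers are made up; the point is just that regressing y on x and x on y give different lines, and the “horizontal” one is always at least as steep):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the plot: x = log task length, y = model success rate.
x = rng.uniform(0, 8, 100)
y = np.clip(1 - 0.12 * x + rng.normal(0, 0.15, 100), 0, 1)

# Ordinary least squares minimizes *vertical* distances: predict y from x.
b_yx = np.polyfit(x, y, 1)[0]

# Swapping the roles minimizes *horizontal* distances: predict x from y.
# Re-expressed as a slope in the original (x, y) plane, it is 1 / slope(x on y).
b_xy = 1 / np.polyfit(y, x, 1)[0]

# slope(y on x) = r * sy/sx, while the horizontal fit gives (sy/sx) / r,
# so since |r| <= 1 the horizontal-fit line is always at least as steep.
print(b_yx, b_xy)
```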
That’s an interesting point. If I kept adding points to the right, i.e. longer and longer tasks which I know the model would fail on, would it keep making the line flatter? That kind of makes me wonder, once again, whether it’s even a good idea to try to fit a line here...
Yeah, a line is definitely not the “right” relationship, given that the y-axis is bounded to 0–1 and a line isn’t. A sigmoid or some other 0–1 function would make more sense, and more so the further you go outside the sensitive middle region of the success rate. I imagine the purpose of this graph was probably to sanity-check that the human baselines roughly track difficulty for the AIs as well. (Which looks pretty true to me when eyeballing the graph. The biggest eyesore is definitely the 0% success rate in the 2–4h bucket.)
Graph for a 2-parameter sigmoid, assuming it tops out at 1 and bottoms out at 0.
If you instead do a 4-parameter sigmoid with a free top and bottom, the version without SWAA asymptotes at 0.7 to the left instead, which looks terrible. (With SWAA the left asymptote is a little above 1; both versions get right asymptotes a little below 0.)
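For reference, a sketch of what the two fits look like with scipy’s `curve_fit`, on synthetic data (the function names and data here are my own stand-ins, not METR’s code):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid2(log_t, k, m):
    # 2-parameter logistic: asymptotes fixed at 1 (left) and 0 (right)
    return 1 / (1 + np.exp(k * (log_t - m)))

def sigmoid4(log_t, k, m, top, bottom):
    # 4-parameter logistic: the two asymptotes are free parameters
    return bottom + (top - bottom) / (1 + np.exp(k * (log_t - m)))

# Synthetic stand-in for (log human minutes, mean model success rate).
rng = np.random.default_rng(1)
log_t = np.linspace(-2, 8, 40)
score = np.clip(sigmoid2(log_t, 1.0, 3.0) + rng.normal(0, 0.05, 40), 0, 1)

p2, _ = curve_fit(sigmoid2, log_t, score, p0=[1.0, 2.0])
p4, _ = curve_fit(sigmoid4, log_t, score, p0=[1.0, 2.0, 1.0, 0.0])
print(p2, p4)
```

With clean data like this the 4-parameter fit recovers asymptotes near 1 and 0; the interesting cases are the ones in the thread where the free asymptotes land somewhere implausible.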
(Wow, graphing is so fun when I don’t have to remember matplotlib commands. To be clear, I’m not really checking the language models’ work here beyond assessing the consistency and reasonableness of the output, so discount depending on how much you trust them to graph things correctly in METR’s repo.)
Here R (the square root of R²) is Pearson correlation, which checks for linear association. The better measure here would be to use Spearman correlation on the original data, which checks for any monotonic association. Spearman is more principled than trying to transform the data first with some monotonic function (e.g. various sigmoids) before applying Pearson.
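A quick sketch of the difference with `scipy.stats`, on synthetic data (the key property is that Spearman is invariant under any strictly monotone transform of either variable, since it only depends on ranks):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
# A monotone but non-linear (sigmoid-shaped) relationship plus a little noise.
y = 1 / (1 + np.exp(-2 * x)) + rng.normal(0, 0.02, 200)

pear, _ = pearsonr(x, y)    # linear association
spear, _ = spearmanr(x, y)  # monotonic association

# Applying a strictly monotone transform (here exp) leaves the ranks,
# and therefore Spearman, exactly unchanged; Pearson has no such guarantee.
spear_transformed, _ = spearmanr(np.exp(x), y)
print(pear, spear, spear_transformed)
```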
Hm.

R² = 1 − (mean squared error / variance)

Mean squared error seems pretty principled. Normalizing by variance to make it more comparable across distributions seems pretty principled.
I guess after that it would seem more natural to take the square root (to get RMSE normalized by standard deviation) than to subtract from 1. But the latter is a simple enough transformation and makes it comparable to the (better-motivated) R^2 for linear models, so it’s more commonly reported than RMSE/STD.
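The relationship is easy to verify numerically (synthetic data): for a simple OLS fit, 1 − MSE/Var(y) is exactly the squared Pearson correlation, and RMSE/STD is sqrt(1 − R²).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

# OLS fit and its residual mean squared error.
slope, intercept = np.polyfit(x, y, 1)
mse = np.mean((y - (slope * x + intercept)) ** 2)

r2_from_mse = 1 - mse / np.var(y)               # R^2 = 1 - MSE/Var
r2_from_pearson = np.corrcoef(x, y)[0, 1] ** 2  # squared Pearson correlation
rmse_over_std = np.sqrt(mse / np.var(y))        # the alternative: sqrt(1 - R^2)

print(r2_from_mse, r2_from_pearson, rmse_over_std)
```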
Anyway, Spearman r is −0.903 (squared: 0.82) and −0.710 (squared: 0.50), so basically the same.
This doesn’t contradict common sense if you remember that Claude Opus 4.5 has a 50%-time horizon of around 4 hours 49 minutes (95% confidence interval: 1 hour 49 minutes to 20 hours 25 minutes).
Just think about it: from 1 hour 49 minutes up to 20 hours 25 minutes.
There simply isn’t enough data for reliable conclusions.
Lower R^2 is a natural consequence of the x-axis being shorter, in an experimental setup with both signal and noise. In the extreme, imagine we only benchmarked models on 47-minute tasks. AIs will naturally do better at some 47-minute tasks than others. Since R^2 is the variance explained by the correlation with x, and x is always 47 minutes, all the variance in task success rate will be due to other factors and the R^2 will be zero.
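A quick simulation of this mechanism (the slope and noise level are made up, but the effect is general): with the same underlying trend and the same noise, a narrower range of x explains less of the variance.

```python
import numpy as np

rng = np.random.default_rng(3)

def r2_for_range(x_spread, n=500, noise=0.1):
    """R^2 of a noisy linear relation when x is confined to a given spread."""
    x = rng.uniform(0, x_spread, n)           # log task length, narrow or wide
    y = -0.07 * x + rng.normal(0, noise, n)   # same signal and noise everywhere
    return np.corrcoef(x, y)[0, 1] ** 2

wide = r2_for_range(10.0)   # tasks span many doublings
narrow = r2_for_range(1.0)  # tasks span a narrow range

# Shrinking the x range shrinks the signal variance but not the noise,
# so the same trend accounts for a smaller share of the total variance.
print(wide, narrow)
```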
In fact, this is why we constructed SWAA: to increase the R^2 in worlds where the trend is real. In a world where models were not actually better at tasks with shorter human times, the R^2 would be lower when you add more data, so it’s a very scientifically normal thing to do.