Incidentally, your intuition might’ve been misled by one or both of:
- underestimating the number of data points clustered tightly together in the bottom right. (They look less significant because they take up less space.)
- trying to imagine a line that minimizes the distance to all data points, whereas linear regression works by minimizing the vertical distance to all data points.
As an illustration of the last point: here’s a bonus plot where the green line is minimizing the horizontal squared distance instead, i.e. predicting human minutes from average model score. I wouldn’t quite say it’s almost vertical, but it’s much steeper.
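(For anyone who wants to see the mechanics, here’s a rough numpy sketch of the two fits on made-up data; it’s not METR’s code and the variable names are invented, but it shows why the horizontal-distance line comes out steeper.)

```python
# Rough sketch (not METR's code): "human_minutes" and the scores below are
# made-up stand-ins for per-task human baseline times and mean model scores.
import numpy as np

rng = np.random.default_rng(0)
human_minutes = rng.lognormal(mean=3.0, sigma=1.5, size=80)
x = np.log(human_minutes)                                        # log scale, as in the plot
y = np.clip(1.1 - 0.15 * x + rng.normal(0, 0.1, size=80), 0, 1)  # fake success rates

# Ordinary least squares: minimizes the *vertical* squared distances.
slope_v, intercept_v = np.polyfit(x, y, 1)

# "Reverse" regression: predict x from y (minimizes *horizontal* squared
# distances), then rewrite that fit as a line in the same (x, y) axes.
slope_h_inv, intercept_h_inv = np.polyfit(y, x, 1)
slope_h = 1 / slope_h_inv
intercept_h = -intercept_h_inv / slope_h_inv

# The horizontal-distance line is steeper in magnitude whenever |r| < 1.
print(slope_v, slope_h)
```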
That’s an interesting point. If I kept adding points to the right, i.e. longer and longer tasks which I know the model would fail on, it would keep making the line flatter? That kind of makes me wonder, once again, if it’s even a good idea to try and fit a line here...
Yeah, a line is definitely not the “right” relationship, given that the y-axis is bounded 0-1 and a line isn’t. A sigmoid or some other 0-1 function would make more sense, and more so the further outside the sensitive, middle region of success rate you go. I imagine the purpose of this graph was probably to sanity-check that the human baselines did roughly track difficulty for the AIs as well. (Which looks pretty true to me when eye-balling the graph. The biggest eye-sore is definitely the 0% success rate in the 2-4h bucket.)
Graph for a 2-parameter sigmoid, assuming that you top out at 1 and bottom out at 0.
If you instead do a 4-parameter sigmoid with free top and bottom, the version without SWAA asymptotes at 0.7 to the left instead, which looks terrible. (With SWAA the asymptote is a little above 1 to the left; and they both get asymptotes a little below 0 to the right.)
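(In case anyone wants to reproduce something similar, here’s roughly what such a fit could look like with scipy, on made-up data rather than anything from METR’s repo; sigmoid2 pins the asymptotes at 1 and 0, sigmoid4 leaves them free.)

```python
# Sketch on invented data: fit a 2-parameter sigmoid (asymptotes fixed at 1
# and 0) and a 4-parameter one (free top and bottom) to score vs log minutes.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid2(x, x0, k):
    return 1 / (1 + np.exp(k * (x - x0)))            # top = 1, bottom = 0

def sigmoid4(x, x0, k, top, bottom):
    return bottom + (top - bottom) / (1 + np.exp(k * (x - x0)))

rng = np.random.default_rng(1)
log_minutes = np.linspace(0, 9, 60)                                  # fake x-axis
score = sigmoid2(log_minutes, 4.0, 1.2) + rng.normal(0, 0.05, 60)    # fake scores

params2, _ = curve_fit(sigmoid2, log_minutes, score, p0=[4, 1])
params4, _ = curve_fit(sigmoid4, log_minutes, score, p0=[4, 1, 1, 0])

print("2-param (midpoint, slope):", params2)
print("4-param (midpoint, slope, top, bottom):", params4)
```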
(Wow, graphing is so fun when I don’t have to remember matplotlib commands. TBC I’m not really checking the language models’ work here other than assessing consistency and reasonableness of output, so discount depending on how much you trust them to graph things correctly in METR’s repo.)
Here R (the square root of R²) is Pearson correlation, which checks for linear association. The better measure here would be to use Spearman correlation on the original data, which checks for any monotonic association. Spearman is more principled than trying to transform the data first with some monotonic function (e.g. various sigmoids) before applying Pearson.
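(For concreteness, a toy example of the difference; the data here is invented, just to show the two statistics side by side and why Spearman doesn’t care about the transform.)

```python
# Toy illustration (hypothetical data): scipy exposes both statistics, and
# Spearman is unchanged by any strictly increasing transform of either axis.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
human_minutes = rng.lognormal(mean=3.0, sigma=1.5, size=60)
score = 1 / (1 + 0.02 * human_minutes) + rng.normal(0, 0.03, size=60)  # nonlinear, roughly monotone

print(pearsonr(human_minutes, score)[0])          # depends on the curve's shape
print(pearsonr(np.log(human_minutes), score)[0])  # changes if you transform x
print(spearmanr(human_minutes, score)[0])         # same with or without the log
```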
Hm.
R² = 1 − (mean squared error / variance)
Mean squared error seems pretty principled. Normalizing by variance to make it more comparable to other distributions seems pretty principled.
I guess after that it seems more natural to take the square root (to get RMSE normalized by standard deviation) than to subtract it from 1. But I guess the latter is a simple enough transformation, and it makes the number comparable to the (better-motivated) R² for linear models, which is probably why it’s more commonly reported than RMSE/SD.
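(Spelled out with a tiny invented example, since the algebra is short: the two summaries carry the same information.)

```python
# Made-up numbers just to show the identity: with R² = 1 - MSE/Var(y),
# RMSE normalized by the standard deviation is sqrt(MSE/Var(y)) = sqrt(1 - R²).
import numpy as np

y = np.array([0.9, 0.8, 0.75, 0.4, 0.3, 0.1])          # "observed" scores (invented)
y_hat = np.array([0.85, 0.82, 0.7, 0.45, 0.25, 0.15])  # "fitted" scores (invented)

mse = np.mean((y - y_hat) ** 2)
var = np.var(y)                    # population variance, matching the MSE convention

r_squared = 1 - mse / var
rmse_over_sd = np.sqrt(mse) / np.sqrt(var)

print(r_squared, rmse_over_sd, np.sqrt(1 - r_squared))   # last two are equal
```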
Anyway, Spearman r is −0.903 (squared: 0.82) and −0.710 (squared: 0.50), so basically the same.
This doesn’t contradict common sense if you remember that Claude Opus 4.5 has a 50%-time horizon of around 4 hours 49 minutes (95% confidence interval: 1 hour 49 minutes to 20 hours 25 minutes).
Just think about it: from 1 hour 49 minutes up to 20 hours 25 minutes.
There simply isn’t enough data for reliable conclusions.