Yeah, a line is definitely not the “right” relationship, given that the y-axis is bounded 0-1 and a line isn’t. A sigmoid or some other 0-1 function would make more sense, and increasingly so the further you go outside the sensitive middle region of success rates. I imagine the purpose of this graph was probably to sanity-check that the human baselines did roughly track difficulty for the AIs as well. (Which looks pretty true to me when eyeballing the graph. The biggest eyesore is definitely the 0% success rate in the 2-4h bucket.)
Graph for a 2-parameter sigmoid, assuming that you top out at 1 and bottom out at 0.
If you instead do a 4-parameter sigmoid with free top and bottom, the version without SWAA asymptotes at 0.7 to the left instead, which looks terrible. (With SWAA the asymptote is a little above 1 to the left; and they both get asymptotes a little below 0 to the right.)
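(For concreteness, here’s roughly what the two fits above look like in code. This is my own sketch with made-up placeholder data, not METR’s actual numbers or their fitting code.)

```python
# Sketch of the 2- and 4-parameter sigmoid fits described above.
# The data points below are hypothetical placeholders, not METR's values.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid2(x, x0, k):
    """2-parameter logistic, pinned to top out at 1 and bottom out at 0."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

def sigmoid4(x, x0, k, top, bottom):
    """4-parameter logistic with free upper and lower asymptotes."""
    return bottom + (top - bottom) / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical buckets: log2(human time in minutes) vs. AI success rate.
log_time = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
success = np.array([0.95, 0.9, 0.8, 0.75, 0.5, 0.4, 0.2, 0.1, 0.0])

# Negative initial slope, since success falls as human time rises.
p2, _ = curve_fit(sigmoid2, log_time, success, p0=[4.0, -1.0])
p4, _ = curve_fit(sigmoid4, log_time, success, p0=[4.0, -1.0, 1.0, 0.0])
print("2-param (x0, k):", p2)
print("4-param (x0, k, top, bottom):", p4)
```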
(Wow, graphing is so fun when I don’t have to remember matplotlib commands. TBC I’m not really checking the language models’ work here other than assessing consistency and reasonableness of output, so discount depending on how much you trust them to graph things correctly in METR’s repo.)
Here R (the square root of R²) is Pearson correlation, which checks for linear association. The better measure here would be to use Spearman correlation on the original data, which checks for any monotonic association. Spearman is more principled than trying to transform the data first with some monotonic function (e.g. various sigmoids) before applying Pearson.
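(Toy example of the difference, with made-up data: a perfectly monotonic but nonlinear relationship gets a perfect Spearman score, while Pearson penalizes the curvature.)

```python
# Pearson vs. Spearman on a relationship that is perfectly monotonic
# but not linear. Illustrative only; the data here is made up.
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0.0, 8.0, 20)
y = 1.0 / (1.0 + np.exp(x - 4.0))  # decreasing sigmoid of x

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(f"Pearson r:  {r_pearson:.3f}")   # noticeably above -1
print(f"Spearman r: {r_spearman:.3f}")  # exactly -1.000
```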
Hm.
R² = 1 − (mean squared error / variance)
Mean squared error seems pretty principled. Normalizing by variance to make it more comparable to other distributions seems pretty principled.
I guess after that it would seem more natural to take the square root (to get RMSE normalized by standard deviation) than to subtract the ratio from 1. But I guess the latter is a simple enough transformation and makes it comparable to the (better-motivated) R² for linear models, and is therefore more commonly reported than RMSE/SD.
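(Tiny numeric check with made-up numbers, just to make the relationship concrete: RMSE/SD is √(MSE/variance), i.e. √(1 − R²).)

```python
# Numeric check of the identities above, with made-up numbers:
# R² = 1 - MSE/Var(y), and RMSE/SD = sqrt(MSE/Var(y)) = sqrt(1 - R²).
import numpy as np

y = np.array([0.9, 0.7, 0.5, 0.3, 0.1])           # observed values
y_hat = np.array([0.85, 0.75, 0.45, 0.35, 0.05])  # some fit's predictions

mse = np.mean((y - y_hat) ** 2)
var = np.var(y)  # population variance, matching the MSE convention

r_squared = 1.0 - mse / var
rmse_over_sd = np.sqrt(mse / var)
print(f"R² = {r_squared:.3f}")          # 0.969
print(f"RMSE/SD = {rmse_over_sd:.3f}")  # 0.177 = sqrt(1 - 0.969)
```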
Anyway, Spearman r is −0.903 (square 0.82) and −0.710 (square 0.50), so basically the same.