My big problem with METR time horizons as a useful metric is that they start to break down exactly when things get interesting, after the 1-hour mark. I think this is because the benchmark is based on the pool of [people available to METR via friend/professional networks and also randoms from TaskRabbit], and there aren’t enough people in that pool with time horizons much longer than 1 hour to set a consistent metric.
I think there are actual human beings who can complete actual 16 hour+ time horizon tasks, but people like that can’t be found on TaskRabbit, and are instead doing cutting-edge research or making bank at Jane Street.
Conclusion: an 8 hour time horizon shouldn’t actually impress you much more than a 1 hour time horizon, since the benchmark breaks down right around that point due to a lack of humans with consistent 1 hour+ time horizons to serve as a baseline. It would definitely help if METR improved their human baseliner selection methodology.
The baseliners are described in the paper as “skilled professionals in software engineering, machine learning, and cybersecurity, with the majority having attended world top-100 universities. They have an average of about 5 years of relevant experience.” (pg. 7)
source: arXiv:2503.14499