Lao Mein comments on Lao Mein’s Shortform

Lao Mein 13 Feb 2026 5:44 UTC
3 points
−3
My big problem with METR time horizons as a useful metric is that they start to break down exactly when things get interesting, after the 1 hour mark. I think this is because the benchmark is based on the pool of [people available to METR via friend/professional networks and also randoms from Task Rabbit], and there aren’t enough people in that pool with time horizons much longer than 1 hour to set a consistent metric.
source:2503.14499
I think there are actual human beings who can complete actual 16 hour+ time horizon tasks, but people like that can’t be found on Task Rabbit, and are instead doing cutting-edge research or making bank at Jane Street.
Conclusion: an 8 hour time horizon shouldn’t actually impress you much more than a 1 hour time horizon, since the benchmark breaks down right around that point due to a lack of humans with consistent 1 hour+ time horizons to use as baselines. It would definitely help if METR improved their human baseliner selection methodology.
- Jacobson 13 Feb 2026 18:42 UTC
  3 points
  0
  The baseliners are described in the paper as
  “skilled professionals in software engineering, machine learning, and cybersecurity, with the majority having attended world top-100 universities. They have an average of about 5 years of relevant experience.”
  (pg. 7)