Really pleased to see this published! Probably the single standout limitation of the first paper was its restriction to a particular domain/benchmark (a caveat that unfortunately got lost in a lot of lossy discussion of the paper). This fills that gap admirably!
I’d be interested in the same analysis but with a ‘time’ axis ranging over some metric of standardised cost or scale, rather than raw date. Likely sensible choices would be training compute, runtime compute, or some appropriate combination of the two.
Is there an easy way to grab the collected model success rates and human time-to-complete data, in case I get round to doing this myself?
Yes, it’s in the repo, in the data/scores and data/benchmarks folders.
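In case it helps anyone else who wants to poke at this, here’s a minimal sketch of loading that data, assuming the folders contain CSV files; the filenames and column names are assumptions on my part, so check the actual files in data/scores and data/benchmarks for the real schema.

```python
# Minimal sketch: load the published data, assuming CSV files.
# Column names mentioned in comments (e.g. success rate, human
# time-to-complete) are assumptions, not the repo's actual schema.
from pathlib import Path

import pandas as pd


def load_folder(folder: str) -> pd.DataFrame:
    """Concatenate every CSV in a folder into one DataFrame."""
    frames = [pd.read_csv(path) for path in sorted(Path(folder).glob("*.csv"))]
    return pd.concat(frames, ignore_index=True)


scores = load_folder("data/scores")          # per-model success rates (assumed)
benchmarks = load_folder("data/benchmarks")  # human time-to-complete data (assumed)

print(scores.head())
print(benchmarks.head())
```

Note that redoing the analysis with a compute axis would also require per-model training- or runtime-compute estimates, which (as far as I know) aren’t in the repo and would need to be merged in from an external source.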