Really pleased to see this published! Probably the single standout limitation of the first paper was its restriction to a particular domain/benchmark (a caveat that unfortunately got lost in a lot of lossy discussion of the paper). This fills that gap admirably!
I’d be interested in the same analysis but with a ‘time’ axis ranging over some metric of standardised cost or scale, rather than raw date. Likely sensible choices would be training compute, runtime compute, or some appropriate combination of the two.
Is there an easy way to grab the collected model success rates and human time-to-complete data, in case I get round to doing this myself?
Yes, it’s in the repo, in the data/scores and data/benchmarks folders.
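In case it helps anyone else who wants to poke at this, here’s a minimal sketch of loading that data, assuming the folders contain CSV files; the filenames and column names are assumptions on my part, so check the actual files in data/scores and data/benchmarks for the real schema.

```python
# Minimal sketch: load the published data, assuming CSV files.
# Column names mentioned in comments (e.g. success rate, human
# time-to-complete) are assumptions, not the repo's actual schema.
from pathlib import Path

import pandas as pd


def load_folder(folder: str) -> pd.DataFrame:
    """Concatenate every CSV in a folder into one DataFrame."""
    frames = [pd.read_csv(path) for path in sorted(Path(folder).glob("*.csv"))]
    return pd.concat(frames, ignore_index=True)


scores = load_folder("data/scores")          # per-model success rates (assumed)
benchmarks = load_folder("data/benchmarks")  # human time-to-complete data (assumed)

print(scores.head())
print(benchmarks.head())
```

Note that redoing the analysis with a compute axis would also require per-model training- or runtime-compute estimates, which (as far as I know) aren’t in the repo and would need to be merged in from an external source.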