I do wonder why the SWE-Bench and METR benchmarks are taken as THE best indicators of progress. SWE-Bench is a single benchmark that captures only a small fraction of real-world software engineering, and METR themselves have published work showing that it covers narrow, well-scoped algorithmic tasks rather than software engineering holistically. Benchmarks tell only a partial story, and once labs start optimizing for them, scores stop tracking the underlying capability, a textbook case of Goodhart's law; extrapolating long-range predictions from such limited benchmarks only compounds that error. Real-world impact from AI on software engineering is much smaller than progress on benchmarks such as SWE-Bench would imply.