fwiw they focused only on very narrow checks for external validity—mostly just on whether solutions were possible to brute force
This seems inaccurate to me (for transparency, I am an author on the paper). Here’s the introduction to the external validity and robustness section of the paper:
To investigate the applicability of our results to other benchmarks, and to real task distributions, we performed four supplementary experiments. First, we check whether the 2023–2025 trend without the SWAA dataset retrodicts the trend since 2019, and find that the trends agree. Second, we label each of our tasks on 16 “messiness” factors—factors that we expect to (1) be representative of how real-world tasks may systematically differ from our tasks and (2) be relevant to AI agent performance. Third, we calculate AI agent horizon lengths from SWE-bench Verified tasks. We find a similar exponential trend, although with a shorter doubling time. However, we believe this shorter doubling time to be a result of SWE-bench Verified time annotations differentially underestimating the difficulty of easier SWE-bench tasks. Finally, we collect and baseline a small set of uncontaminated issues from internal METR repositories. We find that our contracted human baseliners take much longer to complete these tasks than repository maintainers. We also find that AI agent performance is worse than would be predicted by maintainer time-to-complete but is consistent with contractor time-to-complete, given the AI agent success curves from HCAST + SWAA + RE-Bench tasks shown in Figure 5.
Sorry, looking again at the messiness factors, fewer of them are about brute force than I remembered; will edit.
But they do indeed all strike me as quite narrow external validity checks, given that the validity in question is whether the benchmark predicts when AI will gain world-transforming capabilities.
“messiness” factors—factors that we expect to (1) be representative of how real-world tasks may systematically differ from our tasks
I felt very confused reading this claim in the paper. Why do you think they are representative? It seems to me that real-world problems obviously differ systematically from these tasks in ways the factors don’t capture, too; e.g., solving them often requires having novel thoughts.