My interpretation of the METR results is that they are an empirical observation of a trend that seems robust, in the same way scaling laws are. You could write the same post about why there’s no robust first-principles reason that “cross-entropy loss decreases with scale in a way that correlates in an important, predictably useful way with an absurdly wide range of downstream tasks”.
The METR paper itself is almost entirely about justifying the empirical prediction aspect, not about making a first-principles argument for the approach from a theoretical perspective. I think the robustness of this analysis is why the paper had the impact it did. Are there specifics of the statistical analysis they did for the stuff around:
Since our tasks do not perfectly represent the average segment of intellectual labor by researchers and software engineers, this raises the question of external validity (Section 6): whether the exponential trend holds on real-world tasks. We include results from three supplementary external validity experiments.
That you think are sufficient to meaningfully change how people should interpret the validity of the overall predicted trend?
I’m not Adam, but my response is “No”, based on the description Megan copied in thread and skimming some of the paper. It’s good that the paper includes those experiments, but they don’t really speak to the concerns Adam is discussing. Those concerns, as I see it (I could be misunderstanding):
1. Conceptual coherence: in humans there are different skills, e.g., across different fields, that don’t seem to project easily onto a time-horizon dimension. Put differently, our sense of how much intelligence they require, or of how difficult they are, does not correspond all that closely with the time taken to do them.
2. Benchmark bias: the solution criteria are known and the progress criteria are often known; it is a big jump from that to the scary real-world things we’re worried about.
Do the experiments in Sec 6 deal with this?
a. No SWAA (“Retrodiction from 2023–2025 data”): Does not deal with 2. Mostly does not deal with 1, as both HCAST + RE-Bench and All-3 are mostly software-engineering dominated, with a little bit of other stuff.
b. Messiness factors: Does not speak to 1. This is certainly relevant to 2, but I don’t think it’s conclusive. Quoting some of the paper:
We rated HCAST and RE-Bench tasks on 16 properties that we expected to be 1) representative of how real world tasks might be systematically harder than our tasks and 2) relevant to AI agent performance. Some example factors include whether the task involved a novel situation, was constrained by a finite resource, involved real-time coordination, or was sourced from a real-world context. We labeled RE-bench and HCAST tasks on the presence or absence of these 16 messiness factors, then summed these to obtain a “messiness score” ranging from 0 to 16. Factor definitions can be found in Appendix D.4.
The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8⁄16. For comparison, a task like ’write a good research paper’ would score between 9⁄16 and 15⁄16, depending on the specifics of the task.
On HCAST tasks, AI agents do perform worse on messier tasks than would be predicted from the task’s length alone (b = -0.081, R² = 0.251) …
However, trends in AI agent performance over time are similar for lower and higher messiness subsets of our tasks.
This seems like very weak evidence in favor of the hypothesis that Benchmark Bias is a big deal. But they just don’t have very messy tasks.
c. SWE-Bench Verified: doesn’t speak to 1 or 2.
d. Internal PR experiments: Maybe speaks a little to 1 and 2, because these are more real-world tasks that are closer to the thing we care about, but not much, as they’re still clearly verifiable and still software engineering.
I do think Thomas and Vincent’s follow-up work here on time horizons for other domains is useful evidence pointing a little against the conceptual coherence objection. But only a little.
I guess my understanding is more that the conceptual coherence objection isn’t an objection to the predictive accuracy of the trend, which is why I had brought up the scaling law / pretraining loss / downstream task analogy.
As far as I understand, messiness bears on the Benchmark Bias objection when predicting performance at any given point in time, but not on the actual trend, given that the trend was similar for lower- and higher-messiness tasks.
Is your intuition that the trend is significantly (like more than their CI) wrong as well? Or that it’s just the performance prediction at a given point in time? Or is the question ill-formed / undefined?
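To make the level-versus-trend distinction concrete, here is a minimal sketch using made-up numbers (not METR’s data). It assumes horizons grow exponentially and that messier tasks only scale the horizon down by a constant factor; under that assumption the fitted doubling time is unchanged even though the predicted level at any date is lower. The function names and the 7-month and 4x figures are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def doubling_time_months(years, horizons_minutes):
    """Doubling time implied by a log-linear fit of horizon against time."""
    log_h = [math.log2(h) for h in horizons_minutes]
    _, doublings_per_year = fit_line(years, log_h)
    return 12.0 / doublings_per_year

# Hypothetical data: horizons double every 7 months on "clean" tasks.
years = [0.0, 0.5, 1.0, 1.5, 2.0]
clean = [4.0 * 2 ** (12 * t / 7) for t in years]
# Assumption: messiness lowers the level by a constant 4x, nothing else.
messy = [h / 4 for h in clean]

print(round(doubling_time_months(years, clean), 3))
print(round(doubling_time_months(years, messy), 3))
```

Both fits recover the same doubling time, which is the sense in which a similar trend across messiness subsets says little about the level on messier tasks.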
We care about the performance prediction at a given point in time for skills like “take over the world”, “invent new science”, and “do RSI” (and “automate AI R&D”, which I think the benchmark does speak to). We would like to know when those skills will be developed.
In the frame of this benchmark, and Thomas and Vincent’s follow-up work, it seems like we’re facing down at least three problems:
1. The original time horizons tasks are clearly out of the distribution we care about. Solution: create a new task suite we think is the right distribution.
2. We don’t know how well time horizons will do at predicting future capabilities, even in this domain. Solution: keep collecting new data as it comes out in order to test predictions on whatever distributions we have, examine things like the conceptual coherence objection and try to make progress.
3. We don’t know how well the general “time horizons” approach works across domains. We have some data on this in the follow-up work, maybe it’s a 2:1 update from a 1:1 prior?
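One way to read “a 2:1 update from a 1:1 prior” is as a likelihood ratio of 2 applied to even prior odds. The short sketch below works that arithmetic through; the helper names are hypothetical and the reading itself is an assumption about what the comment means.

```python
from fractions import Fraction

def update_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds in favor (as a ratio) into a probability."""
    return odds / (1 + odds)

prior = Fraction(1, 1)                          # 1:1, i.e. 50%
posterior = update_odds(prior, Fraction(2, 1))  # evidence favors the claim 2:1
print(posterior, odds_to_probability(posterior))  # 2 2/3
```

On this reading the follow-up work moves the claim from 50% to about 67%, which is a real but modest shift.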
So my overall take is that the current work I’m aware of tells us:
A small positive update on time horizons being predictive at all.
A small positive update on the specific software engineering trends being predictive within distribution.
A small positive update on “time horizons” being common across different reasonable and easy-to-define distributions, and on “doubling time in the single-digit months” being the rate of time horizon increase across many domains.
A small negative update on the specific time horizon length from one task distribution generalizing to other task distributions (maybe an update, tbh the prior is much lower than 50⁄50). So it tells us approximately nothing about the performance prediction at a given point in time for the capabilities I care about.
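The gap between knowing the doubling time and knowing the arrival date can be sketched as follows. This assumes a clean exponential trend, and every number (the 7-month doubling time, the 160-hour target, the 16x gap between distributions) is a hypothetical placeholder rather than an estimate from the paper.

```python
import math

def months_until(target_hours, current_hours, doubling_months):
    """Months until the horizon reaches target_hours, assuming it doubles
    every doubling_months starting from current_hours."""
    return doubling_months * math.log2(target_hours / current_hours)

doubling = 7.0         # hypothetical doubling time, in months
target = 160.0         # hypothetical target horizon, in hours

# Same trend, two guesses about today's horizon on the distribution we care about.
optimistic = months_until(target, 1.0, doubling)
pessimistic = months_until(target, 1.0 / 16, doubling)

# A 16x lower starting level costs exactly four extra doublings.
print(pessimistic - optimistic)
```

Under these assumptions, a 16x disagreement about the current level shifts the forecast by 28 months even when the doubling time is known exactly, which is why a shared trend alone does not pin down timing for out-of-distribution capabilities.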
I think there is more empirical evidence of robust scaling laws than of robust horizon length trends, but broadly I agree—I think it’s also quite unclear how scaling laws should constrain our expectations about timelines.
(Not sure I understand what you mean about the statistical analyses, but fwiw they focused only on very narrow checks for external validity—mostly just on whether solutions were possible to brute force).
fwiw they focused only on very narrow checks for external validity—mostly just on whether solutions were possible to brute force
This seems inaccurate to me. Here’s the introduction to the external validity and robustness section of the paper:
To investigate the applicability of our results to other benchmarks, and to real task distributions, we performed four supplementary experiments. First, we check whether the 2023–2025 trend without the SWAA dataset retrodicts the trend since 2019, and find that the trends agree. Second, we label each of our tasks on 16 “messiness” factors—factors that we expect to (1) be representative of how real-world tasks may systematically differ from our tasks and (2) be relevant to AI agent performance. Third, we calculate AI agent horizon lengths from SWE-bench Verified tasks. We find a similar exponential trend, although with a shorter doubling time. However, we believe this shorter doubling time to be a result of SWE-bench Verified time annotations differentially underestimating the difficulty easier SWE-bench tasks. Finally, we collect and baseline a small set of uncontaminated issues from internal METR repositories. We find that our contracted human baseliners take much longer to complete these tasks than repository maintainers. We also find that AI agent performance is worse than would be predicted by maintainer time-to-complete but is consistent with contractor time-to-complete, given the AI agent success curves from HCAST + SWAA + RE-Bench tasks shown in Figure 5.
Sorry, looking again at the messiness factors, fewer are about brute force than I remembered; will edit.
But they do indeed all strike me as quite narrow external validity checks, given that the validity in question is whether the benchmark predicts when AI will gain world-transforming capabilities.
“messiness” factors—factors that we expect to (1) be representative of how real-world tasks may systematically differ from our tasks
I felt very confused reading this claim in the paper. Why do you think they are representative? It seems to me that real-world problems obviously differ systematically from these factors, too—e.g., solving them often requires having novel thoughts.
(For transparency, I am an author on the paper)