I also applaud the effort to interrogate the underlying data. I have also been dismayed at people hanging dramatic updates off (what usually should be?) 1-few bits of surprisal. (I don’t think METR can be fairly blamed for others ~hunting noise in the ‘last’ datapoint—the CIs are clearly printed on the graph.)
Per other comments, I think the more theoretical worries in the OP miss the mark: you should end up with something like a logistic curve if task length is unbounded but success probability is confined to (0, 1), and taking logs does a fairly good job of linearizing the data (although at least for Sonnet 3.7 the fit collapses in the 2hr+ region, and eyeballing the other histograms suggests this might generalize).
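To make the "logistic in log task length" picture concrete, here is a minimal sketch. The intercept and slope are made-up illustrative values, not METR's fitted parameters, and `success_prob` and `horizon` are hypothetical helper names.

```python
import math

def success_prob(minutes: float, intercept: float, slope: float) -> float:
    """Logistic success curve in log2(task length); slope > 0 makes longer tasks harder."""
    z = intercept - slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))

def horizon(p: float, intercept: float, slope: float) -> float:
    """Task length (minutes) at which the curve predicts success probability p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((intercept - logit) / slope)

# Hypothetical parameters, chosen only for illustration.
INTERCEPT, SLOPE = 3.0, 0.5

h50 = horizon(0.5, INTERCEPT, SLOPE)  # 2 ** (3.0 / 0.5) = 64 minutes
```

The success probability stays strictly inside (0, 1) however long the task gets, and the "50% time horizon" is just the point where the curve crosses one half.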
Yet I think they may be in the right neighbourhood of a ‘construct validity’ worry around time horizons. In precis (hopefully a full post someday):
Unlike (e.g.) ‘how fast can you run?’ or ‘how much can you lift?’, there’s seldom a handy cardinal scale for intellectual performance: IQ = 0 does not mean ‘zero intelligence’, nor does your having double my chess Elo mean you are twice as good at chess as I am. (Even if you’re happy not having a meaningful zero, meaningful interval scales don’t exist either.)
Besides issues of general overprediction, it seems hard to tell how meaningful a Δ increment on benchmark X is. The function from ‘benchmark score’ to ‘irl importance’ (or ‘AI capabilities’) could be almost anything monotonic: from “any nonzero score is a cataclysmic breakthrough (but any further increment matters little on the margin)”, to “long march through the ‘nines’ (so all scores <99.9% are ~equally worthless)”, and everything in between.
Hence the utility of METR’s time horizons as a (/the only?) cardinal measure: ‘doubling’ is meaningful, and (if treated as a proxy for ‘AI capabilities in general’, as it often is, and I suspect more often than METR would like) it shows a broad trend of exponentially increasing capabilities over the last few years. (+/- discourse over whether recent data points indicate even more dramatic acceleration, ‘hitting the wall’, etc.)
What is load-bearing for this account is the essentially exponential transformation from ‘raw’ scores on HCAST etc. to time horizons. Per OP (and comments), you can get a similar plot with just the raw scores, and it is largely the transformation from those to time horizons which gives (e.g.) Opus 4.5, scoring 75%, ~double the time horizon of GPT5 (70%), or ~treble the time horizon of o3 (66%). If the y-axis of the figure were instead “composite accuracy (SWAA+HCAST+REBench)”, the figure might be grist for the mill of folks like Gary Marcus: “A whole year of dramatically increasing investment and computation, and all it got you was another 10%.”
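A toy sketch of why a few points of raw score can become a multiple on the horizon. The assumptions are mine and deliberately crude: task lengths are spread evenly in log space between `t_min` and `t_max`, and the model passes every task shorter than its horizon and fails every longer one (a step function, not the logistic fit actually used). Under those assumptions the raw score is the fraction of the log range covered, so the horizon is `t_min * (t_max / t_min) ** score`. The range below is hypothetical and picked so that five points equals one doubling; it is not METR's task range.

```python
def toy_horizon(score: float, t_min: float, t_max: float) -> float:
    """Horizon implied by a raw score under the toy step-function assumptions."""
    return t_min * (t_max / t_min) ** score

# Hypothetical task-length range in seconds (about 1 second to about 12 days).
T_MIN, T_MAX = 1.0, 2.0 ** 20

# Same 5-point gain at two different places on the score scale.
ratio_high = toy_horizon(0.75, T_MIN, T_MAX) / toy_horizon(0.70, T_MIN, T_MAX)
ratio_low = toy_horizon(0.25, T_MIN, T_MAX) / toy_horizon(0.20, T_MIN, T_MAX)
```

In this toy the horizon ratio depends only on the score difference and on how many orders of magnitude the task lengths span, which is the sense in which the transformation, rather than the raw scores, carries the exponential.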
It goes without saying METR didn’t simply stipulate “linear score improvement = exponentially increasing time horizons”: it arose from a lot of admirable empirical work demonstrating that human completion time is roughly log-distributed.
But at least when taken as the colloquial byword for AI capabilities, this crucial contour feels a bit too mechanistic to me. I take it that being able to generalise the technique widely to other benchmarks deepens rather than alleviates this concern: if human benchmarking exercises would give log-distributed horizons across the items in many (/most?) benchmarks, such that progressive linear increments in model performance would give a finding of exponentially improving capabilities, maybe too much is being proven.
Taking the horizons (and their changes) literally has dubious face validity by my lights:
It doesn’t seem to me the frontier has gotten ~3x more capable over this year, and although I’m no software engineer, it doesn’t look from the outside like e.g. Opus 4.5 is 2x better at SWE than Opus 4.1, etc.
Presumably we could benchmark humans against the time horizons (IIRC not everyone used in the benchmarking could successfully complete the task), or at least against the benchmarks from which time horizons could be imputed. I’d at least be doubtful our best guess should be that Alice (who cracks 75%) is 3x the SWE of Bob, who hits 65%, etc.
That said, given our grasp of the ‘true cardinal scale of intellect’ is murky—or fictitious—even if my vibes are common, it looks reasonable to deny them rather than the facially contradicting data.
Perhaps the underlying moral of the jagged frontier is that there isn’t some crisp (at least crisp + practically accessible) measure out there of ‘general intelligence’ (or even of general intelligence as applied to a particular domain: cf. ‘twice as good at chess’), and we should focus on metrics specific to whatever real-world impact we are interested in (maybe for ‘AI generally’, just trend extrapolate from ‘economy generally’?). But if the story of benchmarks over the last while is that they are missing whatever intellectual dark matter intervenes between ‘benchmark assessing X’ and ‘actually Xing’, maybe you can’t derive sturdy synthetic y-axis yardsticks from their distorted timber: the transfer function from ‘time horizon’ to ‘irl importance’ is as much of a “??” as it was for the original benchmarks.
It goes without saying METR didn’t simply stipulate “linear score improvement = exponentially increasing time horizons”: it arose from a lot of admirable empirical work demonstrating that human completion time is roughly log-distributed.
Not sure I agree with this: we constructed the benchmark to span a wide range, and the empirical work was mostly to show that the model success rate curve was logistic in human time.
But at least when taken as the colloquial byword for AI capabilities, this crucial contour feels a bit too mechanistic to me. I take it that being able to generalise the technique widely to other benchmarks deepens rather than alleviates this concern: if human benchmarking exercises would give log-distributed horizons across the items in many (/most?) benchmarks, such that progressive linear increments in model performance would give a finding of exponentially improving capabilities, maybe too much is being proven.
This isn’t always the case. What matters for a benchmark is a range that spans many orders of magnitude, not an exactly log-distributed one. In that report, we saw that many benchmarks spanned narrower ranges than the METR task suite, so there isn’t enough data to show that linear increments in model performance translate into exponentially improving horizons. In others, task length was poorly correlated with difficulty for models. But taken together, the fact that models went from 0% to saturating MATH500, then the AIME, and now the IMO points to a dramatic increase in capabilities.
I also think our everyday experience points to something like an exponential relationship between an intuitive “task complexity rating” and human time. It’s natural to think of a level n+1 task as being decomposable into several level n tasks (e.g. get groceries = drive to the store, shop, drive back from the store), which naturally gives you an exponential.
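As a quick sketch of the decomposition argument: if every level n+1 task is a fixed number of level n tasks, human time grows geometrically with the level. The branching factor of 3 and the 5-minute base task are illustrative assumptions, and `human_time` is a hypothetical helper.

```python
def human_time(level: int, branching: int, base_minutes: float) -> float:
    """Minutes for a level-n task when each level-(n+1) task is `branching` level-n tasks."""
    if level == 0:
        return base_minutes
    return branching * human_time(level - 1, branching, base_minutes)

# Hypothetical: 3 subtasks per level, 5 minutes for a level-0 task.
times = [human_time(n, 3, 5.0) for n in range(5)]  # [5.0, 15.0, 45.0, 135.0, 405.0]
```

Each step up in "complexity rating" multiplies the time by the branching factor, so equal steps in the rating are equal steps in log time.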
It doesn’t seem to me the frontier has gotten ~3x more capable over this year, and although I’m no software engineer, it doesn’t look from the outside like e.g. Opus 4.5 is 2x better at SWE than Opus 4.1, etc.
Some of this is probably the benchmark being unrepresentative of real tasks, but it’s not clear why an agent with 2x the time horizon should feel 2x better. When using an assistant with double the time horizon on real tasks, you need to intervene half as much, but each intervention takes you longer, since it’s written about double the code, fails in more complicated ways, and you have less understanding of what it’s doing. Combined with Amdahl’s law effects, I wouldn’t be surprised if doubling the time horizon only causes a speedup of 1.2x or so on average on tasks much longer than the agent’s time horizon, which are still most of them.
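A back-of-envelope version of that arithmetic, with made-up parameters. I am assuming the human's baseline time splits into a share the agent does not help with at all (the Amdahl's law part) and a share spent on interventions; doubling the horizon halves the number of interventions but makes each one more expensive by some factor. `speedup_from_doubling` and both parameter values are hypothetical.

```python
def speedup_from_doubling(untouched_fraction: float, cost_growth: float) -> float:
    """Overall speedup when the agent's time horizon doubles, under toy assumptions.

    untouched_fraction: share of baseline human time the agent does not help with.
    cost_growth: factor by which each intervention gets more expensive (1.0 = no growth).
    """
    assisted = 1.0 - untouched_fraction
    # Half as many interventions, each cost_growth times as expensive.
    new_assisted = assisted * cost_growth / 2.0
    return 1.0 / (untouched_fraction + new_assisted)

naive = speedup_from_doubling(0.0, 1.0)   # 2.0: no untouched work, no cost growth
hedged = speedup_from_doubling(0.5, 1.4)  # 1 / (0.5 + 0.35), roughly 1.18
```

The naive 2x only appears when nothing is untouched and interventions do not get harder; with half the time untouched and interventions 1.4x as costly, the speedup drops to the neighbourhood of 1.2x.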