Edit: I’ve played with the numbers a bit more, and on reflection, I’m inclined to partially unroll this update. o3 doesn’t break the trendline as much as I’d thought, and in fact, it’s basically on-trend if we remove the GPT-2 and GPT-3 data-points (which I consider particularly dubious).
Regarding METR’s agency-horizon benchmark:
I still don’t like anchoring stuff to calendar dates, and I think the o3/o4-mini datapoints perfectly show why.
It would be one thing if they did fit into the pattern: if, by some divine will controlling the course of our world’s history, OpenAI’s semi-arbitrary decision about when to allow METR’s researchers to benchmark o3 had just so happened to coincide with the 2x/7-month model. But it didn’t: o3 massively overshot that model.[1]
Imagine a counterfactual in which METR’s agency-horizon model existed back in December, and OpenAI invited them for safety testing/benchmarking then, four months sooner. How different would the inferred agency-horizon scaling laws have been, how much faster the extrapolated progress? Let’s run it:
o1 was announced September 12th, o3 was announced December 19th, 98 days apart.
o1 scored at ~40 minutes, o3 at ~1.5 hours, a 2.25x’ing.
There’s ~2.14 intervals of 98 days in 7 months.
Implied scaling factor: 2.25^2.14 ≈ 5.67 each 7 months.
And I don’t see any reason to believe it was overdetermined that this counterfactual wouldn’t have actualized. METR could have made the benchmark a few months earlier, or OpenAI could have been more open about benchmarking o3.
And if we lived in that possible world… It’s now been 135 days since December 19th, i.e., ~1.38 intervals of 98 days. Extrapolating, we should expect the best publicly known model to have a time horizon of 1.5 hours × 2.25^1.38 ≈ 4.59 hours. I don’t think we have any hint that such models exist.
So: in that neighbouring world in which OpenAI let METR benchmark o3 sooner, we’re looking around and seeing that the progress is way behind the schedule.[2]
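For concreteness, here’s that arithmetic as a quick script. The 40-minute and 1.5-hour horizons are the rounded figures used above (not METR’s exact measurements), and the “today” date is just back-computed from the 135-day figure:

```python
# Sketch of the counterfactual arithmetic above (rounded figures, not METR's exact numbers).
from datetime import date

o1_announced = date(2024, 9, 12)
o3_announced = date(2024, 12, 19)
gap_days = (o3_announced - o1_announced).days        # 98 days
jump = 1.5 / (40 / 60)                               # 40 min -> 1.5 h, a 2.25x'ing

# Implied scaling factor per 7 months (~210 days):
intervals_per_7mo = 7 * 30 / gap_days                # ~2.14
implied_factor = jump ** intervals_per_7mo           # ~5.7x each 7 months

# Extrapolation from o3's announcement to "today" (135 days later):
today = date(2025, 5, 3)                             # back-computed from the 135-day figure
intervals_elapsed = (today - o3_announced).days / gap_days   # ~1.38
expected_horizon = 1.5 * jump ** intervals_elapsed   # ~4.6 hours
print(gap_days, round(implied_factor, 2), round(expected_horizon, 2))
```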
To me, this makes the whole model fall apart. I don’t see how it can track any mechanistic, model-based picture of what’s actually happening, and, as the o3/o4-mini data points show, it doesn’t predict the empirical reality well either. Further, whether we believe that progress is much faster vs. much slower than expected is entirely controlled by the arbitrary fact that METR didn’t get to benchmark o3 in December.
I think we’re completely at sea.
o3’s datapoint implies a 4x/7-month model, no? Correct me if I’m wrong:
Sonnet 3.7 was released 24th of February, 2025; o3’s System Card and METR’s reports were released 16th of April, 2025: 51 days apart.
Sonnet 3.7 is benchmarked as having 1-hour agency; o3 has 1.5x that, ~1.5-hour agency.
7 months contain 3.5 two-month intervals. This means that, if horizons extend as fast as they did between 3.7 and o3, we should expect a 1.5^3.5 ≈ 4.13x’ing of agency horizons each 7 months.
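The same arithmetic for this pair, as a quick sanity check (again using the rounded 1-hour and 1.5-hour horizons quoted above):

```python
# Sketch of the Sonnet 3.7 -> o3 arithmetic above (rounded figures from this comment).
from datetime import date

gap_days = (date(2025, 4, 16) - date(2025, 2, 24)).days   # 51 days
jump = 1.5 / 1.0                                          # ~1-hour -> ~1.5-hour horizon

# Rounding the 51-day gap up to two months, as above, gives 3.5 intervals per 7 months:
implied_factor = jump ** 3.5                              # ~4.13x each 7 months
print(gap_days, round(implied_factor, 2))
```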
Edit: Yes, counterfactual!METR wouldn’t have used just those two last data points, so the inferred multiplier would’ve been somewhat less than that. But I think it would’ve still been bigger than 2x/7-months, and the graph would’ve been offset to the left (the 1.5-hour performance achieved much earlier), so we’d still be overdue for ~2.5-hour AIs. Half-a-year behind, I think?
Do you also dislike Moore’s law?
I agree that anchoring stuff to release dates isn’t perfect because the underlying variable of “how long does it take until a model is released” is variable, but I think this variability is sufficiently low that it doesn’t cause that much of an issue in practice. The trend is only going to be very solid over multiple model releases, and it won’t reliably time things to within 6 months, but that seems fine to me.
I agree that if you add one outlier data point and then trend extrapolate between just the last two data points, you’ll be in trouble, but fortunately, you can just not do this and instead use more than 2 data points.
This also means that I think people shouldn’t update that much on the individual o3 data point in either direction. Let’s see where things go for the next few model releases.
That one seems to work more reliably, perhaps because it became the metric the industry aims for.
My issue here is that there wasn’t that much variance in the performance of all preceding models they benchmarked: from GPT-2 to Sonnet 3.7, they seem to almost perfectly fall on the straight line. Then, the very first advancement of the frontier after the trend-model is released is an outlier. That suggests an overfit model.
I do agree that it might just be a coincidental outlier and that we should wait and see whether the pattern recovers with subsequent model releases. But this is suspicious enough I feel compelled to make my prediction now.
The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.
Also, there is lots of noise in a time horizon measurement, and it only displays any sort of pattern because we measured over many orders of magnitude and years. It’s not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly change by a couple of months or a factor of 2 in time horizon:
Release schedules could be altered
A model could be overfit to our dataset
One model could play less well with our elicitation/scaffolding
One company could be barely at the frontier, and release a slightly-better model right before the leading company releases a much-better model.
All of these factors are averaged out if you look at more than 2 models. So I prefer to see each model as evidence of whether the trend is accelerating or slowing down over the last 1-2 years, rather than an individual model being very meaningful.
Fair, also see my un-update edit.
Have you considered removing GPT-2 and GPT-3 from your models, and seeing what happens? As I’d previously complained, I don’t think they can be part of any underlying pattern (due to the distribution shift in the AI industry after ChatGPT/GPT-3.5). And indeed: removing them seems to produce a much cleaner trend with a ~130-day doubling.
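A minimal sketch of how one could check this: fit log2(time horizon) against release date and read off the doubling time, with and without the GPT-2/GPT-3 points. The `doubling_time_days` helper and the synthetic points below are mine, purely for illustration; plug in METR’s published (release date, horizon) pairs to compare against the ~130-day figure:

```python
# Minimal sketch: fit log2(time horizon) against release date, report doubling time.
# The synthetic points below are illustrative only; substitute METR's published
# (release date, horizon) pairs, with or without GPT-2/GPT-3, to compare trends.
from datetime import date
import numpy as np

def doubling_time_days(points):
    """points: list of (release_date, horizon_minutes); returns fitted doubling time in days."""
    t0 = points[0][0]
    days = np.array([(d - t0).days for d, _ in points], dtype=float)
    log2_horizon = np.log2([h for _, h in points])
    slope, _ = np.polyfit(days, log2_horizon, 1)   # doublings per day
    return 1.0 / slope

# Sanity check on synthetic data constructed to double every 130 days:
synthetic = [(date(2023, 1, 1), 1.0), (date(2023, 5, 11), 2.0), (date(2023, 9, 18), 4.0)]
print(round(doubling_time_days(synthetic)))        # -> 130
```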
For what it’s worth, the model showcased in December (then called o3) seems to be completely different from the model that METR benchmarked (now called o3).