Bayesians are updating too much on AI capability speed from this data point, given:
The CI is extremely wide, and METR gives its own caveats about task sparsity at the higher horizons.
A jump of this size relative to the previous Opus 4.1 or Opus 4 is inconsistent with the 80%-success-threshold horizon, the accuracy level, and other key benchmarks that should correlate with capability (ECI, SWE-bench bash).
I modeled all this in GPT-5.2, and the more realistic estimate for the 50% horizon, derived from the other benchmarks, is in the range of 190 to 210 minutes, depending on how much weight you put on the accuracy jump (impressive, but not to the degree of the 50% figure). The reported 80% horizon is likely a slight underestimate (my guess is closer to 29 minutes).
These numbers:
Map to around the 15th-20th percentile of the CI for the 50% horizon and around the 60th percentile for the 80% horizon, i.e. in the realm of “this is due to chance”. [1]
Give around a 5-6 month doubling time relative to Opus 4 and a 6-7 month doubling time relative to o3 (the arithmetic is sketched below the footnote).
Imply that the evidence for the 50%-80% gap systematically widening is weak.
[1] Note that this does provide evidence that Gemini 3 and GPT-5.2 will also have high p50 scores: not because of a capability jump per se, but because of the distribution of tasks within the METR benchmarks.
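For what it’s worth, the doubling-time arithmetic behind that bullet is just elapsed time divided by the log-ratio of horizons. A minimal sketch in Python, with illustrative placeholder numbers rather than the actual METR values:

```python
from math import log2

# Doubling time implied by two 50%-horizon measurements taken
# `dt_months` apart. The inputs below are illustrative placeholders.
def doubling_time_months(h_old: float, h_new: float, dt_months: float) -> float:
    return dt_months / log2(h_new / h_old)

# e.g. an 80-minute model and a revised ~200-minute estimate ~7 months later:
print(doubling_time_months(80, 200, 7))  # ~5.3 months
```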
Could you share your model, if you haven’t modeled it in an incognito chat? And don’t forget to check out my comment below. If you are right, then the alternate Claude who succeeded at 15-second-long tasks would have a lower 50% time horizon.
P.S. I also asked the AIs a similar question. Grok 4 failed to answer; Gemini 3 Pro estimated the revised 80% horizon as 35-40 mins and the 50% horizon as 6-7 hrs; GPT-5's free version gave me this piece of slop. EDIT: Claude Opus 4.5 estimated the revised 80% horizon at 30-45 or 45-60 mins and the 50% horizon as 2-3 hrs or 3-4 hrs.
It’s in a private workspace, so I can’t share the session. But the approach is simple and you don’t really need the session to understand it.
I think we’re coming at this from different angles: you’re doing a “white-box” critique (how specific task outcomes / curve fitting affect the METR horizon), whereas I’m doing a “black-box” consistency check: is the claimed p50 result consistent with what we see on other benchmarks that should correlate with capability?
The core model is as follows (a code sketch follows the list):
Take Sonnet 4 → Sonnet 4.5 and compute the improvement rate (slope).
Assume Opus improves at the same rate as Sonnet over this period.
Start from Opus 4 as the anchor and ask: “when would we expect to reach the Opus 4.5 reported value?” (For METR horizons I do this in log space; for accuracy/ECI I treat it as linear.)
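A minimal sketch of that projection, assuming the log-space treatment for horizons; the function and all numbers below are placeholders of mine, not METR’s or Anthropic’s:

```python
from math import log

def months_ahead(sonnet_old, sonnet_new, dt_sonnet,
                 opus_old, opus_new_reported, dt_opus, log_space=True):
    """How far ahead of (or behind) trend is the reported Opus 4.5 value?

    Fit the Sonnet 4 -> Sonnet 4.5 improvement rate, assume Opus moves at
    the same rate, project forward from Opus 4, and convert the gap between
    projection and report into months of trend progress.
    """
    f = log if log_space else (lambda x: x)  # horizons: log space; accuracy/ECI: linear
    slope = (f(sonnet_new) - f(sonnet_old)) / dt_sonnet  # improvement per month
    expected = f(opus_old) + slope * dt_opus             # projected Opus 4.5 value
    return (f(opus_new_reported) - expected) / slope     # + ahead / - behind, in months

# Illustrative horizons in minutes and release gaps in months, not real data:
print(months_ahead(60, 113, 5, 80, 289, 6))  # ~4 months ahead of trend
```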
That yields “time ahead/behind” vs the reported Opus 4.5 result:
ECI: ~1.3 months ahead
SWE-bench bash agent: on target (about a week behind)
METR accuracy: ~2.4 months ahead
METR 80% horizon: ~1 month behind
METR 50% horizon (using METR’s reported 289 min): ~4.5 months ahead
The point is that METR p50 is the outlier relative to the other signals.
If instead we assume Opus 4.5 is only as far “ahead” as the other benchmarks suggest, then p50 should be closer to the following (the inversion is sketched in code after these lists):
1.3 months ahead (ECI-like): ~200 minutes
2.4 months ahead (accuracy-like): ~226 minutes
And the corresponding implied p80 would be:
on-target: ~28 minutes
1.3 months ahead: ~30 minutes
2.4 months ahead: ~32 minutes
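Under the same placeholder trend, getting those implied values is just running the projection forward by the assumed number of months ahead. Again a sketch with my own illustrative slope, not the actual fit:

```python
from math import log, exp

def implied_horizon(opus_old, slope, dt_opus, months_ahead):
    """Horizon Opus 4.5 'should' report if it is only `months_ahead` ahead
    of the Sonnet-derived trend (log-space model; slope in log-minutes/month)."""
    return exp(log(opus_old) + slope * (dt_opus + months_ahead))

# With the placeholder slope from the sketch above (~0.127 log-minutes/month):
print(implied_horizon(80, 0.127, 6, 1.3))  # ~200 min for the "ECI-like" case
```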
My best guess is we’re ~1 month ahead overall, which puts p50/p80 in between those cases.

Finally, percentiles inside METR’s CI depend on the (unstated) sampling distribution; if you approximate it as log-normal you get the rough “position within the CI” numbers I mentioned, but it’s only an approximation.
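For concreteness, here is what that log-normal approximation looks like; the CI bounds in the example are hypothetical placeholders, since METR reports the interval’s endpoints but not the underlying distribution:

```python
from statistics import NormalDist
from math import log

def ci_percentile(x, point, ci_low, ci_high, level=0.95):
    """Approximate percentile of `x` within a reported CI, treating the
    sampling distribution as log-normal centered on the point estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)        # ~1.96 for a 95% CI
    sigma = (log(ci_high) - log(ci_low)) / (2 * z)   # log-space standard deviation
    return NormalDist().cdf((log(x) - log(point)) / sigma)

# e.g. a revised 200-min p50 against the reported 289 min, with a
# hypothetical 95% CI of [100, 700] minutes:
print(ci_percentile(200, point=289, ci_low=100, ci_high=700))  # ~0.23
```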