I ran Claude Mythos's 93.9% score on SWE-bench Verified through my analysis, which estimates time horizons from percentage scores based on the task-time distribution derived from commit timestamps.
Compared to Claude Opus 4.6's 80.8%, this pushes the imputed 50% time horizon from 6 h to 34.4 h and the 80% time horizon from 1.9 h to 11 h.
Do you have details of this analysis published anywhere? I’m curious to take a look. Thanks!
I never published it because it seemed clear that SWE-bench Verified saturates well below 100%, and without a known saturation point the derived time horizons could really be anything. I wasn't even sure getting 93.9% was possible.
Of course, one doesn't need to translate the 93.9% into time horizons to see that this is a huge, discontinuous jump.
Opus 4.6 was released on Feb 5. Mythos Preview sort of released today, if you count the system card. Your results impute a ~5.7x time-horizon jump.
That's 61 days, which works out to a doubling period of 61/log2(5.7) ≈ 24 days.
METR's most recent doubling period was 4.3 months; AI Futures' most recent one is 4. This would be about 0.8 months.
That would be very off-trend if true, to say the least.
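The arithmetic above can be checked directly, using the dates and horizon figures quoted in the thread:

```python
import math

# Imputed 50% time horizon jump from the thread: 6 h -> 34.4 h,
# over the 61 days between the two release dates used above.
ratio = 34.4 / 6.0                      # ~5.7x
days = 61
doubling_days = days / math.log2(ratio)
print(round(doubling_days, 1))          # ~24.2 days
print(round(doubling_days / 30.44, 2))  # ~0.8 months
```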
We should expect a step change in model size and pretraining compute to be off trend.
Wouldn't that imply that Opus 4.6 should have been below trend, since it wasn't a step change in model size or pretraining compute (as far as I know)?
Just by looking at the benchmark scores you can see that it is very off-trend. But of course the error bars for such long time horizons (even with much better methodology than mine) are huge.
You should also put ever-decreasing credence in reported time horizons; cf. Ryan's post:

"The old METR time horizon benchmark has mostly saturated when it comes to measuring the 50%-reliability time horizon (as in, scores are sufficiently high that this measurement is unreliable), but at 80% reliability the best publicly deployed models are at a bit over an hour, while I expect the best internal models are reaching a bit below 2 hours. I expect that increasingly this 80%-reliability score is dominated by relatively niche tasks that don't centrally reflect automating software engineering or AI R&D. Further, the time horizon measurement is increasingly sensitive to the task distribution."
Hm, this is comparing a model release date with a model preview internal release date. I suspect that Opus 4.6 had an internal preview release date earlier than Feb 5th, so you probably need to add some kind of fudge factor to get an accurate doubling period.
Do you mention Claude Opus 4.6's 80.8% because your analysis has one free parameter and you set it to fit that 80.8%? How well does your analysis translate other models' percentage scores into their time horizons?
I mention Opus 4.6 because it is the predecessor model and this allows a comparison between the numbers that pop out of my analysis and the “official” METR values.
My analysis at least recovered the exponential improvement of time horizons, with doubling times similar to the METR analysis, but the concrete values depend on modelling assumptions.
If I find the time I might write it up after all, but here is a short sketch:
Two assumptions:
The logistic curves fitted by METR tend to have quite similar slopes (at least for the later models), so I take the average slope for my fit.
The task completion times of SWE-bench Verified are log-normally distributed; I derive the concrete distribution from commit timestamps by cleverly correcting for pauses. Here, different modelling assumptions don't change the trend but can change the time-horizon values.
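The pause correction isn't spelled out, but one hedged guess at its shape: treat large inter-commit gaps as breaks and count them as a nominal amount of working time instead. The cutoff and fill values below are illustrative placeholders, not the author's choices.

```python
from datetime import datetime

def active_hours(timestamps, cutoff_h=2.0, fill_h=0.25):
    """Sum inter-commit gaps, treating gaps above cutoff_h as pauses
    and counting each as fill_h hours of actual work instead."""
    ts = sorted(timestamps)
    total = 0.0
    for a, b in zip(ts, ts[1:]):
        gap = (b - a).total_seconds() / 3600.0
        total += gap if gap <= cutoff_h else fill_h
    return total

# Two commits an hour apart, then one after an overnight pause:
commits = [datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 10),
           datetime(2024, 1, 2, 9)]
print(active_hours(commits))  # 1.25 (the raw span would be 24 h)
```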
With the slope and the distribution, I can find for each percentage score the position of the logistic, which gives me the time horizons.
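A minimal sketch of how those pieces could fit together, under my reading of the description (the logistic form, slope value, and log-normal parameters below are illustrative placeholders, not the author's): a logistic success curve in log task time with a fixed slope is integrated against a log-normal task-time density, and the curve's position is bisected until the implied benchmark score matches the observed percentage; the 50% and 80% horizons then fall out of the fitted curve.

```python
import math

def success_prob(log_t, log_h50, slope):
    # Logistic in log task time; P = 0.5 exactly at the 50% horizon.
    return 1.0 / (1.0 + math.exp(slope * (log_t - log_h50)))

def expected_score(log_h50, slope, mu, sigma, n=2001):
    # Benchmark score = integral of P(success | t) against a log-normal
    # task-time density, computed on a grid in log t (Gaussian there).
    lo, hi = mu - 6 * sigma, mu + 6 * sigma
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * step
        pdf = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total += success_prob(x, log_h50, slope) * pdf * step
    return total

def horizons_from_score(score, slope, mu, sigma):
    # Bisect on the logistic's position until the model-implied score
    # matches the observed one, then read off the 50% and 80% horizons.
    lo, hi = mu - 20.0, mu + 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if expected_score(mid, slope, mu, sigma) < score:
            lo = mid
        else:
            hi = mid
    log_h50 = (lo + hi) / 2
    log_h80 = log_h50 - math.log(4) / slope  # logit(0.8) = log 4
    return math.exp(log_h50), math.exp(log_h80)
```

With a slope of about log(4)/log(3) ≈ 1.26 the fitted curve's 50%/80% horizon ratio is 3, roughly consistent with the 6 h / 1.9 h and 34.4 h / 11 h pairs in the thread.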
The ratio of the 80% to 50% time horizon in your modeling is low at only 3; traditionally it has been 5-6. In fact, 3 is the lower bound of what should be plausible, representing a world where all subtasks of a given task have independent odds of success (normally we'd expect some correlation in success between subtasks).
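One way to see the bound of roughly 3 cited here: if a task of length t is a chain of independent subtasks, success probability decays as exp(-λt), so the r-reliability horizon is t_r = -ln(r)/λ and the 50%/80% ratio is ln(0.5)/ln(0.8) regardless of λ:

```python
import math

# Independent-subtask model: P(success on task of length t) = exp(-lam * t),
# giving t_r = -ln(r) / lam; the 50%/80% horizon ratio cancels lam entirely.
ratio = math.log(0.5) / math.log(0.8)
print(round(ratio, 2))  # ~3.11
```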
That said, I don't think SWE-bench Verified is useful for inferring METR data, for several reasons:
OpenAI has noted the benchmark is both contaminated and nearing saturation.
Opus 4.6 had large gains over Opus 4.5 but no gain on SWE-bench.
If I had to ballpark this, I'd rely on the gain on SWE-bench Pro relative to Opus 4.6 being similar to going from Opus 4.1 to Opus 4.5 or 4.6, depending on your modeling. That would imply an 80% time horizon of something like 2.5 to 4 hours. But many caveats apply, especially given the high levels of memorization present in these benchmarks.
My model takes the average slope of earlier logistic curves. If for some reason the logistic fitted for Mythos is much less steep than for earlier models, the ratio of the time horizons could be different. I'll have to wait for a task-level analysis to see.