Thanks for the comments! Besides the below, I’m curious what your overall views are. What does your distribution for AC look like?
The authors don’t seem to address the possibility that we are seeing a temporary acceleration of AI, because the labs are ramping up methods that are much more expensive to scale, but are doing so from very low baselines.
I think this is basically addressed in our uncertainty over the present doubling time, at least that’s how I’d think of it for myself. Note that my median present doubling time estimate of 5.5 months is slower than the potentially accelerated recent time horizon trend.
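To make the difference concrete, here's a minimal sketch of what a fixed doubling time implies for time-horizon growth (the 4-month figure for the "recent trend" and the 2-hour starting horizon are illustrative assumptions, not numbers from our model):

```python
# Minimal sketch: what a fixed doubling time implies for time-horizon
# growth. The 4-month "recent trend" figure and the 2-hour starting
# horizon are illustrative assumptions, not numbers from our model.

def horizon_after(years: float, doubling_time_months: float, start_hours: float = 2.0) -> float:
    """Time horizon after `years` of exponential growth at a fixed doubling time."""
    doublings = 12 * years / doubling_time_months
    return start_hours * 2 ** doublings

for label, dt in [("median estimate", 5.5), ("hypothetical faster recent trend", 4.0)]:
    growth_per_year = 2 ** (12 / dt)
    print(f"{label}: {dt}-month doubling -> {growth_per_year:.1f}x/year, "
          f"~{horizon_after(3, dt):,.0f}h horizon after 3 years")
```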
I don’t think there’s any reason to believe that AI-aided R&D acceleration has happened in any meaningful way.
Our model reflects that: with my median parameters, the current software R&D uplift is 1.1x.
2- One place where there has been an acceleration is in my spending on AI. I am now spending more than one thousand dollars on tokens, and the marginal task of my job that I am automating with AI costs what I used to pay for AI in an entire month. Toby Ord argues that the costs of AI are increasing exponentially: “the hourly costs for some models are now close to human costs.” The evidence is thin and needs further work, but if each capability jump makes the marginal task exponentially more expensive, while prices for a fixed level of intelligence fall 90% per year, one could imagine a world where we achieve AGI in 2028 but can only deploy it economically in 2030, and a world where we achieve the Automated Coder in 2031 but can only deploy it economically in 2035.
Our Automated Coder definition has efficiency requirements built in, so you wouldn’t put it that way; you’d instead say you get an Automated Coder in 2035 and a very expensive replication of AC abilities in 2031. I personally think that a large majority of the relevant recent gains have not come from inference scaling, but if I did think that a lot of it had been, I would adjust my present doubling time to be slower.
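For what it's worth, here's a minimal sketch of the deployment-lag arithmetic under your own assumptions (prices falling 90% per year, i.e. 10x cheaper per year, at a fixed capability level; the 10,000x overhang is just what a four-year gap implies at that decline rate):

```python
import math

# Sketch of the deployment-lag arithmetic above, under the comment's own
# assumption of prices falling 90%/year (10x cheaper per year) at a fixed
# capability level. The 10,000x overhang is illustrative: it's the cost
# gap that a four-year deployment lag implies at that decline rate.

def years_until_economical(cost_overhang: float, annual_price_drop: float = 10.0) -> float:
    """Years for a capability costing `cost_overhang`x the economical
    threshold to become economical, given exponential price decline."""
    return math.log(cost_overhang) / math.log(annual_price_drop)

print(years_until_economical(10_000))  # 4.0, i.e. 2031 -> 2035
```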
Here are some portions of a rough Slack message I wrote recently on this topic:
Let me try… a concrete case study: let’s compare GPT-4 and GPT-5 on long-horizon coding (GPT-3 vs. GPT-4 would be even more obvious, but it’s perhaps better to discuss a more recent jump).
Our model says that this is a 10,000x increase in effective compute, i.e. 4 OOMs (it seems more relevant to discuss something like effective compute, ECI, etc. rather than pure compute scaling, because pure compute scaling isn’t what happens in practice). Now your numbers (as far as I understand) say that we could achieve the same gains with 6 OOMs of inference compute if this all came from pretraining, or 2 OOMs of inference compute if this all came from RL [note for LW: this was responding to the exchange rates proposed in https://www.tobyord.com/writing/how-well-does-rl-scale]. From https://evaluations.metr.org/gpt-5-report/, I’m attaching what they say for the curve of tokens->performance on METR’s time horizon suite.
We can’t even see GPT-4 here, but GPT-4o, for example, is clearly asymptoting at something like 10-minute time horizons. Meanwhile GPT-5 is above 2 hours at max tokens. If we look at 10-minute time horizons, then according to this graph GPT-5 is a bit more expensive, though iirc the graph overrepresents GPT-5 costs (e.g. it should not be near o3's costs). But if we look at 2-hour horizons (or even like 20+ mins), it’s essentially an infinite cost improvement over GPT-4o, much less GPT-4 (this is a bit oversimplified because models obviously have probabilistic success rates at each horizon, but I don’t think it changes the basic takeaway).
So stepping back: the exchange rate between scaling effective compute / ECI / “years of recent progress” (pick your favorite) and inference scaling changes a ton depending on what difficulty of task you’re looking at, and for harder tasks (and larger effective compute differences) you basically can’t match the gains with any practically achievable amount of inference scaling. And imo those are the tasks we care the most about! So I find these inference scaling comparison numbers interesting and informative for some questions, but less relevant to the overall picture than other capability forecasting lenses.
Btw, also attaching a pic from https://www.anthropic.com/news/claude-opus-4-5 comparing SWE-bench Verified on Sonnet and Opus 4.5. Obviously just one data point, but I found it interesting that over a short time frame (~2 months) Anthropic saw a 5x token efficiency improvement at high levels of SWE-bench Verified performance (Opus 4.5 is about 1-1.7x as expensive per token). And that’s not even looking at the highest levels; I assume the multiplier would be much higher if you tried to scale Sonnet to reach Opus’s peak performance.
[end Slack message]
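To make the exchange-rate arithmetic in that message explicit, here's a small sketch using the rates under discussion (taken from the linked exchange, not settled facts):

```python
# The exchange-rate arithmetic from the Slack message, made explicit.
# The rates are the ones under discussion in the linked post (matching
# 1 OOM of pretraining-driven gains is taken to cost ~1.5 OOMs of
# inference compute; 1 OOM of RL-driven gains, ~0.5 OOMs), not settled facts.

effective_compute_ooms = 4  # GPT-4 -> GPT-5, ~10,000x per our model

for source, rate in {"all from pretraining": 1.5, "all from RL": 0.5}.items():
    inference_ooms = effective_compute_ooms * rate
    print(f"{source}: {inference_ooms:.0f} OOMs (~{10 ** inference_ooms:,.0f}x) of inference compute")

# And the Opus 4.5 / Sonnet data point: 5x fewer tokens at ~1.7x the
# per-token price is still ~3x cheaper per task at that performance level.
print(f"Opus cost per task vs. Sonnet: ~{(1 / 5) * 1.7:.2f}x")
```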
Furthermore, it seems that once capabilities can be reached very expensively, they pretty reliably get cheaper very quickly. See here for my research into this, or just skip to Epoch’s data, which I used as input to my parameter estimate; happy to answer questions, and sorry that my explanation is pretty rough.
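In case it helps, the core of that parameter estimate is just fitting an exponential decline to the price of a fixed capability level over time; a toy version (on made-up numbers, since I won't reproduce Epoch's data here) looks like this:

```python
import math

# Toy version of the price-decline estimate: fit an exponential decay to
# the price of a fixed capability level over time. The (year, $/M tokens)
# pairs are made up for illustration; the real estimate uses Epoch's data.

observations = [(2023.0, 60.0), (2024.0, 6.5), (2025.0, 0.5)]

# Least-squares slope of log(price) vs. time gives the annual decline factor.
n = len(observations)
x_bar = sum(t for t, _ in observations) / n
y_bar = sum(math.log(p) for _, p in observations) / n
slope = sum((t - x_bar) * (math.log(p) - y_bar) for t, p in observations) / \
        sum((t - x_bar) ** 2 for t, _ in observations)
print(f"Fitted price decline: ~{math.exp(-slope):.0f}x cheaper per year")  # ~11x
```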
3- Despite the METR and ECI indexes of capability per unit of time following an exponential (with even some acceleration), the underlying trends have changed massively.
a- Pretraining scaling has slowed down massively since the GPT-4.5 debacle.
b- Massive efforts have gone into creating human-curated data around the matters we care about. SemiAnalysis says the labs are spending single-digit billions on human-generated data. Beren argues most algorithmic progress is data progress. Obviously, replacing a corpus of text from random dudes debating in a 2007 forum with all the intermediate steps of a math proof by a math PhD improves the models. Just as obviously, this can’t scale and is a one-off improvement.
c- Inference-time scaling has been improving the models considerably, to the point that I consider OpenAI models like GPT-5.2-Codex-High unusable, given how slow they are. Not only that, but gains from inference-time scaling must be paid for every time they are executed. I don’t think we can continue to scale inference-time compute into the back half of the decade.
d- Toby Ord also argues that RL is on the order of 1,000,000x less compute-efficient than pretraining. He says: “I estimate that at the time of writing (Oct 2025), we’ve already seen something like a 1,000,000x scale-up in RL training and it required ≤2x the total training cost. But the next 1,000,000x scale-up would require 1,000,000x the total training cost, which is not possible in the foreseeable future.” Regardless of the exact level, I feel anyone paying attention senses the same thing. Ilya argues that RL is learning through a straw.
I think I already addressed at least part of this in my answer to (2).
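But to spell out the RL arithmetic specifically, taking Toby Ord's figures at face value (the 1e-6 starting share of RL compute is what "1,000,000x scale-up for ≤2x total cost" implies):

```python
# Spelling out Toby Ord's RL cost arithmetic, taking his figures at face
# value: a 1e-6 starting share of RL compute is what "1,000,000x scale-up
# for <=2x total training cost" implies.

pretrain = 1.0     # pretraining compute, normalized
rl_start = 1e-6    # initial RL compute as a fraction of pretraining

for scaleup in (1e6, 1e12):
    total = pretrain + rl_start * scaleup
    print(f"RL scaled {scaleup:.0e}x -> total training compute {total:,.0f}x the original")
# 1e6x: ~2x total (RL reaches parity with pretraining)
# the next 1e6x on top: ~1,000,001x total (RL dominates the bill)
```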
4- The authors don’t address that they are making a somewhat unverifiable prediction. The largest tasks in the METR suite are on the order of 16 hours. I’d argue that the complexity of benchmarking translates into the complexity of improving the models themselves.
I don’t understand this. What exactly do you want us to address? Why should we adjust our predictions because of this? We do explicitly say we are assuming a hypothetical “METR-HRS-Extended” benchmark in our explanation. Ok, maybe you are saying that it will be hard to create long-horizon tasks, which would slow down the trend. I would say that I adjust for this by making my all-things-considered AC prediction longer due to the potential for data bottlenecks, and also to some extent by making the doubling difficulty growth factor higher than it otherwise would be.
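For concreteness, here's a minimal sketch of how a doubling difficulty growth factor stretches the extrapolation, assuming each successive doubling takes growth-factor-times longer than the last (one natural reading of the parameter; the values below are illustrative, not our model's):

```python
# Illustrative sketch of a "doubling difficulty growth factor": each
# successive time-horizon doubling takes `growth` times as long as the
# previous one. Parameter values here are made up, not our model's.

def years_for_doublings(n_doublings: int, first_doubling_months: float, growth: float) -> float:
    """Total years for n doublings when each takes `growth`x longer than the last."""
    months = sum(first_doubling_months * growth ** i for i in range(n_doublings))
    return months / 12

# e.g. ~10 doublings from a 2-hour horizon to a ~2,000-hour one:
for g in (1.0, 1.1, 1.2):
    print(f"growth factor {g}: {years_for_doublings(10, 5.5, g):.1f} years")
```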
All that said, I confess the straight lines on a chart are immensely persuasive and hard not to extrapolate for many years via the Lindy Effect.
Yeah, for the parts I didn’t explicitly respond to, my response is mainly that this sort of inside-view reasoning is valuable, but overall I give more weight to trend extrapolation, and historically simple trend extrapolations like “when will we have the same ops/second as the human brain” have performed pretty well, as we discuss in our blog post.