My hot take, thinking step by step, expecting to be wrong about things & hoping to be corrected:
What you’re basically doing is looking at the part of the s-curve prior to plateauing (the exponential growth part) and noticing that, in that regime, scaling up inference compute buys you more performance than scaling up training compute.
However, afaict, scaling up training compute lets you push the plateau of the inference-scaling curve out/higher. GPT-5 pumped up with loads of inference compute is significantly better than GPT-4 pumped up with loads of inference compute. Not just a little better. They aren’t asymptoting to the same level.
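(To make the shape of that claim concrete, here’s a minimal toy sketch, not anything from your post: performance saturates in inference compute, but the ceiling it saturates at rises with training compute. The functional form and all numbers are made up purely for illustration.)

```python
import math

def toy_performance(train_compute, inference_compute):
    # Toy model (made-up functional form): returns to inference compute
    # saturate, but the plateau rises with training compute.
    ceiling = math.log10(train_compute)                     # plateau set by training
    saturation = 1 - math.exp(-inference_compute / 100.0)   # diminishing returns to inference
    return ceiling * saturation

# Two hypothetical training budgets, each given lots of inference compute:
# the curves flatten out at different levels rather than converging.
for train in (1e25, 1e27):
    print(train, [round(toy_performance(train, c), 2) for c in (10, 100, 1_000, 10_000)])
```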
So I think you are missing the important reason to do RL training. Obviously you shouldn’t do RL training for a use-case that you can already achieve by just spending more inference compute with existing models! (Well, I mean, you still should, depending on how much you are spending on inference. The economics still works out depending on the details.) But the point of RL training is to unlock new levels of capability that you simply couldn’t get by massively scaling up inference on current models.
Now, all that being said, I don’t think it’s actually super clear how much unlock you get. If the answer is “not much” then yeah RL scaling is doomed for exactly the reasons you mention. But there seems to have at the very least been a zero to one effect, where a little bit of RL scaling resulted in an increase in the level at which the inference scaling curve plateaus. Right?
Like, you say:
So the evidence on RL-scaling and inference-scaling supports a general pattern:
a 10x scaling of RL is required to get the same performance boost as a 3x scaling of inference
a 10,000x scaling of RL is required to get the same performance boost as a 100x scaling of inference
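(Spelling out the arithmetic that seems to link those two lines, on my assumption that the trade-off is a constant exponent rather than something your post states: 10,000x is (10x)^4, and 3^4 ≈ 81, i.e. roughly the quoted 100x.)

```python
import math

# If 10x RL buys the same boost as 3x inference, and the trade-off is a
# constant exponent k (my assumption, not the post's), then
# equivalent_inference_multiplier = rl_multiplier ** k.
k = math.log(3) / math.log(10)   # ~0.477

for rl_mult in (10, 1_000, 10_000):
    print(rl_mult, "x RL ~", round(rl_mult ** k, 1), "x inference")
# 10x RL ~ 3x inference; 10,000x RL ~ 81x inference, i.e. roughly the quoted 100x.
```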
Grok 4 probably had something like 10,000x or more RL compared to the pure pretrained version of Grok 4. So would you predict therefore that xAI could take the pure pretrained version of Grok 4, pump it up with 100x inference compute (so, let it run 100x longer for example, or 10x longer and 10x in parallel) and get the same performance? (Or I’d run the same argument with the chatbot-finetuned version of Grok 4 as well. The point is, there was some earlier version that had 10,000x less RL.)
I agree that separately from its direct boost to performance at the same inference-compute, RL training also helps enable more inference scaling. I talk about that above when I say “this RL also unlocked the ability to productively use much longer chains of thought (~30x longer in this example). And these longer chains of thought contributed a much larger boost.”
A key thing I’m trying to get across is that I think this is where most of the benefit from RL is coming from, i.e. that while you pay the RL-scaling costs at training time, you also need to pay the inference-scaling costs at deployment time in order to get the benefit. Thus, RL is not an alternative to a paradigm where cost-per-use has to go up 10x, 100x, 1,000x etc. in order to keep seeing benefits.
Many people have said the opposite. The reason I looked into this is that people such as Dan Hendrycks and Josh You pushed back on my earlier statements that most of the gain has been in enabling longer CoT and the scaling of inference, saying respectively that it makes models much better at the same token-budget and that we’re witnessing a one-off scale up of token budget but further gains will come from RL scaling. I think I’ve delivered clear evidence against those takes.
You’d probably enjoy my post on Inference scaling reshapes AI governance and the whole series on my website. I think they paint a picture where compute scaling is becoming a smaller tailwind for AI progress and where it is changing in character.
That’s reasonable, but it seems to be different from what these quotes imply:
So while we may see another jump in reasoning ability beyond GPT-5 by scaling RL training a further 10x, I think that is the end of the line for cheap RL-scaling.
… Now that RL-training is nearing its effective limit, we may have lost the ability to effectively turn more compute into more intelligence.
There are a bunch of quotes like the above that make it sound like you are predicting progress will slow down in a few years. But instead you are saying that progress will continue, and AIs will become capable of doing more and more impressive tasks thanks to RL scaling, but they’ll require longer and longer CoTs to do those more and more impressive tasks? That’s very reasonable and less spicy/contrarian; I think most people would already agree with that.
I like your post on inference scaling reshaping AI governance. I think I agree with all the conclusions on the margin, but think that the magnitude of the effect will be small in every case and thus not change the basic strategic situation.
My own cached thought, based on an analysis I did in ’22, is that even though inference costs will increase they’ll continue to be lower than the cost of hiring a human to do the task. I suppose I should revisit those estimates...
I do think that progress will slow down, though it’s not my main claim. My main claim is that the tailwind of compute scaling will become weaker (unless some new scaling paradigm appears or a breakthrough saves this one). That is a piece in the puzzle of whether overall AI progress will accelerate or decelerate, and I’d ideally let people form their own judgments about the other pieces (e.g. whether recursive self-improvement will work, or whether funding will collapse in a market correction, taking away another tailwind of progress). But having a major boost to AI progress (compute scaling) become less of a boost is definitely some kind of an update towards lower AI progress than you were otherwise expecting.
Part of the issue with inference scaling as the main surviving form of scaling is how many more OOMs of it are needed. If it is 100x, there isn’t so much impact. If we need to 1,000x or 1,000,000x it from here, it is more of an issue.
In that prior piece I talked about inference-scaling as a flow of costs, but it also scales with things beyond time:
costs grow in proportion to time (can’t make up the costs by longer use before the new model)
costs grow in proportion to number of users (can’t make up the costs through market expansion)
costs grow in proportion to the amount of use by each user (can’t make up costs through intensity of use)
This is a big deal. If you want to 100x the price of inference going into each query, how can you make that up and still be profitable? I think you need to 100x the willingness-to-pay from each user for each query. That is very hard. My guess is that the WTP doesn’t scale with inference compute in this way, and thus that inference can only be 10x-ed when algorithmic efficiency gains and falling chip costs have divided the cost per token by 10. So while previous rounds of training-compute scaling could pay for themselves in the marketplace, I think that will stop for most users soon, and for specialist users a bit later.
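(Here’s a minimal sketch of the break-even logic I have in mind, with made-up numbers: a query is only worth serving if willingness-to-pay covers its inference cost, so 100x-ing the tokens per query needs roughly 100x the WTP per query or a roughly 100x cheaper token.)

```python
# Toy break-even check with hypothetical numbers: a provider stays profitable
# on a query only if willingness-to-pay covers that query's inference cost.
def profitable(wtp_per_query, tokens_per_query, cost_per_token):
    return wtp_per_query >= tokens_per_query * cost_per_token

base_tokens, base_cost, base_wtp = 1_000, 0.00001, 0.05   # hypothetical baseline

# 100x the tokens per query: only profitable if WTP also roughly 100x's,
# or if algorithmic efficiency / cheaper chips cut cost per token ~100x.
print(profitable(base_wtp,       100 * base_tokens, base_cost))        # False
print(profitable(100 * base_wtp, 100 * base_tokens, base_cost))        # True
print(profitable(base_wtp,       100 * base_tokens, base_cost / 100))  # True
```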
The idea here is that the changing character of scaling affects the business model, making it so that it is no longer self-propelling to keep scaling, and that this will mean the compute scaling basically stops.
PS Thanks for pointing out that second quote “Now that RL-training…” — I think that does come across a bit stronger than I intended.
“inference scaling as the main surviving form of scaling” --> But it isn’t, though; RL is still a very important form of scaling. Yes, it’ll become harder to scale up RL in the near future (recently they could just allocate more of their existing compute budget to RL, but soon they’ll need to grow their compute budget), so there’ll be a slowdown from that effect, but it seems to me that the next three OOMs of RL scaling will bring at least as much benefit as the previous three OOMs of RL scaling, which was substantial as you say (largely because it ‘unlocked’ more inference-compute scaling; the next three OOMs of RL scaling will ‘unlock’ even more).
Re: Willingness to pay going up: Yes, that’s what I expect. I don’t think it’s hard at all. If you do a bunch of RL scaling that ‘unlocks’ more inference scaling—e.g. by extending METR-measured horizon length—then boom, now your models can do significantly longer, more complex tasks than before. Those tasks are significantly more valuable and people will be willing to pay significantly more for them.