Correct me if I’m mistaken, but at this point it’s misleading to think of the frontier LLMs as “text predictors with some post-training”, and more accurate to think of them as “RL models that were initialized with a text predictor model”.
As I understand it, there’s now a massive amount of RLAIF to go along with expensive RLHF; some of the RL is persona training, some of it is technical training in fields where reliable feedback can be automated (e.g. is the output a valid program that passes the following tests).
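The "automatable feedback" idea can be made concrete with a toy sketch: execute a candidate program together with a test suite in a subprocess and map the exit status to a binary reward. (Everything here — function name, the example `add` task — is illustrative; it is not any lab's actual RLVR pipeline.)

```python
import subprocess
import sys
import tempfile

def verifiable_reward(program: str, tests: str) -> float:
    """Toy verifiable reward: 1.0 iff `program` plus `tests` runs cleanly.

    Writes the candidate program and its assertions to a temp file, runs
    it in a fresh interpreter, and turns the exit status into a reward.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n" + tests + "\n")
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, timeout=10
    )
    return 1.0 if result.returncode == 0 else 0.0

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
suite = "assert add(2, 3) == 5"
```

Because the grader is just "did the tests pass", this kind of reward scales without human labels — which is the point of the RLAIF/RLVR contrast with expensive RLHF.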
Starting off with a text predictor is key, because that makes the LLM represent a lot of useful concepts; but the RL phase is doing an increasing amount of lifting. In particular, that means there’s no reason to expect coding or math to cap out at “imitating the best humans”, for the same reason that self-play helped AlphaGo to supersede the best humans.
Checking here first before I start injecting “text predictors are only the larval stage of modern LLMs” into the discourse.
While there are various issues with it, one anchor for comparing the “degree to which LLMs are shaped by RL vs pretraining” is “how many distinct ‘tasks’ was the LLM given to complete under each?”.
In pretraining, each forward pass corresponds to one evaluatable and distinct ‘reward’-event. In RL you need many forward passes (my guess is usually on the order of ~1000 for common tasks in the RL training set) to get one such event. So naively, to get the same amount of mind-shaping from RL as from pretraining, you would need to reach the stage where ~99.9% of your training is RL, not just >50%.
I think for various reasons this does overestimate how high the ratio would need to be, but I do think it suggests pretraining will play a larger role than naive compute comparisons would suggest in the resulting minds of the LLMs.
Ah, Claude helped me remember the historical parallel that serves as an intuition pump: in the early days of the deep learning revolution, Hinton and Bengio found it extremely useful to do unsupervised learning on a network first, before doing supervised learning. The post-unsupervised-learning network ended up in the basin of better local optima because it already represented key concepts.
Analogously, I expect that initializing an RL algorithm with a good predictive network makes it massively better and more efficient.
One bit of evidence here (and this is prior to the RL stage) is that you need a lot more compute to train the base model than you need for the fine-tuning step. Summoning a rich set of concepts from the ether takes the vast majority of the effort, compared to highlighting the important ones.
Before LLMs, RL had very unimpressive results in rich domains (because random flailing wouldn’t get you a meaningful amount of learning) and people kept talking about “model-based RL” but their handmade world-model architectures just didn’t work.
I’m arguing that the reason for this is that the vast majority of the effort needed for RL in a rich domain comes from assembling relevant concepts, and that shaping behavior once you have those concepts is a lot more efficient. (And hand-made world models just didn’t include enough important concepts.)
Humans also get massively more unsupervised learning than RL, for similar reasons: unsupervised learning data is extremely cheap and predictive processing is always on. You get MB/s for early vision; I’d guesstimate kB/s for the highest-level compressed abstractions from the senses as input to consciousness (“scene graph” level while watching moving objects, “parsed audio” level, etc.); conscious decision making has been estimated at on the order of 10 b/s (“The Unbearable Slowness of Being: Why do we live at 10 bits/s?”). But you only get maybe 3 bits per second of reward-model feedback (dopamine is slower and usually doesn’t have something to say about every action), and bits per minute or hour for overall task success (the underlying thing dopamine is the predictor for). And yet humans end up extremely competent at advanced disciplines. Presumably unsupervised modeling of the experience data generated by the agent’s own actions does most of the work to bridge from microseconds to seconds, and the reward model closes the remaining gap from seconds to hours.
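The guesstimated bandwidths above are easier to compare side by side (all figures are the comment's rough numbers, collected here only to make the ratios explicit):

```python
# Guesstimated learning-signal bandwidths for a human, in bits/s.
BITS_PER_S = {
    "early vision":            8e6,     # ~1 MB/s of raw sensory input
    "high-level abstractions": 8e3,     # ~1 kB/s into consciousness
    "conscious decisions":     10.0,    # the "10 bits/s" estimate
    "reward-model feedback":   3.0,     # dopamine-level signal
    "task-level success":      1 / 60,  # order of bits per minute
}

reward = BITS_PER_S["reward-model feedback"]
ratios = {name: bps / reward for name, bps in BITS_PER_S.items()}

# Unsupervised sensory input outpaces reward feedback by ~6 orders
# of magnitude on these numbers.
print(f"vision / reward feedback ≈ {ratios['early vision']:.1e}")
```

On these assumed figures the unsupervised stream carries roughly a million times more bits than the reward channel, which is the quantitative core of the "unsupervised learning does most of the work" claim.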
Relatedly, I don’t buy the recent claims that continual learning is not a big deal. It might not be enough to massively exceed human level, but it seems likely that it will be qualitatively stronger than in-context learning, because it can actually move concepts around, saving superposition bandwidth in the residual stream for actually-dynamic things.
In pretraining, each forward pass corresponds to one evaluatable and distinct ‘reward’-event.
In pretraining, you get one loss signal for each token in the forward pass; a single batch typically contains 10-100M tokens. For RL, you get a few bits of reward for each trajectory, which consists of many forward passes. So the efficiency difference is even larger than you outline here.
For RL, the loss signal is spread across all tokens in a trajectory, by either the reward model or just the policy gradient. Either way, a gradient still flows into every output token. That gradient contains less Shannon information, but might not contain as much less V-information as you’d think.
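The "spread across all tokens" point can be seen in a minimal REINFORCE-style sketch (toy numbers throughout): one scalar reward per trajectory, but every sampled token gets its own loss term, so gradients reach all of them.

```python
import math

def reinforce_token_losses(token_logprobs, reward, baseline=0.0):
    """REINFORCE-style credit assignment, stripped to its core.

    A single scalar (reward - baseline) multiplies the log-prob of every
    token the policy sampled, so each output token receives a loss term
    -- and hence a gradient -- even though the reward itself is only a
    few bits of information.
    """
    advantage = reward - baseline
    return [-advantage * lp for lp in token_logprobs]

# One toy trajectory: log-probs of the 4 tokens the policy sampled.
logps = [math.log(0.5), math.log(0.25), math.log(0.9), math.log(0.1)]
losses = reinforce_token_losses(logps, reward=1.0, baseline=0.3)
# Four distinct per-token loss terms from one scalar reward; the
# *information* in the resulting gradients is bounded by the reward
# signal, not by the number of terms.
```

This is why "one reward per trajectory" understates the mechanical reach of the update, while the Shannon/V-information distinction captures what the update can actually teach.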
And yet, current LLMs have noticeably different personas from each other, as well as coding skills that significantly outstrip what you would expect from imitation of the corpus. So their post-training has a large impact.
I’m hesitant to argue about this outside the context of a specific question (i.e., in the context of what question are we thinking of LLMs as “text predictors with some post-training” or not?)…
…But for what it’s worth, some papers that I interpret as generally downplaying the role and irreplaceability of RLVR are: Karan & Du 2025, Venhoff et al. 2025, Yue et al. 2025. (Note that they’re not studying the latest and greatest frontier models, not sure how much to worry about that.)
There’s also the point about information efficiency per FLOP, cf. Toby Ord and Dwarkesh.
Another suggestive piece of evidence is that the RLVR chains-of-thought can be pretty weird but still very obviously strongly influenced by pretraining. We’re still a LONG way away from seeing a chain-of-thought like “…5Bn✅%SjYEℐkIo➅khPi▽Te☔PWBl^IO1⅗FIw…”. (Cf. the Karpathy quote: “You know you did RL right when the models stop thinking in English”.)
While I generally agree with you, I’m getting more worried that the caveat of “they’re not studying the latest and greatest frontier models” is particularly applicable here, due to Liu et al. (2025), which shows that in some cases RLVR can create capabilities out of whole cloth.
So while I do think 2025-era frontier models aren’t influenced much by RLVR, I do expect 2026 and especially 2027-era LLMs to be influenced by RLVR much more relative to today, on both capabilities and alignment.
I think I agree with your statement once a significant amount of capabilities is learned in RL.
I’m confused about how much current models have learned via RL.
The persona selection model argues that post-training mostly selects an existing persona that was learned in pre-training (though maybe this is mostly related to character, and somewhat orthogonal to capabilities learned by post-training RL).
Venhoff et al. seems to suggest that reasoning training only affects somewhat specific parts of the model (though maybe those parts are just super important).
The pre-training forms the foundation (LeCun: “Self-supervised learning: The dark matter of intelligence”, tailcalled: “At its most basic, unsupervised prediction forms a good foundation for later specializing the map to perform specific types of prediction”) which gives the model common sense and general abilities, while reinforcement learning adds something like goal orientation on top.