Pretraining memorizes all the facts in the world, but only gives weak fluid intelligence (in-context learning). RLVR trains crystallized intelligence that expresses itself in-context as strong fluid intelligence (which isn’t fake or illusory, within its scope), but this narrow strength falls apart sufficiently far out of distribution, regressing to pretraining levels. Thus jaggedness (in fluid intelligence) is a good way of framing this, and currently the jaggedness profile is determined by RLVR training, that is, by which topics were sufficiently covered in the RL training data. Humans are different in having a higher baseline of general fluid intelligence than what pretraining gives LLMs, and thus often possess fluid intelligence that’s stronger than their crystallized intelligence for the same topic, while LLMs always have strong crystallized intelligence for the topics where they have strong fluid intelligence (those covered by RLVR training).
General “smartness” might significantly improve from replacing pretraining with something more effective, applying RLVR much more broadly, or figuring out how to automatically train on post-deployment data (bringing it in-distribution for RLVR levels of in-context learning capability). None of these things are currently assured to be on track to happen quickly. The most straightforward step in this direction is automation of routine AI R&D, which helps with applying RLVR during training to many more topics than is humanly feasible, expanding the areas covered by strong fluid intelligence. But even that plausibly runs into a wall: jaggedness in what kinds of things can be trained with RLVR, and the continued absence of post-deployment training (at RLVR levels of capability).
So I think it remains plausible that “powerful AI” (that fully automates civilization) isn’t near. But the scope of topics where AI is smart will also increase in the next few years, and the rising tide of pretraining scale will improve the fallback level of smartness outside these topics (or for sufficiently novel constructions within them). It remains to be seen whether this is sufficient to overcome the jaggedness of RLVR, either through sufficient competence at gluing narrow capabilities together (at pretraining levels of fluid intelligence), or through LLMs automatically applying RLVR to themselves to fix gaps in RLVR-level capabilities for novel topics as soon as they come up. Failing that, more breakthroughs may be required, which could take a hard-to-predict amount of time, making 10-20 years to “powerful AI” possible despite the current speed of progress.
Hard disagree. Pre-training is foundational for LLM intelligence. It’s where the key pieces of it come from.
RL is less about “teach the old AI new tricks” and more about “teach the old AI to do the old tricks well”. Pre-training teaches the components of intelligence, but wires them together in awkward and often maladaptive ways. Optimal next-token prediction != optimal reasoning; there’s a lot of overlap, but the tails come apart hard. RL passes pivot the training objective: they correct some of the mismatch and put the existing pieces together into a shape that functions better.
What people struggle to grasp is that “intelligence” and “being able to execute on complex long-term plans” are two fairly separate dimensions. And modern AIs get only one of them from their training regime.
There are parts of humanlike intelligence that pre-training fails to teach LLMs, and long-term agentic behavior is one of them, as are things like “commonsense physics” and spatial reasoning.
RL is one way to improve on that. RLVR, by its very nature, encourages behaviors that lead to task completion: decomposition, self-verification, long-term coherence, metacognition and metaknowledge (awareness of what methods should be used, what skills are reliable and what aren’t). All of those are good for short-term tasks, but good^2 for long-term tasks specifically. Going from 95% to 99% per-step reliability hits different on 50-step tasks than it does on 5-step tasks. Some of those aspects apply across tasks and some are task-specific.
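To make the per-step reliability point concrete, here’s a minimal back-of-the-envelope sketch, assuming each step succeeds independently with probability p (a simplification; real task steps are correlated, and the `task_success` helper is just illustrative):

```python
# Back-of-the-envelope: chance an n-step task succeeds end to end,
# assuming independent per-step success with probability p.
# (A simplifying assumption, not a model of real agentic tasks.)

def task_success(p: float, n_steps: int) -> float:
    """Probability that all n_steps succeed at per-step reliability p."""
    return p ** n_steps

for n in (5, 50):
    for p in (0.95, 0.99):
        print(f"{n:>2}-step task, p={p}: {task_success(p, n):.1%}")

# Output:
#  5-step task, p=0.95: 77.4%
#  5-step task, p=0.99: 95.1%
# 50-step task, p=0.95: 7.7%
# 50-step task, p=0.99: 60.5%
```

On a 5-step task the gain is modest; on a 50-step task it’s the difference between almost always failing and usually succeeding.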