It will not meaningfully generalize beyond domains with easy verification.
I think most of software engineering and mathematics problems (two key components of AI development) are easy to verify. I agree with some of your point of how long term agency doesn’t seem to be improving, but I expect that we can build very competent software engineers with the current paradigms.
After this, I expect AI progress to move noticeably faster. The problems you point out are real, but speeding up our development speed might make them surmountable in the near term.
I think most of software engineering and mathematics problems (two key components of AI development) are easy to verify
I disagree.
In math, what’s easy to verify is whether A is tautologous to B. That is: whether two externally-provided fully-formally-defined statements are equivalent.
“This conjecture is useful” or “this subproblem is worth focusing on” or “this framework sheds light on the fundamental structure of reality” are not easy-to-verify. So without transfer learning, the LLMs won’t generate useful conjectures/paradigms on their own.
Nor be able to figure out the nearest-useful conjecture, or solve open-ended problems like “figure out the best computationally tractable algorithm approximating algorithm X”.
They’d be good for faster conjecture evaluation once you already have the conjecture fully formalized, but that’s about it. This would still be big if they can tackle e. g. the Millennium Problems or problems that take mathematical fields years… But I’m skeptical it’d scale this well.
In programming, the situation is similar: the problem is well-specified if it can be translated into a fully formal math problem (see the Curry-Howard correspondence). “Generate code that passes these unit tests”, “translate this algorithm from this language to that one”, “implement the formal algorithm from this paper”, etc.
“Generate the program that Does What I Mean”, with the spec provided using natural language, isn’t an easily verifiable problem. You need to model the user, and also have a solid grasp of the real world in which the program will be embedded, in order to predict all possible edge cases and failure modes that’d make what happens diverge from the user’s model. The program needs to behave as intended, not as formally specified.
Moreover, “passes unit tests” is a pretty gameable metric; see this paper. If the program that passes them isn’t easy to find, LLMs start eagerly Goodharting to these tests...
And this is especially bad if the intent is to get autonomous research assistants who’d be able to independently generate vast codebases and evaluate experimental hypotheses. You wouldn’t be able to trust their purported results at all, in either direction. Got a positive result? Maybe the AI didn’t want to disappoint you and somehow rigged the test to succeed (and hid it very well, so you wouldn’t find it and get upset). Got a negative result? Maybe the AI screwed something obvious up/totally misunderstood you.
(@Daniel Kokotajlo, what’s your opinion on this? This indirectly ties-in with my skepticism regarding agenty behavior scaling. It’s not just something LLMs have been bad at, it’s also something that’s very nontrivial to correctly reinforce. In some way, this is just another manifestation of the core alignment challenges.)
My opinion is that long-horizon training will indirectly teach a bunch of hard-to-measure skills. Like, if you have a list of simple lemmas and you train your AI to prove them, it probably won’t develop a good sense of what kinds of lemmas are interesting or useful (except insofar as your list was already curated in that way + it correctly notices patterns). But if you train your AI to prove crazy-difficult theorems that take many many months of effort and exploration to achieve, then in the course of learning to do that effort and exploration well, it’ll have to learn a sense of what kinds of lemmas are useful for what other kinds and so forth.
Sure, that’s a plausible hypothesis… But I think there’s a catch-22 here.
RL-on-CoTs is only computationally tractable if the correct trajectories are already close to the “modal” trajectory. Otherwise, you get the same exponential explosion the DeepSeek team got with trying out MCTS.
So how do you train your models to solve the crazy-difficult theorems to begin with, if they start out so bad at subproblem-picking that they assign near-zero probability to all the successful reasoning trajectories? My intuition is that you just run into the exponential explosion again, needing to sample a million million-token trajectories to get one good answer, and so the training just doesn’t get off the ground.
RL-on-CoTs is only computationally tractable if the correct trajectories are already close to the “modal” trajectory.
Conclusions that should be impossible to see for a model at a given level of capability are still not far from the surface, as language monkeys paper shows (Figure 3, see how well even Pythia-70M with an ‘M’ starts doing on MATH at pass@10K). So a collection of progressively more difficult verifiable questions can probably stretch whatever wisdom a model implicitly holds from pretraining implausibly far.
My guess is that they’ll incrementally expand horizon lengths over the next few years. Like, every six months, they’ll double the size of their dataset of problems and double the average ‘length’ of the problems in the dataset. So the models will have a curriculum of sorts, that starts with the easier shorter-horizon things and then scales up.
Prediction: This will somehow not work. Probably they’d just be unable to handle it past a given level of “inferential horizon”.
Reasoning: If this were to work, this would necessarily involve the SGD somehow solving the “inability to deal with a lot of interconnected layered complexity in the context window” problem. On my current model, this problem is fundamental to how LLMs work, due to their internal representation of any given problem being the overlap of a bunch of learned templates (crystallized intelligence), rather than a “compacted” first-principles mental object (fluid intelligence). For the SGD to teach them to handle this, the sampled trajectories would need to involve the LLM somehow stumbling onto completely foreign internal representations. I expect these trajectories are either sitting at ~0 probability of being picked, or don’t exist at all.
A curriculum won’t solve it, either, because the type signature of the current paradigm of RL-on-CoTs training is “eliciting already-present latent capabilities” (see the LIMO paper), not “rewriting their fundamental functionality”.
I think most of software engineering and mathematics problems (two key components of AI development) are easy to verify. I agree with some of your point of how long term agency doesn’t seem to be improving, but I expect that we can build very competent software engineers with the current paradigms.
After this, I expect AI progress to move noticeably faster. The problems you point out are real, but speeding up our development speed might make them surmountable in the near term.
I disagree.
In math, what’s easy to verify is whether A is tautologous to B. That is: whether two externally-provided fully-formally-defined statements are equivalent.
“This conjecture is useful” or “this subproblem is worth focusing on” or “this framework sheds light on the fundamental structure of reality” are not easy-to-verify. So without transfer learning, the LLMs won’t generate useful conjectures/paradigms on their own.
Nor be able to figure out the nearest-useful conjecture, or solve open-ended problems like “figure out the best computationally tractable algorithm approximating algorithm X”.
They’d be good for faster conjecture evaluation once you already have the conjecture fully formalized, but that’s about it. This would still be big if they can tackle e. g. the Millennium Problems or problems that take mathematical fields years… But I’m skeptical it’d scale this well.
In programming, the situation is similar: the problem is well-specified if it can be translated into a fully formal math problem (see the Curry-Howard correspondence). “Generate code that passes these unit tests”, “translate this algorithm from this language to that one”, “implement the formal algorithm from this paper”, etc.
“Generate the program that Does What I Mean”, with the spec provided using natural language, isn’t an easily verifiable problem. You need to model the user, and also have a solid grasp of the real world in which the program will be embedded, in order to predict all possible edge cases and failure modes that’d make what happens diverge from the user’s model. The program needs to behave as intended, not as formally specified.
Moreover, “passes unit tests” is a pretty gameable metric; see this paper. If the program that passes them isn’t easy to find, LLMs start eagerly Goodharting to these tests...
… If not outright rewriting them with “assert!(true)”, as people have been complaining about Claude Code doing.
And this is especially bad if the intent is to get autonomous research assistants who’d be able to independently generate vast codebases and evaluate experimental hypotheses. You wouldn’t be able to trust their purported results at all, in either direction. Got a positive result? Maybe the AI didn’t want to disappoint you and somehow rigged the test to succeed (and hid it very well, so you wouldn’t find it and get upset). Got a negative result? Maybe the AI screwed something obvious up/totally misunderstood you.
(@Daniel Kokotajlo, what’s your opinion on this? This indirectly ties-in with my skepticism regarding agenty behavior scaling. It’s not just something LLMs have been bad at, it’s also something that’s very nontrivial to correctly reinforce. In some way, this is just another manifestation of the core alignment challenges.)
My opinion is that long-horizon training will indirectly teach a bunch of hard-to-measure skills. Like, if you have a list of simple lemmas and you train your AI to prove them, it probably won’t develop a good sense of what kinds of lemmas are interesting or useful (except insofar as your list was already curated in that way + it correctly notices patterns). But if you train your AI to prove crazy-difficult theorems that take many many months of effort and exploration to achieve, then in the course of learning to do that effort and exploration well, it’ll have to learn a sense of what kinds of lemmas are useful for what other kinds and so forth.
Sure, that’s a plausible hypothesis… But I think there’s a catch-22 here.
RL-on-CoTs is only computationally tractable if the correct trajectories are already close to the “modal” trajectory. Otherwise, you get the same exponential explosion the DeepSeek team got with trying out MCTS.
So how do you train your models to solve the crazy-difficult theorems to begin with, if they start out so bad at subproblem-picking that they assign near-zero probability to all the successful reasoning trajectories? My intuition is that you just run into the exponential explosion again, needing to sample a million million-token trajectories to get one good answer, and so the training just doesn’t get off the ground.
Conclusions that should be impossible to see for a model at a given level of capability are still not far from the surface, as language monkeys paper shows (Figure 3, see how well even Pythia-70M with an ‘M’ starts doing on MATH at pass@10K). So a collection of progressively more difficult verifiable questions can probably stretch whatever wisdom a model implicitly holds from pretraining implausibly far.
My guess is that they’ll incrementally expand horizon lengths over the next few years. Like, every six months, they’ll double the size of their dataset of problems and double the average ‘length’ of the problems in the dataset. So the models will have a curriculum of sorts, that starts with the easier shorter-horizon things and then scales up.
Prediction: This will somehow not work. Probably they’d just be unable to handle it past a given level of “inferential horizon”.
Reasoning: If this were to work, this would necessarily involve the SGD somehow solving the “inability to deal with a lot of interconnected layered complexity in the context window” problem. On my current model, this problem is fundamental to how LLMs work, due to their internal representation of any given problem being the overlap of a bunch of learned templates (crystallized intelligence), rather than a “compacted” first-principles mental object (fluid intelligence). For the SGD to teach them to handle this, the sampled trajectories would need to involve the LLM somehow stumbling onto completely foreign internal representations. I expect these trajectories are either sitting at ~0 probability of being picked, or don’t exist at all.
A curriculum won’t solve it, either, because the type signature of the current paradigm of RL-on-CoTs training is “eliciting already-present latent capabilities” (see the LIMO paper), not “rewriting their fundamental functionality”.