As I wrote in LLMs May Find It Hard to FOOM, sooner or later we’re going to need to use LLMs to generate large amounts of new synthetic training data. We already know that doing that naively, without using inference-time scaling, leads to model collapse. The interesting question is whether some kind of inference-time-scaling approach can allow an LLM to think long and hard and generate large amounts of higher-quality training data that can be used to train a model that’s better than it is. That’s not theoretically impossible: science and mathematics are real things, and truth can be discovered with enough work; but it’s also not trivial: you can’t simply have an LLM generate 1000T tokens, train on that, and get a better LLM.
Even if all RLVR does is raise the pass@1 towards the pass@100 of the base model, if that trained model can generate enough training data to train a new base model with a similar pass@1 (before applying RLVR to it), the pass@100 of that new model will again be higher than its pass@1, and RLVR should be able to elicit that into an improved pass@1, so you’ve made forward progress. The question then becomes whether you can repeat this cycle and keep making progress, without it plateauing. At least in areas like Mathematics and Coding, where verifiable truth exists and can be discovered with enough effort, this seems plausible. AlphaGo certainly managed it (though I gather its superhuman skills also have some blind spots, corresponding to tactics it apparently never thought of, which suggests that involving humans, or at least multiple different LLMs, in the training-data generation cycle might be useful here to avoid comparable blind spots). Doing the same in other STEM topics would require your AI to be interacting with the world while generating training data, running experiments and designing and testing products; but then, that’s rather the point of having a nation of researchers in a data-center.
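To make the pass@1 versus pass@100 gap concrete, here is the standard unbiased pass@k estimator from Chen et al.’s HumanEval paper, plus a toy illustration; the specific numbers (a model that solves a problem on about 5% of samples) are made up purely for illustration, not taken from any real model.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples, drawn without replacement from n generated
    attempts of which c were correct, solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Toy illustration: a model that solves a problem on ~5% of samples
# (c = 50 correct out of n = 1000 attempts) has a pass@1 of ~0.05 but a
# pass@100 close to 1 -- that gap is the headroom RLVR tries to convert
# into a higher pass@1.
print(pass_at_k(1000, 50, 1))    # ~0.05
print(pass_at_k(1000, 50, 100))  # ~0.995
```

The bootstrap argument above is then that each new base model inherits roughly the teacher’s pass@1 while its pass@100 still sits above that, leaving something for the next round of RLVR to elicit.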