We use reasoning models with more inference-time compute to generate better data; that data trains better base models, which reproduce similar capability levels more efficiently, with less compute; and those base models are then used to build better reasoning models.
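To make the shape of that loop concrete, here is a purely schematic sketch with contentless stubs standing in for each stage; every function name is a hypothetical placeholder, not any lab's actual pipeline or API.

```python
# Schematic only: stubs stand in for each stage of the loop described above.

def rl_post_train(base_model, verifiable_tasks):
    """Stub: RL post-training on verifiable tasks yields a reasoning model."""
    return {"base": base_model, "reasoning": True}

def generate_traces(reasoning_model, problems):
    """Stub: spend more inference-time compute to produce better training data."""
    return [f"long trace for {p}" for p in problems]

def pretrain(data):
    """Stub: pretrain a new base model on the better data, reproducing
    similar capability with less compute."""
    return {"trained_on": len(data), "reasoning": False}

def improvement_loop(base_model, problems, verifiable_tasks, rounds=3):
    model = base_model
    for _ in range(rounds):
        reasoning_model = rl_post_train(model, verifiable_tasks)  # better reasoning model
        data = generate_traces(reasoning_model, problems)         # better data
        model = pretrain(data)                                    # better base model
    return model
```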
This kind of thing isn’t known to meaningfully work as something that could be done at pretraining scale. It also doesn’t seem plausible without additional breakthroughs, given the nature and size of verifiable task datasets: models like o3-mini get ~matched on benchmarks by post-training on datasets of only 15K-120K problems. All the straight lines for reasoning models so far only cover a small range of scaling, relying on scarce resources that might run out (verifiable problems that actually help) and algorithms untried at greater scale that might break down (in a way that’s hard to fix). So the known benefit may well remain a one-time improvement; extending it significantly (into becoming a new direction of scaling) hasn’t been demonstrated.
I think that even if it remains a one-time improvement, long reasoning training might still be sufficient for AI takeoff within a few years, driven just by pretraining scaling of the underlying base models, but that’s not the same as already believing that RL post-training itself scales very far. Most plausibly it does scale with more reasoning tokens per trace, going from the current ~50K to ~1M, but that’s separate from scaling RL training all the way to pretraining scale (and possibly further).