Huh, so you think o1 was the process supervision reward model, and o3 is the distilled policy model to whatever reward model o1 became? That seems to fit.
There may be a sense that they’ve ‘broken out’, and have finally crossed the last threshold of criticality, from merely cutting-edge AI work which everyone else will replicate in a few years, to takeoff
Surely other labs will replicate this too? Even the open-source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.
This means that outsiders may never see the intermediate models
Doubly so, if outsiders will just distil your model’s behaviour and bootstrap from your elevated starting point.
Inference-time search is a stimulant drug that juices your score immediately, but asymptotes hard. Quickly, you have to use a smarter model to improve the search itself, instead of doing more.
It’s worth pointing out that inference-time search seems to become harder as the verifier becomes less reliable, which means that the scaling curves we see for math and code might get much worse in other domains.
“we find that this is extremely sensitive to the quality of the verifier. If the verifier is slightly imperfect, in many realistic settings of a coding task, performance maxes out and actually starts to decrease after about 10 attempts.”—Inference Scaling fLaws
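To make the “maxes out” part concrete, here is a minimal toy simulation (my own sketch, not taken from that post, with made-up values for the solve rate and the verifier’s true/false-positive rates): you submit the first of k attempts that an imperfect verifier accepts. Accuracy saturates at a ceiling set by the verifier’s error rates rather than approaching 100%; the eventual *decrease* the post reports needs effects this toy model leaves out.

```python
import random

def submit_accuracy(k, p=0.10, tpr=0.95, fpr=0.05, trials=20_000):
    """Fraction of trials where the first verifier-accepted attempt is correct."""
    wins = 0
    for _ in range(trials):
        for _ in range(k):
            correct = random.random() < p                           # attempt is actually correct
            accepted = random.random() < (tpr if correct else fpr)  # imperfect verifier
            if accepted:
                wins += correct                                     # submit it and stop searching
                break
        # if nothing is accepted within k attempts, the trial counts as a miss
    return wins / trials

for k in (1, 2, 5, 10, 50):
    print(f"k={k:3d}  accuracy≈{submit_accuracy(k):.3f}")
# ceiling: p*tpr / (p*tpr + (1-p)*fpr) = 0.095 / 0.14 ≈ 0.68, no matter how large k gets
```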
But maybe the counterpoint is just: GPUs go brrrr.
Huh, so you think o1 was the process supervision reward model, and o3 is the distilled policy model to whatever reward model o1 became? That seems to fit.
Something like that, yes. The devil is in the details here.
Surely other labs will replicate this too? Even the open-source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.
Of course. The secrets cannot be kept, and everyone has been claiming to have cloned o1 already. There are dozens of papers purporting to have explained it. (I think DeepSeek may be the only one to have actually done so, however; at least, I don’t recall offhand any of the others observing the signature backtracking ‘wait a minute’ interjections the way DeepSeek sees organically emerging in r1.)
But scaling was never a secret. You still have to do it. And MS has $80b going into AI datacenters this year; how much does open source (or DeepSeek) have?
It’s worth pointing out that inference-time search seems to become harder as the verifier becomes less reliable, which means that the scaling curves we see for math and code might get much worse in other domains.
Yes. That’s why I felt skeptical about how generalizable the o1 approach is. It doesn’t look like a break-out to me. I don’t expect much far transfer: being really good at coding doesn’t automatically make you a genius at, say, booking plane tickets. (The o1 gains are certainly not universal, the way straightforward data/parameter-scaling gains tend to be—remember that some of the benchmarks actually got worse.) I also expect the o1 approach to tend to plateau: there is no ground truth oracle for most of these things, the way there is for Go. AlphaZero cannot reward-hack the Go simulator. Even for math, where your theorem prover can at least guarantee that a proof is valid, what’s the ground-truth oracle for ‘came up with a valuable new theorem, rather than arbitrary ugly tautological nonsense of no value’?
So that’s one of the big puzzles here for me: as interesting and impressive as o1/o3 is, I just don’t see how it justifies the apparent confidence. (Noam Brown has also commented that OA has a number of unpublished breakthroughs that would impress me if I knew, and of course, the money side seems to still be flowing without stint, despite it being much easier to cancel such investments than cause them.)
Is OA wrong, or do they know something I don’t? (For example, a distributional phase shift akin to meta-learning.) Or do they just think that these remaining issues are the sort of thing that AI-powered R&D can solve and so it is enough to just get really, really good at coding/math and they can delegate from there on out?
EDIT: Aidan McLaughlin has a good post from back in November discussing the problems with RL and why you would not expect the o1 series to lead to AGI when scaled up in sensible ways, which I largely agree with; he says:
But, despite this impressive leap, remember that o1 uses RL, RL works best in domains with clear/frequent reward, and most domains lack clear/frequent reward.
Praying for Transfer Learning: OpenAI admits that they trained o1 on domains with easy verification but hope reasoners generalize to all domains...When I talked to OpenAI’s reasoning team about this, they agreed it was an issue, but claimed that more RL would fix it. But, as we’ve seen earlier, scaling RL on a fixed model size seems to eat away at other competencies! The cost of training o3 to think for a million tokens may be a model that only does math.
On the other hand… o3 didn’t only do math, and we also know that RL systems often exhibit phase transitions in meta-learning or generalization: they overfit to narrow distributions and become superhuman experts which break if anything is even slightly different, but suddenly generalize when trained on diverse enough data, as a blessing of scale (not scale in data quantity but in data diversity), with LLMs being a major case in point, like GPT-2 → GPT-3. Hm. This was written 2024-11-20, and McLaughlin announced 2025-01-13 that he had joined OpenAI. Hm...
My personal view is that OA is probably wrong about how far the scaling curves generalize, with the caveat that even eating math and coding entirely à la AlphaZero would still be massive for AI progress, though compute constraints will bind eventually.
My own take is that the o1 approach will plateau in domains where verification is expensive, but thankfully most tasks of interest tend to be easier to verify than to solve; lots of math/coding is basically ideally suited to verification, and I expect it to be way easier to make simulators that aren’t easy to reward-hack for these domains.
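A tiny, generic illustration of that verify-vs-solve asymmetry (my own example, nothing o1-specific): checking a claimed subset-sum certificate is linear-time, while finding one by brute force is exponential, and code-plus-tests or proofs-plus-proof-checkers have the same shape.

```python
from itertools import combinations

def verify(nums, subset, target):
    """Cheap check: is `subset` drawn from `nums` and does it hit the target sum?"""
    return set(subset) <= set(nums) and sum(subset) == target

def solve(nums, target):
    """Expensive search: O(2^n) brute force over all subsets."""
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

nums, target = [3, 34, 4, 12, 5, 2], 9
certificate = solve(nums, target)      # e.g. [4, 5]
print(certificate, verify(nums, certificate, target))
```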
what’s the ground-truth oracle for ‘came up with a valuable new theorem, rather than arbitrary ugly tautological nonsense of no value’?
Eh, those tautologies are both interesting on their own and valuable training data, so that it learns how to prove statements.
I think the unmodelled variable is that they think software-only singularities are more plausible, à la this:

Or this:

https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform#z7sKoyGbgmfL5kLmY
Or do they just think that these remaining issues are the sort of thing that AI-powered R&D can solve and so it is enough to just get really, really good at coding/math and they can delegate from there on out?
He gets onboarded only on January 28th, for clarity.

My point there is that he was talking to the reasoning team pre-hiring (forget ‘onboarding’, who knows what that means), so they would be unable to tell him most things—including if they have a better reason than ‘faith in divine benevolence’ to think that ‘more RL does fix it’.

Great comment; thanks for the shoutout.

> but suddenly generalize when trained on diverse enough data, as a blessing of scale

Can you elaborate on this? I’d love to see some examples.
When I hear “distillation” I think of a model with a smaller number of parameters that’s dumber than the base model. It seems like the word “bootstrapping” is more relevant here. You start with a base LLM (like GPT-4); then do RL for reasoning, and then do a ton of inference (this gets you o1-level outputs); then you train a base model with more parameters than GPT-4 (let’s call this GPT-5) on those outputs — each single forward pass of the resulting base model is going to be smarter than a single forward pass of GPT-4. And then you do RL and more inference (this gets you o3). And rinse and repeat.
I don’t think I’m really saying anything different from what you said, but the word “distill” doesn’t seem to capture the idea that you are training a larger, smarter base model (as opposed to a smaller, faster model). This also helps explain why o3 is so expensive. It’s not just doing more forward passes; it’s a much bigger base model that you’re running with each forward pass.
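For what it’s worth, here is the shape of the loop I’m describing as a schematic; every function below is a hypothetical placeholder (none of this is a claim about OpenAI’s actual pipeline), and the only point is the order of the steps: RL plus heavy inference produces an expensive teacher, whose filtered outputs train the next, larger base model.

```python
# Every function is a hypothetical placeholder; only the loop structure matters.

def rl_finetune_for_reasoning(base_model):
    # RL for chain-of-thought reasoning on top of the current base model
    return f"reasoner({base_model})"

def generate_verified_traces(reasoner, n):
    # expensive inference plus verifier filtering -> synthetic training data
    return [f"trace_{i}_from_{reasoner}" for i in range(n)]

def pretrain_larger_base(traces, scale):
    # train the next, bigger base model on the reasoner's best outputs
    return f"base(x{scale} params, {len(traces)} traces)"

base, scale = "base(GPT-4-scale)", 1
for generation in range(3):                 # roughly the o1 -> o3 -> ... framing above
    reasoner = rl_finetune_for_reasoning(base)
    traces = generate_verified_traces(reasoner, n=10_000)
    scale *= 10                             # assumption for illustration: 10x per generation
    base = pretrain_larger_base(traces, scale)
    print(generation, reasoner, "->", base)
```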
I think maybe the most relevant chart from the Jones paper gwern cites is this one:
I don’t think o3 is a bigger model if we’re talking just raw parameter count. I am reasonably sure that o1, o3, and the future o-series models are, for the time being, all based on 4o and scale its fundamental capabilities and knowledge. I also think that 4o itself was created specifically for the test-time compute scaffolding because the previous GPT-4 versions were far too bulky. You might’ve noticed that pretty much the entirety of 2024 for the top labs was about distillation and miniaturization, where the best-performing models were all significantly smaller than the best-performing models up through the winter of 2023/2024.
In my understanding, the cost increase comes from the fact that better, tighter chains of thought enable juicing the “creative” settings like higher temperature, which expand the search space a tiny bit at a time. So the model actually searches for longer despite being more efficient than its predecessor, because it’s able to search more broadly and further outside the box instead of running in circles. Using this approach with o1 likely wouldn’t be possible because it would take several times more tokens to meaningfully explore each possibility and risk running out of the context window before it could match o3’s performance.
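For concreteness on the temperature point (this just shows the standard mechanism, not anything specific to o3): dividing the logits by a temperature T > 1 flattens the next-token distribution, so sampling wanders into lower-probability branches more often.

```python
import math

def softmax_with_temperature(logits, T):
    scaled = [x / T for x in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]
for T in (0.5, 1.0, 1.5):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
# low T concentrates mass on the argmax token; higher T spreads it over alternatives
```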
I also believe the main reason Altman is hesitant to release GPT-5 (which they already have and most likely already use internally as a teacher model for the others) is that, strictly economically speaking, it’s just a hard sell in a world where o3 exists and o4 is coming in three months, and then o5, etc. It can’t outsmart those, and unless it has speed and latency similar to 4o or o1-mini, it cannot be used for real-time tasks like conversation mode or computer vision. The remaining scope of real-world problems for which it is the best option then narrows down to something like creative writing and other strictly open-ended tasks that don’t benefit from “overthinking”. And I think the other labs ran into the same issue with Claude 3.5 Opus, Gemini 1.5 Ultra (if it was ever a thing at all), and any other trillion-scale models we were promised in 2024. The age of trillion-scale models is very likely over.
All of this sounds reasonable and it sounds like you may have insider info that I don’t. (Also, TBC I wasn’t trying to make a claim about which model is the base model for a particular o-series model, I was just naming models to be concrete, sorry to distract with that!)
Totally possible also that you’re right about more inference/search being the only reason o3 is more expensive than o1 — again it sounds like you know more than I do. But do you have a theory of why o3 is able to go on longer chains of thought without getting stuck, compared with o1? It’s possible that it’s just a grab bag of different improvements that make o3’s forward passes smarter, but to me it sounds like OAI think they’ve found a new, repeatable scaling paradigm, and I’m (perhaps over-)interpreting gwern as speculating that that paradigm does actually involve training larger models.
You noted that OAI is reluctant to release GPT-5 and is using it internally as a training model. FWIW I agree and I think this is consistent with what I’m suggesting. You develop the next-gen large parameter model (like GPT-5, say), not with the intent to actually release it, but rather to then do RL on it so it’s good at chain of thought, and then to use the best outputs of the resulting o model to make synthetic data to train the next base model with an even higher parameter count — all for internal use to push forward the frontier. None of these models ever need to be deployed to users — instead, you can distill either the best base model or the o-series model you have on hand into a smaller model that will be a bit worse (but only a bit) and way more efficient to deploy to lots of users.
The result is that the public need never see the massive internal models — we just happily use the smaller distilled versions that are surprisingly capable. But the company still has to train ever-bigger models.
Maybe what I said was already clear and I’m just repeating myself. Again you seem to be much closer to the action and I could easily be wrong, so I’m curious if you think I’m totally off-base here and in fact the companies aren’t developing massive models even for internal use to push forward the frontier.
Well, we don’t know the sizes of the models, but I do get what you are saying and agree. Distil usually means big to small. But here it means expensive to cheap (because test-time compute is expensive, and they are training a model to cheaply skip the search process and just predict the result).
In RL, iirc, they call it “Policy distillation”. And similarly “Imitation learning” or “behavioral cloning” in some problem setups. Perhaps those would be more accurate.
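As a generic sketch of what “policy distillation”/behavioral cloning means here (a toy recipe on synthetic data, not anything known about o1/o3): an expensive search procedure labels states with improved action distributions, and a cheap policy is trained by cross-entropy to predict those outputs directly, skipping the search at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, dim = 512, 4, 8
states = rng.normal(size=(n_states, dim))
W_teacher = rng.normal(size=(dim, n_actions))

def expensive_search(s):
    # stand-in for MCTS / a long chain of thought: returns an improved action distribution
    logits = s @ W_teacher + 0.1 * rng.normal(size=n_actions)
    e = np.exp(logits - logits.max())
    return e / e.sum()

targets = np.stack([expensive_search(s) for s in states])

W = np.zeros((dim, n_actions))                 # the cheap student policy
for _ in range(500):                           # gradient descent on cross-entropy to the targets
    logits = states @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    W -= 0.5 * states.T @ (probs - targets) / n_states

logits = states @ W                            # evaluate the trained student
student = np.exp(logits - logits.max(axis=1, keepdims=True))
student /= student.sum(axis=1, keepdims=True)
print("mean KL(search targets || student):",
      float(np.mean(np.sum(targets * np.log(targets / student), axis=1))))
```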
I think maybe the most relevant chart from the Jones paper gwern cites is this one:
Oh interesting. I guess you mean because it shows the gains of TTC vs model size? So you can imagine the bootstrapping from TTC → model size → TTC → and so on?
Yeah sorry to be clear totally agree we (or at least I) don’t know the sizes of models, I was just naming specific models to be concrete.
But anyway yes I think you got my point: the Jones chart illustrates (what I understood to be) gwern’s view that adding more inference/search does juice your performance to some degree, but then those gains taper off. To get to the next higher sigmoid-like curve in the Jones figure, you need to up your parameter count; and then to climb that new sigmoid, you need more search. What Jones didn’t suggest (but gwern seems to be saying) is that you can use your search-enhanced model to produce better quality synthetic data to train a larger model on.
What Jones didn’t suggest (but gwern seems to be saying) is that you can use your search-enhanced model to produce better quality synthetic data to train a larger model on.
Jones wouldn’t say that because that’s just implicit in expert iteration. In each step of expert iteration, you can in theory be training an arbitrary new model from scratch to imitate the current expert. Usually you hold fixed the CNN and simply train it some more on the finetuned board positions from the MCTS, because that is cheap, but you don’t have to. As long as it takes a board position, and it returns value estimates for each possible move, and can be trained, it works. You could train a larger or smaller CNN, a deeper or wider* CNN of the same size, a ViT, a RNN, a random forest… (See also ‘self-distillation’.) And you might want to do this if the old expert has some built-in biases, perhaps due to path dependency, and is in a bad local optimum compared to training from a blank slate with the latest best synthetic data.
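A stripped-down sketch of that point (generic AlphaZero-style expert iteration with placeholder components, not any particular system): the search-improved targets are just (position → move distribution) pairs, so the student you fit at any iteration can be a fresh model of a different size or architecture, as long as it exposes the same interface.

```python
import numpy as np

rng = np.random.default_rng(1)

def search_improved_policy(model, position):
    # placeholder for MCTS guided by `model`: returns a (noisy) improved move distribution
    logits = model(position) + rng.normal(scale=0.1, size=3)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def fit_student(make_student, dataset):
    # the student can be *any* trainable map position -> move distribution
    X = np.stack([p for p, _ in dataset])
    Y = np.stack([t for _, t in dataset])
    return make_student(X, Y)

def small_student(X, Y):
    # e.g. a small linear model early in training...
    W, *_ = np.linalg.lstsq(X, np.log(Y + 1e-9), rcond=None)
    return lambda pos: pos @ W

def wide_student(X, Y):
    # ...or a wider model with extra features later; same interface, different architecture
    Xf = np.concatenate([X, X**2], axis=1)
    W, *_ = np.linalg.lstsq(Xf, np.log(Y + 1e-9), rcond=None)
    return lambda pos: np.concatenate([pos, pos**2]) @ W

model = lambda pos: np.zeros(3)                     # start from a blank, uniform policy
for step, make_student in enumerate([small_student, small_student, wide_student]):
    positions = rng.normal(size=(256, 6))
    dataset = [(p, search_improved_policy(model, p)) for p in positions]
    model = fit_student(make_student, dataset)      # the new student becomes the next expert's prior
    print("iteration", step, "student:", make_student.__name__)
```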
You can also do this in RL in general. OpenAI, for example, kept changing the OA5 DotA2 bot architecture on the fly to tweak its observations and arches, and didn’t restart each time. It just did a net2net or warm initialization, and kept going. (Given the path dependency of on-policy RL especially, this was not ideal, and did come with a serious cost, but it worked, as they couldn’t’ve afforded to train from scratch each time. As the released emails indicate, OA5 was breaking the OA budget as it was.)
Now, it’s a great question to ask: should we do that? Doesn’t it feel like it would be optimal to schedule the growth of the NN over the course of training in a scenario like Jones 2021? Why pay the expense of the final oversized CNN right from the start when it’s still playing random moves? It seems like there ought to be some set of scaling laws for how you progressively expand the NN over the course of training before you then brutally distill it down for a final NN, where it looks like an inverted U-curve. But it’s asking too much of Jones 2021 to do that as well as everything else. (Keep in mind that Andy Jones was just one guy with no budget or lab support doing this paper all on his own over, like, a summer. In terms of bang for buck, it is one of the best DL & RL papers of the past decade, and puts the rest of us to shame.)
* for latency. An early example of this is WaveNet for synthesizing audio: it was far too slow for real-time, because it was too deep. It didn’t cost too many computations, but the computations were too iterative to allow generating 1s of audio in 1s of wallclock, which renders it completely useless for many purposes, as it will fall behind. But once you have a working WaveNet, you can then distill it into a very wide CNN which does much more parallel computation instead of sequential, and can keep up. It might cost more operations, it might have more parameters, but it’ll generate 1s of audio in <1s of wallclock time.