I think it is probably more a result of wanting to release a new, better model as often as possible. Last year we saw AI companies cluster releases together, as if they were rushing to put something out whenever anyone else did, so that the media frenzy wasn’t exclusively about their competitor. A problem with that approach is that you have to either hold something good back while waiting for competitors to release, or push something out prematurely. Everyone wants to get the last word, putting out the model that’ll be perceived as best for the next couple of months or so. That’s tough to pull off, so an alternative is to just try to release as frequently as possible, rather than trying to time things cleverly. Then (if you keep pace capability-wise) your models will be the best most often, simply because you release more. The problem with that strategy is that you’ll be using more compute on multiple parallel training runs (doing a big post-training run at the same time as pre-training the next big thing). A more focused approach with fewer parallel training runs and a slower release cycle can utilize limited compute more effectively. The main question (for the race) becomes: who is managing all these trade-offs most effectively?
I am as cynical about Sam Altman as anyone, but he does constantly say that he believes iterative deployment is important so that the public can see the frontier of AI as it emerges and so that society can adapt to its development. It seems plausible that he, and a lot of other people at OpenAI, do actually believe this.
GPT-5.5 is at the beginning of RLVR scaling, and future versions with the same pretrain will get considerably stronger in the coming months.
With GPT-5.x releases, OpenAI is taking advantage of RLVR scaling to blur the jumps in capability between different pretrains. GPT-5.1 ($1.25/$10 per 1M input/output tokens, knowledge cutoff 30 Sep 2024, context length 400K tokens) is followed by a slightly stronger GPT-5.2 ($1.75/$14, 31 Aug 2025, 400K), which is likely a better pretrain and a bigger model. Then GPT-5.3-Codex ($1.75/$14, 31 Aug 2025, 400K) is almost certainly the same pretrain, and GPT-5.4 ($2.5/$15, 31 Aug 2025, 1050K) is notably stronger than GPT-5.2, but still very likely the same pretrain (the change in pricing might be due to the change in context length). And now GPT-5.5 ($5/$30, 1 Dec 2025, 1050K) is a new bigger pretrain, stronger than GPT-5.4.
The strategy of “iterative deployment” seems to be about using RLVR scaling to release each pretrain with a little RLVR first, and then to scale RLVR for the same pretrain in subsequent releases in order to almost match the level of capabilities that will be achieved with a stronger pretrain that only uses a little RLVR, which is to be released after that. Thus GPT-5.1 is highly RLVRed, and it’s followed by GPT-5.2, which is a different pretrain that’s RLVRed only as much as necessary to slightly overtake GPT-5.1 in capabilities. And then GPT-5.4 is again a highly RLVRed model on the same pretrain as GPT-5.2, which makes it almost as strong as GPT-5.5, the first release of a considerably stronger pretrain that’s only RLVRed as much as necessary to overtake GPT-5.4.
This process allows OpenAI to keep releasing ever larger flagship models while mostly avoiding stark jumps in capability. For GPT-5.5 (which is the first RLVRed Opus-class OpenAI release), this suggests that it’s at the beginning of RLVR scaling for its pretrain, and thus there is still considerable room for its capabilities to grow. GPT-5.6 will be using the same pretrain with more RLVR, and so on until OpenAI is ready to release a bigger pretrain (their Mythos-class model), which will be only slightly stronger than the highly RLVRed version of GPT-5.5’s pretrain that precedes it.
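Since the argument leans partly on pricing, here is a minimal sketch that computes the release-to-release price ratios from the figures quoted above. The helper name `price_steps` is made up for illustration, and price is at best a loose proxy for model size (as noted, the GPT-5.4 change might just reflect the longer context).

```python
# Prices in USD per 1M input / output tokens, exactly as quoted in the post.
RELEASES = [
    ("GPT-5.1", 1.25, 10.0),
    ("GPT-5.2", 1.75, 14.0),
    ("GPT-5.3-Codex", 1.75, 14.0),
    ("GPT-5.4", 2.5, 15.0),
    ("GPT-5.5", 5.0, 30.0),
]


def price_steps(releases):
    """Return (name, input_ratio, output_ratio) for each release vs. the previous one."""
    steps = []
    for (_, prev_in, prev_out), (name, cur_in, cur_out) in zip(releases, releases[1:]):
        steps.append((name, cur_in / prev_in, cur_out / prev_out))
    return steps


for name, in_ratio, out_ratio in price_steps(RELEASES):
    print(f"{name}: input x{in_ratio:.2f}, output x{out_ratio:.2f}")
```

On these numbers the two clean steps are GPT-5.1 to GPT-5.2 (1.4x on both input and output) and GPT-5.4 to GPT-5.5 (2x on both), while GPT-5.3-Codex is flat and GPT-5.4 moves input and output prices by different factors.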
These results seem to support the hypothesis that 5.5 is a new pretrain, though what’s happening with 5.3 and 5.4 is a bit unclear.
[edit: code available here, credit belongs to @Daniel Paleka for the idea]
Interesting, thank you! This strongly supports the hypothesis that 5.1/5.0/4.1/4o (and therefore o3, because what else could it be) are using the same original pretrain, with 5.1/5.0/4.1 possibly using the same mid-training refresh. For 5.2, there isn’t an accuracy dip in late 2023, so it’s still probably a different pretrain than 5.1 (the original pretrain for 5.2 is probably using the same natural text data that went into the mid-training refresh for 5.1/5.0/4.1), but the mid-training for 5.2 is seriously broken (though its baseline for 2022-2023 is also worse than for the other models, so it could be a post-training or elicitation issue). Plausibly they re-did or otherwise updated at least mid-training for 5.3 and then also for 5.4 (given the peak for 5.4 in early 2025, which is absent for 5.3).
(I don’t think 5.5 being a new pretrain was ever in doubt. It’s 5.2 being different from 5.1, and 5.4 being the same as 5.2, that’s more tenuous. But also, for the purposes of this thread, it shouldn’t matter if a pretrain is refreshed with slightly newer data, either through mid-training or by re-doing everything from scratch. For the way RLVR scaling interacts with “iterative deployment”, keeping the same model shape/architecture and pretraining compute/process has about the same effect as keeping literally the same pretrain. Of course, evidence that literally the same pretrain is being used is evidence that its shape/architecture and pretraining compute/process are also the same.)
I don’t know if you are referencing it, but your point about 5.2 mid-training failing and being fixed in 5.3/5.4 lines up pretty much exactly with what the paywalled article from The Information, “OpenAI Developing ‘Garlic’ Model to Counter Google’s Recent Gains”, said back in December. It discussed “Shallotpeat” as a new pretraining method (5.2) that had some issues and so was being refined into a model, “Garlic” (5.3/5.4), that could properly take on Google, with the lessons learned being used as they started creation of a newer, bigger model (5.5).
5.2 being functionally complete, with 5.3 in post-training and pretraining in progress for 5.5, seems to match up very nicely with the article’s Dec 2 post date and the subsequent release dates of 5.2 through 5.5. It is a bit muddied since OAI’s Twitter “vagueposted” many garlic-themed memes around the 5.2 release, but the timeline of 5.2 being Garlic (still in post-training on Dec 2nd, released Dec 11th) doesn’t make any sense, while Shallotpeat being complete before Dec 2 and undergoing safety/deployment testing through Dec 10th seems much more reasonable.
If you don’t have access to the article, you can still get all the relevant snippets reposted on Twitter, or resummarized by other outlets. There is also an even earlier article from The Information that has the first mentions of Shallotpeat (which was itself supposedly a refinement of the first, possibly completely failed, internal pretrain attempt).
Relatedly, on the Nov 11th Dwarkesh podcast, Ilya made a reference to the rumors of a new pretraining method used for Gemini 3.0, which is likely similar to what was used in the Shallotpeat+ base models.
There’s an archived version of the Garlic article. I was only referencing bhalstead’s plot itself, how the GPT-5.2 line is clearly showing that something is wrong with learning (but again, maybe just elicitation) of the data from between mid-2024 and mid-2025. The dip in accuracy in this time window is much worse than with what’s clearly the mid-training for 5.1/5.0/4.1, which covers the data from the first half of 2024 for those models.
(The issue with elicitation might be analogous to how Gemini 3 Pro insists it’s impossible that it’s currently 2026, and argues that anything that takes place in 2026 must be fictional. Maybe if you ask a model that’s stuck in mid-2024 about someone who died in 2025, it’ll say that they didn’t die even if it knows when they did. Even if they die in the future, it doesn’t follow that they died yet in reality, by mid-2024, and it’s always mid-2024 for such a model even as after mid-training it knows some facts from 2025. This could be the dynamic under the surface that impacts elicitation, even if the model doesn’t visibly argue about 2025 not being real.)
Pretty sure this idea originated from @Daniel Paleka here! Giving him some credit with this comment.
That’s a clever idea!
Can someone explain why many models have slowly-decaying lines? I would have expected sharp drop-offs, with knowledge falling to zero after training data ends. In what situation does a model (like GPT-5.2) fall from 0.5 to sub 0.1 accuracy, and stay there for seemingly half a year?
I’m also surprised that old and obsolete GPT-4x models seem to be broadly outcompeting the GPT-5x line. Am I missing something? Are refusals being counted as failures?
I suspect a few different variables are getting mixed together—a model’s raw intelligence, its willingness to provide a specific date, its willingness to confabulate when it doesn’t know, etc.
GPT 5.2 is dropping before its knowledge cutoff.
The decays are probably because there is less training data about recent deaths, and because pre-training may have started before the knowledge cutoff.
Older models having better rote memorization on slightly obscure facts isn’t that surprising imo. It is not something that has a lot of optimization pressure.
Having multiple variables mixed together doesn’t seem like a big issue for detecting ancestry. False positives will still be highly unlikely, since different pretrains will probably have different “forgetting curves”.
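To make the “forgetting curve” comparison concrete, here is a minimal sketch of how one might bucket graded answers by event month and compare two models’ curves. This is not the script behind the plot: the function names, the distance measure, and the toy data are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def monthly_accuracy(records):
    """Bucket graded answers by event month.

    records: iterable of (month, is_correct), with month as a 'YYYY-MM' string.
    Returns {month: fraction answered correctly}, sorted by month.
    """
    buckets = defaultdict(list)
    for month, is_correct in records:
        buckets[month].append(1.0 if is_correct else 0.0)
    return {month: mean(scores) for month, scores in sorted(buckets.items())}


def curve_distance(curve_a, curve_b):
    """Mean absolute accuracy gap over the months both curves cover."""
    shared = sorted(set(curve_a) & set(curve_b))
    if not shared:
        raise ValueError("curves have no months in common")
    return mean(abs(curve_a[m] - curve_b[m]) for m in shared)


# Made-up toy data, NOT real model results.
model_a = [("2024-05", True), ("2024-05", True), ("2024-06", True),
           ("2024-06", False), ("2024-07", False), ("2024-07", False)]
model_b = list(model_a)  # identical curve, standing in for "same pretrain"
model_c = [("2024-05", True), ("2024-05", False), ("2024-06", True),
           ("2024-06", True), ("2024-07", True), ("2024-07", False)]

curve_a = monthly_accuracy(model_a)
curve_b = monthly_accuracy(model_b)
curve_c = monthly_accuracy(model_c)

print(curve_distance(curve_a, curve_b))  # small gap: curves match
print(curve_distance(curve_a, curve_c))  # larger gap: curves differ
```

A small distance between two models would be (weak) evidence of shared ancestry. As the comments above point out, refusals and willingness to guess also move these curves, so a real comparison would need to control for them.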
Very cool graph. Is the script you used publicly available?
added a link
How do you reconcile this claim of OA doing regular new pretrains on larger models with the general consensus and leaks that they are not new pretrains and only GPT-5.5 is the first new true pretrain, and the general pattern of consistency between them with some regressions, like one would expect of doing a lot more RL on the same basis and consistent with the previous history of pushing the 4o-series very far? I’ll just say that this is the first I’ve heard it suggested that GPT-5.2 is a new pretrain.
OpenAI didn’t do new pretrains that became flagship models for a surprisingly long time (all their flagship models from GPT-4o to GPT-5.1 being based on the same pretrain, 1.5 years of releases), so it became a notable thing to talk about. But GPT-5.5 was released 2 years after GPT-4o, so in preparation for it there would’ve inevitably been significant changes in architecture, which would’ve been tested on models smaller than GPT-5.5 first. And GPT-4o was targeting A100s and H100s (640 GB of HBM per server), while GPT-5.2 was released when they might’ve had enough B200s (1400 GB of HBM) to run it.
So it’s the combination of plausible availability of enough B200s to make a larger model practical, increased token price, changed cutoff date, the cost of pretraining being more trivial in 2025 than it was at the end of 2023, new algorithmic improvements that motivate replacing the model, the need for a test run for a new model architecture in preparation for GPT-5.5, and the plausibility that their ability to usefully scale RLVR on top of GPT-4o’s base model had finally run out at about GPT-5.1, while “iterative deployment” demanded bridge releases between GPT-5.1 and GPT-5.5. These arguments are weak individually, and only some of them gesture at particular timing, but when considered all together, they suggest significant pretraining activity would’ve been happening around that time. There are specific rumors around Shallotpeat [1] and Garlic [2] being about refinement of the pretraining process, though it’s unclear if they have anything to do with GPT-5.2 or GPT-5.4, or are just steps towards making GPT-5.5’s pretrain work. Though the separate codename Spud [3] (which is confirmed to be GPT-5.5) suggests that Garlic (which supposedly already resolved the problems in pretraining) is not the same thing as GPT-5.5.
Plausibly Shallotpeat and Garlic are not literally GPT-5.2 and GPT-5.3, because resolving pretraining problems in preparation for GPT-5.5, when it was this close to the final run, probably required working with models of the same size. But GPT-5.2/GPT-5.3 might’ve been smaller models with the same architecture and pretraining process as Shallotpeat/Garlic, used to adjust the post-training process for when it needed to happen for GPT-5.5 itself. Finally, there’s now bhalstead’s plot (and Daniel Paleka’s earlier plot from Jan 2026) that suggests GPT-5.2 is different from GPT-5.1.
[1] “In developing that model, OpenAI aims to fix bugs it has encountered in the pretraining process, according to a person with knowledge of the model.”
[2] “In developing Garlic, Chen said OpenAI solved key problems it had been having in pretraining, including improving upon its “previous best” and “much larger” pretrained model, the forgettable GPT-4.5”
[3] “The company has finished pretraining “Spud,” Altman said in the memo.”
How interchangeable are the gains from RLVR and pre-training? My understanding was that additional pre-training yields improvements on different benchmarks than additional RLVR. If that’s true, you should be able to get an additional point of evidence of this hypothesis (in addition to the serving cost of models), by also looking at which benchmarks new releases show the biggest improvements on.
Any individual benchmark can be improved a lot with post-training (even for benchmarks that don’t benefit from RLVR as much, by improving benchmark-relevant data). Better pretraining is the “rising tide” for things not directly addressed by post-training data, and it also makes the outcomes of post-training for the same tasks better. But if frontier pretrains are not too far apart, post-training might be able to interpolate between them well enough that it’s hard to pinpoint the distinction.
Do we have any information on the cost to train 5.5 -- it seems no leaks yet?
Since it’s a new pretrain, GPT-5.5 seems like an important datapoint to see where we are relative to the cost trend that you usefully estimated as “an increase in cost of 2.35x per year”.
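For a sense of scale, here is a tiny sketch of what a 2.35x-per-year trend compounds to over a given gap. The 2-year gap is taken from the thread’s “GPT-5.5 was released 2 years after GPT-4o”, and `cost_multiplier` is a hypothetical helper. It only shows what the trend would predict, and says nothing about what GPT-5.5 actually cost.

```python
def cost_multiplier(years, annual_growth=2.35):
    """Compound growth factor after `years` at `annual_growth` per year."""
    return annual_growth ** years


gap_years = 2.0  # assumed gap between the GPT-4o and GPT-5.5 pretrains
print(f"{cost_multiplier(gap_years):.2f}x")
```

So a pretrain that stays on-trend two years later would cost roughly 5.5x as much. A leaked figure well above or below that would tell us whether OpenAI is ahead of or behind the trend.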
I saw a snippet of an interview with Greg Brockman in which he said something like “I think of Spud as a new pre-train”. That “I think of...” made me kind of suspicious. Either Brockman phrased things badly, or Spud isn’t actually a new pre-train and that’s why Brockman used that phrasing. If it’s not actually a new pre-train, then my guess is that they did a massive amount of RLVR. If the amount of RLVR is at pre-training scale, then it would justify Brockman’s phrasing.
We don’t know much about how long it takes for a new base model to get “up to speed” in post-training these days. The existence of Mythos, which was already much more capable than Opus 4.6 by late February, suggests to me that this time window has compressed compared to a year ago.
Note: DeepSeek-V4’s final checkpoint wasn’t even RL’d at all. They did on-policy distillation from RL-trained specialist checkpoints to produce the final model. 5.5 could have similarly been (partially) distilled from advanced post-trained checkpoints of 5.4.
My argument doesn’t involve saying they didn’t have enough time to RLVR it a lot yet. It’s about evidence for what “iterative deployment” means given all the GPT-5.x releases, and what it then suggests about GPT-5.5 and subsequent releases.
In principle, GPT-5.5 could be RLVRed GPT-4.5, and in principle OpenAI had 100K GB200 NVL72s since maybe summer 2025. But bhalstead’s plot suggests a knowledge cutoff at the very end of 2024 (in the original pretrain, with no significantly later mid-training), which is likely too late for GPT-4.5, and the GB200 NVL72s probably weren’t ready for a while in sufficient numbers, at least for efficiently inferencing large models.
Another possibility is a new pretrain made sometime in 2025 on H100/H200/B200, which wouldn’t need to wait for GB200 NVL72s, and then they had maybe at least 6 months with enough GB200 NVL72s to experiment with RLVRing it, even if not yet enough to deploy it as a frontier model. The datapoint of apparently still doing RL for GPT-5.5 in Mar 2025 isn’t evidence that this work only started very recently, as last touches of RL would happen before a release in any case.
Mythos doesn’t obviously mean it takes little time to RL a large model. It could’ve been pretrained at any point in 2025 and RLed together with Opus 4.5 or shortly after, once Trainium 2 Ultra racks (or maybe some TPUs) were available for that.
Sure, but RLVR still needs to happen for something, even if not for the final model. If it only happens for smaller models, where it’s more stable and doesn’t need good/scarce/unfamiliar hardware, the results after OPD might be notably worse than if it’s done for same-sized models. The DeepSeek-V4 paper doesn’t disclose the nature of the teacher models, or how the quality of the result depends on them.
The possibility of OPD from GPT-5.4 just makes very fast post-training of GPT-5.5 more plausible than if it needed to be RLVRed directly, but probably resulting in inferior quality compared to what RLVRing of models based on GPT-5.5′s pretrain could achieve, either directly or via OPD from multiple RLVRed teachers also based on GPT-5.5′s pretrain.
Good analysis; this seems very likely to me as well. It seems good to think about the strategic reasoning behind this. I hadn’t previously considered in depth the dynamic of each new pretrain release being just slightly better than the previous pretrain + X RLVR. This does seem to imply that OpenAI is intentionally avoiding large jumps in capability with this iterative deployment strategy. Why is this important to them? The first idea I had that seems somewhat likely is that large capability jumps generate press and discourse in the mainstream (see Mythos). This strategy would dodge that press, which, if true, seems like an important signal about their priorities regarding messaging and narrative (Conjecture: is it to their benefit for people to believe AI has “hit a wall”? Maybe they want to avoid the specific kind of “doomer” rhetoric that seems to arise whenever large capability jumps happen?). Looking into how the various labs are attempting to shape the narrative seems important, even if it inherently relies on a good bit of conjecture about internal motivations.
I think it is probably more a result of wanting to release a new, better model as often as possible. Last year, we saw AI companies cluster releases together, as if they were rushing to put something out whenever anyone else did, so that the media frenzy wasn’t exclusively about their competitor. A problem with that approach is that you have to either hold something good back while waiting for competitors to release, or push something out prematurely. Everyone wants to get the last word, putting out the model that’ll be perceived as best for the next couple of months or so. That’s tough to pull off, so an alternative is to just try to release as frequently as possible, rather than trying to time things cleverly. Then (if you keep pace capability-wise) your models will be the best most often, simply because you release more. The problem with that strategy is that you’ll be using more compute on multiple simultaneous training runs (doing a big post-training run at the same time as pre-training the next big thing). A more focused approach with fewer parallel training runs and a slower release cycle can utilize limited compute more effectively. The main question (for the race) becomes: who is managing all these trade-offs most effectively?
I am as cynical about Sam Altman as anyone, but he does constantly say that he believes iterative deployment is important so that the public can see the frontier of AI as it emerges and so that society can adapt to its development. It seems plausible that he, and a lot of other people at OpenAI, do actually believe this.