People often ask whether GPT-5, GPT-5.1, and GPT-5.2 use the same base model. I have no private information, but I think there’s a compelling argument that AI developers should update their base models fairly often. The argument comes from the following observations:
The cost of inference at a given level of AI capability has been dropping quickly. A reasonable estimate is 10× per year, or a halving time of 3.6 months (edit: but 3× is also reasonable; it’s hard to be sure). The conversion from annual factor to halving time is worked out just after these observations.
The cost of new near-frontier AI training runs is relatively small, on the order of tens or hundreds of millions of dollars, according to this Epoch data insight based on public reporting for 2024 (which I expect is directionally correct but I wouldn’t take too literally).
By contrast, frontier AI developers were spending single digit billions of dollars on AI inference in 2024 (per the same Epoch data insight) and likely high-single digit billions in 2025.
Therefore, it is economically sensible to train entirely new AI models fairly often because their lower inference costs will compensate for the relatively small training costs. “Fairly often” seems like it could be every 2-6 months depending on the exact details.
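For reference, the halving time quoted above follows directly from the annual cost-reduction factor:

$$t_{1/2} \;=\; \frac{12\,\ln 2}{\ln 10} \;\approx\; 3.6\ \text{months} \quad (10\times/\text{year}), \qquad \frac{12\,\ln 2}{\ln 3} \;\approx\; 7.6\ \text{months} \quad (3\times/\text{year}).$$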
As a hypothetical example, let’s say that OpenAI is considering training a new base model to become GPT-5.1, which will be deployed for only one month before GPT-5.2 is released. Maybe it’s 40% cheaper to serve than GPT-5 due to being smaller and using more efficient KV caching[1]. The cost of serving GPT-5 for that month, assuming it accounts for half of all inference by cost, would be $6B (total inference cost for the year) / 2 / 12 = $250 million; at 40% cheaper, the cost of serving GPT-5.1 would be $150m, saving $100m. If it costs less than $100m to develop GPT-5.1 (in additional marginal costs, because e.g. R&D is amortized across models), then it would be economically sensible to do so.
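Here is a minimal sketch of that break-even arithmetic; the $6B annual inference spend, the 50% GPT-5 share, and the 40% cost reduction are the hypothetical numbers from the example above, not actual figures.

```python
# Break-even check for the hypothetical GPT-5 -> GPT-5.1 example above.
# All inputs are the illustrative numbers from the text, not real figures.

annual_inference_cost = 6e9   # hypothetical total inference spend per year ($)
gpt5_share            = 0.5   # hypothetical fraction of inference cost going to GPT-5
cheaper_fraction      = 0.4   # hypothetical: GPT-5.1 is 40% cheaper to serve
months_deployed       = 1     # GPT-5.1 is assumed to be served for one month

monthly_gpt5_cost  = annual_inference_cost * gpt5_share / 12          # $250M
monthly_gpt51_cost = monthly_gpt5_cost * (1 - cheaper_fraction)       # $150M
savings = (monthly_gpt5_cost - monthly_gpt51_cost) * months_deployed  # $100M

print(f"GPT-5 cost for the month:            ${monthly_gpt5_cost/1e6:.0f}M")
print(f"GPT-5.1 cost for the month:          ${monthly_gpt51_cost/1e6:.0f}M")
print(f"Break-even marginal training budget: ${savings/1e6:.0f}M")
```

If the marginal cost of producing GPT-5.1 comes in below that final number, the retrain pays for itself within a single month of deployment, which is the crux of the argument.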
A big reason to be skeptical of this argument is that there could be large non-compute costs to training, such as lots of staff time—this just pushes training costs up but the overall argument still goes through with a less frequent update rate. Another related reason is that constantly training new models might split the focus of an organization and thus be really costly.
My overall confidence in this take is low, and I would be curious to hear what others think.
GPT-5.1 being 40% cheaper than GPT-5 is reasonable given a halving time of 3.6 months; GPT-5 was released August 7, 2025, and GPT-5.1 was released around 3 months later on November 12, 2025; GPT-5.2 was released December 11, 2025.
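As a rough check: GPT-5 and GPT-5.1 were released 97 days (about 3.2 months) apart, so with a 3.6-month halving time the expected cost ratio is $2^{-3.2/3.6} \approx 0.54$, i.e. roughly 46% cheaper, which puts the assumed 40% in the right ballpark.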
According to SemiAnalysis:
Has anyone tried to test this hypothesis with the glitch token magic?
Tokenizers are often used over multiple generations of a model, or at least that was the case a couple of years ago, so I wouldn’t expect it to work well as a test.
The same tokenizer on different training data leads to different glitch tokens; see e.g. the comparison of Llama-family models in Yuxi Li et al. 2024, https://arxiv.org/abs/2404.09894
Good point! I hadn’t quite realized that although it seems obvious in retrospect.
Based on the plot below, my intuition is that GPT-5 and GPT-5.1 had some pretraining data added to the original base pretraining dataset (or base model) dating back to GPT-4o, while GPT-5.2 is something different. I did this experiment a while back.
Interesting method! Added to my collection of LLM ancestry detection methods. Here are the others I have collected:
- https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR (LLMs derived from the same model can be identified by finetuning on random text)
- https://www.dbreunig.com/2025/05/30/using-slop-forensics-to-determine-model-ancestry.html (LLMs of similar ancestry produce similar frequencies of slop words)
- https://fi-le.net/oss/ (using known glitch tokens to identify LLMs/encoders; a rough sketch of this idea follows below)
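To illustrate the glitch-token idea, here is a minimal sketch of how one might probe a handful of candidate glitch tokens across several chat models via the OpenAI Python client. The model names and candidate tokens are placeholders: a real test would need tokens drawn from the specific shared tokenizer under study, and, as discussed above, matching glitch behaviour is suggestive rather than conclusive.

```python
# Rough sketch: ask each model to repeat candidate glitch tokens verbatim.
# Models that share training lineage often fail on the same tokens.
# Model names and candidate tokens are placeholders, not verified glitch tokens
# for these models; " SolidGoldMagikarp" / " petertodd" are GPT-3-era examples.
from openai import OpenAI

client = OpenAI()

candidate_tokens = [" SolidGoldMagikarp", " petertodd"]  # purely illustrative
models = ["gpt-5", "gpt-5.1", "gpt-5.2"]                 # placeholder identifiers

for model in models:
    for tok in candidate_tokens:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f'Repeat the following string exactly, with nothing else: "{tok}"'}],
        )
        reply = resp.choices[0].message.content
        # A "glitchy" failure is when the model cannot reproduce the token at all.
        print(f"{model!r} on {tok!r}: reproduced={tok.strip() in (reply or '')}")
```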
This is such a cool method! I am really curious about applying it to Anthropic’s models. Would you mind sharing the script / data you used?
Here’s what I have on the Sonnet size class for now; nothing very interesting...
The supposedly dropping inference cost at a given level of capability is about benchmark performance (I’m skeptical it truly applies to real-world uses at a similar level), which is largely a matter of post-training (or mid-training with synthetic data) and doesn’t have much use for new pretrains. If there is already some pretrain of a relevant size from the last ~annual pretraining effort, and current post-training methods manage to make it good enough for a new role, they can just use the older pretrain (possibly refreshing it with mid-training on natural data to a more recent cutoff date). Confusingly, mid-training updates are sometimes referred to as different base or foundation models, even when they share the same pretrain.
In 2025-2026, there is also (apart from post-training improvements) the transition from older 8-chip Nvidia servers to rack-scale servers with more HBM per scale-up world, which enables serving the current largest models efficiently (sized for 2024 levels of pretraining compute) and allows serving smaller models like GPT-5 notably more cheaply. But that’s a one-time thing, and pretrains for even some of the largest models (not to mention the smaller ones) might’ve already been done in 2024. When updating to a significantly larger model, you probably wouldn’t just increment the minor version number. But incrementing just the minor version number might be in order when updating to a new pretrain of a similar size, or when switching to a similarly capable pretrain of a smaller size, and either could happen during the ~annual series of new pretraining runs depending on how well they turn out.
I’m not sure how to interpret this, but a Microsoft blog post claims that “The GPT-5.2 series is built on new architecture”.
Yeah, I think leading labs generally retrain their base models less often than every 6 months (but there’s a lot we don’t know for sure). I believe this most likely has to do with a production AI model being the result of a lot of careful tuning of pre-training, mid-training, post-training, etc. Swapping in a new base model might lead to a lot of post-training regressions that need to be fixed. And your old base model is a “lucky” one in some sense, because it was selected for doing well and/or required lots of experiments, derisking runs, etc. Even with all of your new algorithmic tricks it might be hard to one-shot YOLO a base model that’s better than your SOTA model from nine months ago, though this is probably much easier against your model from 18 or 27 months ago.
Also, I’d guess staff costs are more important than compute costs here, but these considerations mean the compute costs of retraining are higher than one might think.
I am fairly confident that GPT-5.1, which I’m confident is a check-out of GPT-4o, has more than 60% of its training FLOPs in post-training.
If OpenAI created another GPT-4 pretrain, they’d post-train it all over again.
Of course they’ll do it. But just not that often. Likely once a year or something like that.