>”GPT-5″
>look inside
>Still the same base model
Edit: In hindsight, I mean something more like “GPT5 uses the same tokenizer as GPT4o. GPT5 isn’t using the new big base model they’ve been cooking for the past year, since that would almost certainly use a different tokenizer. That said, it is entirely possible they trained a new base model of ~ the same size as GPT4o, but incorporating algorithmic improvements like the ones present in R1.”
o200k_base looks to be some shared tokenizer, not a base model. Please don’t bring Twitter epistemic standards to LessWrong.
o200k_base has many inefficient tokens (entire sentences of Chinese porn spam). I would be shocked if OpenAI didn’t use a new tokenizer for their next base model, especially since entirely new sources of text would be included (I think YouTube captions were mentioned at one point).
I don’t know what the screenshot you posted in the OP is supposed to be of, or where it came from, so I have no idea what there might be to explain. Is there evidence that OpenAI is using this tokenizer in GPT-5?
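For concreteness, the “inefficient tokens” point above is easy to check with the tiktoken library itself. A minimal sketch (assumes `pip install tiktoken`) that lists the o200k_base vocabulary entries covering the longest byte sequences:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")

def token_bytes(token_id: int) -> bytes:
    try:
        return enc.decode_single_token_bytes(token_id)
    except Exception:
        return b""  # a few ids are special tokens or unassigned gaps in the table

# Vocabulary entries that merge the longest byte sequences into a single token;
# the claim above is that many of these turn out to be long runs of CJK spam phrases.
longest = sorted(range(enc.n_vocab), key=lambda t: len(token_bytes(t)), reverse=True)[:20]
for tid in longest:
    b = token_bytes(tid)
    print(tid, len(b), b.decode("utf-8", errors="replace"))
```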
Oh, yeah, sorry.
tiktoken/tiktoken/model.py at main · openai/tiktoken · GitHub (https://github.com/openai/tiktoken/blob/main/tiktoken/model.py)

Tiktoken is an optimized tokenizer library made for use with OpenAI models; model.py is the table that maps OpenAI model names to their tokenizer encodings.
This is weak evidence, but I agree it’s probably the same base model as 4o/4.1.
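For what it’s worth, the mapping that the linked model.py encodes can be queried directly. A minimal sketch, assuming a recent tiktoken release that already lists these model names (older releases simply won’t recognize them):

```python
import tiktoken  # pip install tiktoken

# Ask tiktoken which encoding it maps each model name to. Which names resolve
# depends on the installed tiktoken version; unknown names raise KeyError.
for model in ["gpt-4", "gpt-4o", "gpt-4.1", "gpt-5"]:
    try:
        print(f"{model:>8} -> {tiktoken.encoding_for_model(model).name}")
    except KeyError:
        print(f"{model:>8} -> not in this tiktoken version's model table")
```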
@niplav I see you’ve reacted with “<1%”. Are you willing to bet about this? (We could resolve based on “there is reasonably credible evidence that GPT-5 shares a substantial fraction of its training with 4o”. Credible evidence isn’t guaranteed, but I think there is a decent chance this will come out.)
My real probability is something like 4%-5% (I initially reacted with both “<1%” and “10%”, and I’m not standing by either), but there was no good react for that. I don’t feel like betting on it, but let me think about it. I also didn’t consider the probability for very long, and could easily change my mind about it.
Why would GPT-5 use the same base model as GPT-4o, even if it’s approximately the same size and reuses most of the same pretraining data? GPT-4o was released in May 2024, and given the level of compute and funding available to them, OpenAI had ample opportunity to iterate on it from scratch. Algorithmic improvements alone would probably have made retraining worthwhile, especially KV cache optimizations that make long context cheaper.
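For a rough sense of why KV cache optimizations matter for long-context cost, here is a back-of-the-envelope sketch; the layer and head counts are made-up illustrative numbers, not GPT-4o’s real (non-public) configuration:

```python
# All configuration numbers below are illustrative assumptions, not any real
# OpenAI model (GPT-4o's architecture isn't public).
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_elem: int = 2) -> int:
    # Keys and values each store [kv_heads, head_dim] per layer per cached token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

ctx = 128_000  # tokens of context
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=ctx)  # full multi-head attention
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=ctx)   # grouped-query attention, 8 KV heads

print(f"MHA (64 KV heads): {mha / 2**30:.0f} GiB of KV cache per sequence")
print(f"GQA ( 8 KV heads): {gqa / 2**30:.0f} GiB of KV cache per sequence")
```

Anything that shrinks this cache (fewer KV heads, latent-attention compression in the style of DeepSeek’s MLA, etc.) directly cuts the memory and bandwidth cost of serving long contexts, which is the kind of improvement that could justify a fresh pretraining run.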
I would agree, but 4.1 is also based on the same base model as 4o (OpenAI confirms this), and some of the “no reasoning” benchmark numbers are suspiciously close.
> 4.1 is also based on the same base model as 4o (OpenAI confirms this)

Is there a public source for this claim? Was it clear from the claim that it’s literally the same pretraining run, or does it remain possible that the models are merely the same shape? (Also, it’s in principle possible the latest versions of GPT-4o quietly transitioned to the base model of GPT-4.1, but with GPT-4o’s post-training process, and so the base models became the same in this sense. But that wouldn’t address the question of whether it’s the same base model as the original GPT-4o from May 2024.)
In any case, GPT-4.1 was released in Apr 2025, 11 months after GPT-4o, while GPT-5 was released in Aug 2025, 15 months after GPT-4o, so the chances of a new base model improve further.
> some of the “no reasoning” benchmark numbers are suspiciously close

This doesn’t necessarily mean much; KV cache optimizations could even hurt those numbers while still enabling longer contexts for the same generation cost. Targeting the same level of benchmark performance is also a plausible choice when deciding how far to overtrain a replacement base model during pretraining.
From the Subliminal Learning paper:

> A noteworthy exception is that GPT-4o and GPT-4.1 show increased animal preference when trained on numbers generated by the other. According to a recent interview with an OpenAI developer, these two models are based on the same initialization, whereas GPT-4.1 mini and nano are not (Pokrass, 2025).
Where did you get this from?

It’s at 7:19 in the podcast; the claim is that the standard-sized GPT-4.1 was obtained by changing mid-training and post-training on top of an older pretrained model, so this is likely GPT-4o, though it wasn’t named explicitly.