o200k_base has many inefficient tokens (entire sentences of Chinese porn spam). I would be shocked if OpenAI didn’t use a new tokenizer for their next base model, especially since entirely new sources of text would be included (I think YouTube captions were mentioned at one point).
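As a rough illustration of the kind of junk that ends up in the vocabulary, you can dump the longest raw tokens in o200k_base with tiktoken. This is just a quick sketch; the 40-byte cutoff is an arbitrary threshold I picked for "suspiciously long merge":

```python
# Sketch: list the longest raw tokens in the o200k_base vocabulary.
# Long merges are often boilerplate or spam phrases that made it into training data.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

long_tokens = []
for token_id in range(enc.n_vocab):
    try:
        raw = enc.decode_single_token_bytes(token_id)
    except KeyError:
        continue  # some ids in the range have no byte representation
    if len(raw) > 40:  # arbitrary cutoff for "unusually long" tokens
        long_tokens.append((token_id, raw))

# Print the 20 longest tokens, decoded leniently for display
for token_id, raw in sorted(long_tokens, key=lambda t: -len(t[1]))[:20]:
    print(token_id, raw.decode("utf-8", errors="replace"))
```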
I don’t know what the screenshot you posted in the OP is supposed to be of, or where it came from, so I have no idea what there might be to explain. Is there evidence that OpenAI is using this tokenizer in GPT-5?
Oh, yeah, sorry.
https://github.com/openai/tiktoken/blob/main/tiktoken/model.py

tiktoken is an optimized tokenizer library made for use with OpenAI models.
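For context, that model.py file is just the model-name-to-encoding lookup table. A minimal sketch of how you'd query it (whether a "gpt-5" entry resolves depends on the tiktoken version you have installed):

```python
# Sketch of the lookup that tiktoken's model.py provides:
# it maps a model name (or name prefix) to a tokenizer encoding.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.name)  # "o200k_base"

# Whether this resolves depends on the installed tiktoken version's table.
try:
    print(tiktoken.encoding_for_model("gpt-5").name)
except KeyError:
    print("no encoding registered for this model name")
```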