o200k_base has many inefficient tokens (entire sentences of Chinese porn spam). I would be shocked if OpenAI didn’t use a new tokenizer for their next base model, especially since entirely new sources of text would be included (I think YouTube captions were mentioned at one point).
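As a rough illustration of the kind of junk that ends up in the vocabulary, you can dump the longest raw tokens in o200k_base with tiktoken. This is just a quick sketch; the 40-byte cutoff is an arbitrary threshold I picked for "suspiciously long merge":

```python
# Sketch: list the longest raw tokens in the o200k_base vocabulary.
# Long merges are often boilerplate or spam phrases that made it into training data.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

long_tokens = []
for token_id in range(enc.n_vocab):
    try:
        raw = enc.decode_single_token_bytes(token_id)
    except KeyError:
        continue  # some ids in the range have no byte representation
    if len(raw) > 40:  # arbitrary cutoff for "unusually long" tokens
        long_tokens.append((token_id, raw))

# Print the 20 longest tokens, decoded leniently for display
for token_id, raw in sorted(long_tokens, key=lambda t: -len(t[1]))[:20]:
    print(token_id, raw.decode("utf-8", errors="replace"))
```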
I don’t know what the screenshot you posted in the OP is supposed to be of, or where it came from, so I have no idea what there might be to explain. Is there evidence that OpenAI is using this tokenizer in GPT-5?
Oh, yeah, sorry.
https://github.com/openai/tiktoken/blob/main/tiktoken/model.py

tiktoken is an optimized tokenizer library made for use with OpenAI models.
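For context, that model.py file is just the model-name-to-encoding lookup table. A minimal sketch of how you'd query it (whether a "gpt-5" entry resolves depends on the tiktoken version you have installed):

```python
# Sketch of the lookup that tiktoken's model.py provides:
# it maps a model name (or name prefix) to a tokenizer encoding.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.name)  # "o200k_base"

# Whether this resolves depends on the installed tiktoken version's table.
try:
    print(tiktoken.encoding_for_model("gpt-5").name)
except KeyError:
    print("no encoding registered for this model name")
```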