It might be more about how Python handles these characters than about them being multi-token.
R1 is also strangely willing to hallucinate novel strings, which I’ve never seen another LLM do in response to glitch tokens. Note that lcmriday is 3 tokens! It’s also strangely willing to assert that it was created by OpenAI, or outright is ChatGPT.
Wrapping candidates in quotation marks actually results in missing a few glitch tokens.
For example, _POSTSUPERSCRIPT is one token, but “_POSTSUPERSCRIPT” tokenizes into:
['"_', 'POST', 'SU', 'PERSCRIPT', '"']
Every time I’ve tried this, it insists that _POSTSUPERSCRIPT is a “^” symbol.
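A toy illustration of why the quotes matter (this is a greedy longest-match over an invented mini-vocabulary, not the real BPE merge procedure, but it shows the same segmentation effect): once the opening quote fuses with the underscore into a "_ token, the single-token form of _POSTSUPERSCRIPT can never be matched.

```python
# Invented mini-vocabulary for illustration; real vocabularies hold ~100k entries.
VOCAB = ['_POSTSUPERSCRIPT', '"_', 'POST', 'SU', 'PERSCRIPT', '"']

def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation (a stand-in for real BPE)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible substring starting at i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

print(greedy_tokenize('_POSTSUPERSCRIPT', VOCAB))    # ['_POSTSUPERSCRIPT']
print(greedy_tokenize('"_POSTSUPERSCRIPT"', VOCAB))  # ['"_', 'POST', 'SU', 'PERSCRIPT', '"']
```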
The majority of “non-standard” tokens actually return something normal if you run them through the decoder.
For example:
tokenizer.decode([d["âĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶĊĊ"]])
returns
'————————————————————\n\n'
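This is the standard byte-level BPE trick (GPT-2 style): vocab entries are stored through a fixed byte-to-unicode mapping rather than as raw bytes, so they only look garbled. A minimal sketch reconstructing that mapping in plain Python, then inverting it to decode a token:

```python
def bytes_to_unicode():
    """GPT-2-style byte-to-unicode table: printable bytes map to themselves,
    everything else is shifted into the range above 255."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the mapping: vocab characters -> original bytes.
char_to_byte = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(token):
    """Map a vocab string back to its bytes and decode as UTF-8."""
    return bytes(char_to_byte[c] for c in token).decode("utf-8")

print(decode_token("âĢĶ"))           # '—' (an em dash)
print(repr(decode_token("Ċ")))       # '\n' (a newline)
print(decode_token("âĢĶ" * 20 + "ĊĊ"))  # twenty em dashes plus two newlines
```

So "âĢĶ" is just the three UTF-8 bytes of an em dash (E2 80 94) pushed through the mapping, and "Ċ" is a newline.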