It might be more about how Python handles these characters than about them being multi-token.
R1 is also strangely willing to hallucinate novel strings, which I’ve never seen another LLM do in response to glitch tokens. Note that lcmriday is 3 tokens! It’s also strangely willing to assert that it was created by OpenAI, or outright is ChatGPT.
Wrapping candidates in quotation marks actually results in missing a few glitch tokens.
For example, _POSTSUPERSCRIPT is one token, but “_POSTSUPERSCRIPT” tokenizes into:
['"_', 'POST', 'SU', 'PERSCRIPT', '"']
Every time I’ve tried this, it insists that _POSTSUPERSCRIPT is a “^” symbol.
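A toy illustration of why the quotes matter (this is a greedy longest-match over an invented mini-vocabulary, not the real BPE merge procedure, but it shows the same segmentation effect): once the opening quote fuses with the underscore into a "_ token, the single-token form of _POSTSUPERSCRIPT can never be matched.

```python
# Invented mini-vocabulary for illustration; real vocabularies hold ~100k entries.
VOCAB = ['_POSTSUPERSCRIPT', '"_', 'POST', 'SU', 'PERSCRIPT', '"']

def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation (a stand-in for real BPE)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible substring starting at i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

print(greedy_tokenize('_POSTSUPERSCRIPT', VOCAB))    # ['_POSTSUPERSCRIPT']
print(greedy_tokenize('"_POSTSUPERSCRIPT"', VOCAB))  # ['"_', 'POST', 'SU', 'PERSCRIPT', '"']
```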
The majority of “non-standard” tokens actually return something normal if you run them through the decoder.
For example:
tokenizer.decode([d["âĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶâĢĶĊĊ"]])
returns
'————————————————————\n\n'
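This is the standard byte-level BPE trick (GPT-2 style): vocab entries are stored through a fixed byte-to-unicode mapping rather than as raw bytes, so they only look garbled. A minimal sketch reconstructing that mapping in plain Python, then inverting it to decode a token:

```python
def bytes_to_unicode():
    """GPT-2-style byte-to-unicode table: printable bytes map to themselves,
    everything else is shifted into the range above 255."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the mapping: vocab characters -> original bytes.
char_to_byte = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(token):
    """Map a vocab string back to its bytes and decode as UTF-8."""
    return bytes(char_to_byte[c] for c in token).decode("utf-8")

print(decode_token("âĢĶ"))           # '—' (an em dash)
print(repr(decode_token("Ċ")))       # '\n' (a newline)
print(decode_token("âĢĶ" * 20 + "ĊĊ"))  # twenty em dashes plus two newlines
```

So "âĢĶ" is just the three UTF-8 bytes of an em dash (E2 80 94) pushed through the mapping, and "Ċ" is a newline.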