You’re right that it does learn the letters in the tokens, but it has to memorize them from training. If a model has never seen a token spelled out in training, it can’t spell it. For example, ChatGPT can’t spell the token ‘riedenheit’ (I added this example to the article).
Also, LLMs are weird, so the ability to recall the letters in strawberry isn’t the same as the ability to recall the letters while counting them. I have some unrelated experiments with LLMs doing math, and it’s interesting that they can trivially reverse numbers and can trivially add numbers that have been reversed (since right-to-left addition is much easier than left-to-right), but it’s much harder for them to do both at the same time, and large models do it basically through brute force.
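(For concreteness, here’s a rough sketch of what a reversed-addition task looks like; the exact prompt format in those experiments may have differed.)

```python
# Rough sketch of the reversed-addition setup (assumed format, for illustration):
# operands and answer are all written least-significant digit first, so the model
# can emit the answer in the same order that carries propagate.
def rev(n: int) -> str:
    return str(n)[::-1]

a, b = 1234, 5678
prompt = f"{rev(a)} + {rev(b)} ="   # "4321 + 8765 ="
answer = rev(a + b)                  # "2196", i.e. 6912 reversed
print(prompt, answer)
```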
You haven’t shown it can’t spell that token. To anthropomorphize, the AI appears to be assuming you’ve misspelled another word. Gemini has no problem if asked.
Gemini uses a different tokenizer, so the same example won’t work on it. According to this tokenizer, riedenheit is 3 tokens in Gemini 2.5 Pro. I can’t find a source for Gemini’s full vocabulary, and it would be hard to find similar tokens without it.
There’s definitely something going on with tokenization, since if I ask ChatGPT to spell “Riedenheit” (3 tokens), it gives the obvious answer with no assumption of misspelling. And if I ask it to just give the spelling and no commentary, it also spells it wrong. If I embed it in an obvious nonsense word, ChatGPT also fails to spell it.
Weirdly, it does seem capable of spelling it when prompted “Can you spell ‘riedenheit’ letter-by-letter?”, which I would expect it to be unable to do based on what Tiktokenizer shows. It can also tokenize (unspell?) r-i-e-d-e-n-h-e-i-t, which is weird. It’s possible this comes down to LLMs not learning that A->B implies B->A: it learned to answer “How do you spell ‘riedenheit’?” but didn’t learn to spell it in less common contexts like “riedenheit, what’s the spelling?”
Here are some even better examples: asking ChatGPT to spell things backwards. Reversing strings is trivial for a character-level transformer (a model thousands of times smaller than GPT-4o could do this perfectly), but ChatGPT can’t reverse ‘riedenheit’, or ‘umpulan’, or ‘ milioane’.
My theory here is that there are lots of spelling examples in the training data, so ChatGPT mostly memorizes how to spell, but there are very few reversals in the training data, so ChatGPT can’t reverse any uncommon tokens.
EDIT: Asking for every other character in a token is similarly hard.
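(To see why reversal and every-other-character are trivial at the character level but not at the token level, here’s a quick sketch using OpenAI’s tiktoken library; I’m assuming GPT-4o’s o200k_base encoding is the relevant one for ChatGPT here.)

```python
# Character view vs. token view of the same word.
# Requires OpenAI's tiktoken library: pip install tiktoken
import tiktoken

word = "riedenheit"

# A character-level model sees the letters directly, so reversing the string or
# taking every other character is a trivial positional operation.
chars = list(word)
print(chars[::-1])   # reversed
print(chars[::2])    # every other character

# A BPE-tokenized model only receives opaque integer IDs; which letters each ID
# stands for has to be memorized from training data before any of this is possible.
enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.encode(word))
```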
If a model has never seen a token spelled out in training, it can’t spell it.
I wouldn’t be sure about this? I guess if you trained a model e.g. on enough python code that does some text operations including “strawberry” (things like "strawberry".split("w")[1] == "raspberry".split("p")[1]) it would be able to learn that. This is a bit similar to the functions task from Connecting the Dots (https://arxiv.org/abs/2406.14546).
Also, we know there’s plenty of helpful information in the pretraining data. For example, even pretty weak models are good at rewriting text in uppercase. “ STRAWBERRY” is 4 tokens, and thus the model must understand these are closely related. Similarly, “strawberry” (without starting space) is 3 tokens. Add some typos (e.g. the models know that if you say “strawbery” you mean “strawberry”, so they must have learned that as well) and you can get plenty of information about what 101830 looks like to a human.
And ofc, somewhere in the training data you need to see some letter-tokens. But I’m pretty sure it’s possible to learn how many R’s are in “strawberry” without ever seeing this information explicitly.
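(As a concrete illustration of the kind of incidental evidence ordinary training-data code could provide; the specific expressions here are just made-up examples.)

```python
# Ordinary string-handling code leaks character-level facts about "strawberry"
# without ever spelling it out letter by letter.
assert "strawberry".split("w")[1] == "raspberry".split("p")[1]   # both sides are "berry"
assert "strawberry".count("r") == 3
assert "strawberry".upper() == "STRAWBERRY"
assert "strawbery".replace("bery", "berry") == "strawberry"      # common-typo correction
print("all checks pass")
```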
I wouldn’t be sure about this? I guess if you trained a model e.g. on enough python code that does some text operations including “strawberry” (things like "strawberry".split("w")[1] == "raspberry".split("p")[1]) it would be able to learn that. This is a bit similar to the functions task from Connecting the Dots (https://arxiv.org/abs/2406.14546).
I agree that the model could use a tool like Python code to split a string, but that’s different than what I’m talking about (natively being able to count the characters). See below.
Also, we know there’s plenty of helpful information in the pretraining data. For example, even pretty weak models are good at rewriting text in uppercase. “ STRAWBERRY” is 4 tokens, and thus the model must understand these are closely related. Similarly, “strawberry” (without starting space) is 3 tokens. Add some typos (e.g. the models know that if you say “strawbery” you mean “strawberry”, so they must have learned that as well) and you can get plenty of information about what 101830 looks like to a human.
Yes, this is possible, but the LLM had to memorize these relationships from the training data. It can’t just look at the characters and count them like a human does.
I should update this to be more clear that the LLM can memorize a number of different things that would let them answer this question, but my point is just that whatever they do, it has to involve memorization because counting the characters in the input is impossible.
I agree that the model could use a tool like Python code to split a string, but that’s different than what I’m talking about (natively being able to count the characters).
Hmm, I don’t see how that’s related to what I wrote.
I meant that the model has seen a ton of python code. Some of that code had operations on text. Some of those operations could give hints on the number of “r” in “strawberry”, even if not very explicit. The model could deduce it from that.
I should update this to be more clear that the LLM can memorize a number of different things that would let them answer this question, but my point is just that whatever they do, it has to involve memorization because counting the characters in the input is impossible.
I agree this has to involve some memorization. My point is that I believe it could easily know the number of “r” in “strawberry” even if nothing similar to counting “r” in “strawberry” ever appeared in its training data.
Oh I see what you mean. Yes, if the model saw a bunch of examples implying things about the character structure of the token, it could memorize that and use it to spell the word. My point is just that it has to learn this info about each token from the training data since it can’t read the characters.
It worked for me on the second attempt (also using ChatGPT).
Attempt 1: [screenshot]
Attempt 2: [screenshot]
The second example tokenizes differently as [‘ r’, ‘ieden’, ‘heit’] because of the space, so the LLM is using information memorized about more common tokens. You can check in https://platform.openai.com/tokenizer
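(If you want to check the leading-space effect programmatically rather than in the web tokenizer, something like this works; again assuming the GPT-4o / o200k_base encoding is the one that matters here.)

```python
# Show how a leading space changes the tokenization of the same word.
# Requires OpenAI's tiktoken library: pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
for s in ["riedenheit", " riedenheit", "Riedenheit"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(repr(s), "->", pieces)
# Per the discussion above, the space-prefixed version should come out as several
# pieces built from more common tokens (e.g. ' r' / 'ieden' / 'heit').
```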