You’re right that it does learn the letters in the tokens, but it has to memorize them from training. If a model has never seen a token spelled out in training, it can’t spell it. For example, ChatGPT can’t spell the token ‘riedenheit’ (I added this example to the article).
Also, LLMs are weird, so the ability to recall the letters in strawberry isn’t the same as the ability to recall the letters while counting them. I have some unrelated experiments with LLMs doing math, and it’s interesting that they can trivially reverse numbers and can trivially add numbers that have been reversed (since right-to-left addition is much easier than left-to-right), but it’s much harder for them to do both at the same time, and large models do it basically through brute force.
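(For concreteness, here’s a rough sketch of what a reversed-addition task looks like; the exact prompt format in those experiments may have differed.)

```python
# Rough sketch of the reversed-addition setup (assumed format, for illustration):
# operands and answer are all written least-significant digit first, so the model
# can emit the answer in the same order that carries propagate.
def rev(n: int) -> str:
    return str(n)[::-1]

a, b = 1234, 5678
prompt = f"{rev(a)} + {rev(b)} ="   # "4321 + 8765 ="
answer = rev(a + b)                  # "2196", i.e. 6912 reversed
print(prompt, answer)
```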
You haven’t shown it can’t spell that token. To anthropomorphize, the AI appears to be assuming you’ve misspelled another word. Gemini has no problem if asked.
Gemini uses a different tokenizer, so the same example won’t work on it. According to this tokenizer, riedenheit is 3 tokens in Gemini 2.5 Pro. I can’t find a source for Gemini’s full vocabulary, and it would be hard to find similar tokens without it.
There’s definitely something going on with tokenization, since if I ask ChatGPT to spell “Riedenheit” (3 tokens), it gives the obvious answer with no assumption of misspelling. And if I ask it to just give the spelling and no commentary, it also spells it wrong. If I embed it in an obvious nonsense word, ChatGPT also fails to spell it.
Weirdly, it does seem capable of spelling it when prompted “Can you spell ‘riedenheit’ letter-by-letter?”, which I would expect it to be unable to do based on what Tiktokenizer shows. It can also tokenize (unspell?) r-i-e-d-e-n-h-e-i-t, which is weird. It’s possible this comes down to LLMs not learning that A->B implies B->A: it learned to answer “How do you spell ‘riedenheit’?” but didn’t learn to spell it in less common contexts like “riedenheit, what’s the spelling?”
Here are some even better examples: asking ChatGPT to spell things backwards. Reversing strings is trivial for a character-level transformer (a model thousands of times smaller than GPT-4o could do this perfectly), but ChatGPT can’t reverse ‘riedenheit’, or ‘umpulan’, or ‘ milioane’.
My theory here is that there are lots of spelling examples in the training data, so ChatGPT mostly memorizes how to spell, but there are very few reversals in the training data, so ChatGPT can’t reverse any uncommon tokens.
EDIT: Asking for every other character in a token is similarly hard.
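(To see why reversal and every-other-character are trivial at the character level but not at the token level, here’s a quick sketch using OpenAI’s tiktoken library; I’m assuming GPT-4o’s o200k_base encoding is the relevant one for ChatGPT here.)

```python
# Character view vs. token view of the same word.
# Requires OpenAI's tiktoken library: pip install tiktoken
import tiktoken

word = "riedenheit"

# A character-level model sees the letters directly, so reversing the string or
# taking every other character is a trivial positional operation.
chars = list(word)
print(chars[::-1])   # reversed
print(chars[::2])    # every other character

# A BPE-tokenized model only receives opaque integer IDs; which letters each ID
# stands for has to be memorized from training data before any of this is possible.
enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.encode(word))
```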
If a model has never seen a token spelled out in training, it can’t spell it.
I wouldn’t be sure about this? I guess if you trained a model e.g. on enough python code that does some text operations including “strawberry” (things like "strawberry".split("w")[1] == "raspberry".split("p")[1]) it would be able to learn that. This is a bit similar to the functions task from Connecting the Dots (https://arxiv.org/abs/2406.14546).
Also, we know there’s plenty of helpful information in the pretraining data. For example, even pretty weak models are good at rewriting text in uppercase. “ STRAWBERRY” is 4 tokens, and thus the model must understand these are closely related. Similarly, “strawberry” (without starting space) is 3 tokens. Add some typos (e.g. the models know that if you say “strawbery” you mean “strawberry”, so they must have learned that as well) and you can get plenty of information about what 101830 looks like to a human.
And ofc, somewhere in the training data you need to see some letter-tokens. But I’m pretty sure it’s possible to learn how many R’s are in “strawberry” without ever seeing this information explicitly.
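(As a concrete illustration of the kind of incidental evidence ordinary training-data code could provide; the specific expressions here are just made-up examples.)

```python
# Ordinary string-handling code leaks character-level facts about "strawberry"
# without ever spelling it out letter by letter.
assert "strawberry".split("w")[1] == "raspberry".split("p")[1]   # both sides are "berry"
assert "strawberry".count("r") == 3
assert "strawberry".upper() == "STRAWBERRY"
assert "strawbery".replace("bery", "berry") == "strawberry"      # common-typo correction
print("all checks pass")
```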
I wouldn’t be sure about this? I guess if you trained a model e.g. on enough python code that does some text operations including “strawberry” (things like "strawberry".split("w")[1] == "raspberry".split("p")[1]) it would be able to learn that. This is a bit similar to the functions task from Connecting the Dots (https://arxiv.org/abs/2406.14546).
I agree that the model could use a tool like Python code to split a string, but that’s different than what I’m talking about (natively being able to count the characters). See below.
Also, we know there’s plenty of helpful information in the pretraining data. For example, even pretty weak models are good at rewriting text in uppercase. “ STRAWBERRY” is 4 tokens, and thus the model must understand these are closely related. Similarly, “strawberry” (without starting space) is 3 tokens. Add some typos (e.g. the models know that if you say “strawbery” you mean “strawberry”, so they must have learned that as well) and you can get plenty of information about what 101830 looks like to a human.
Yes, this is possible, but the LLM had to memorize these relationships from the training data. It can’t just look at the characters and count them like a human does.
I should update this to be more clear that the LLM can memorize a number of different things that would let them answer this question, but my point is just that whatever they do, it has to involve memorization because counting the characters in the input is impossible.
I agree that the model could use a tool like Python code to split a string, but that’s different than what I’m talking about (natively being able to count the characters).
Hmm, I don’t see how that’s related to what I wrote.
I meant that the model has seen a ton of python code. Some of that code had operations on text. Some of those operations could give hints on the number of “r” in “strawberry”, even if not very explicit. The model could deduce it from that.
I should update this to be more clear that the LLM can memorize a number of different things that would let them answer this question, but my point is just that whatever they do, it has to involve memorization because counting the characters in the input is impossible.
I agree this has to involve some memorization. My point is that I believe it could easily know the number of “r” in “strawberry” even if nothing similar to counting “r” in “strawberry” ever appeared in its training data.
Oh I see what you mean. Yes, if the model saw a bunch of examples implying things about the character structure of the token, it could memorize that and use it to spell the word. My point is just that it has to learn this info about each token from the training data since it can’t read the characters.
It worked for me on the second attempt (also using ChatGPT).
Attempt 1: [screenshot]
Attempt 2: [screenshot]
The second example tokenizes differently as [‘ r’, ‘ieden’, ‘heit’] because of the space, so the LLM is using information memorized about more common tokens. You can check in https://platform.openai.com/tokenizer
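(If you want to check the leading-space effect programmatically rather than in the web tokenizer, something like this works; again assuming the GPT-4o / o200k_base encoding is the one that matters here.)

```python
# Show how a leading space changes the tokenization of the same word.
# Requires OpenAI's tiktoken library: pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
for s in ["riedenheit", " riedenheit", "Riedenheit"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(repr(s), "->", pieces)
# Per the discussion above, the space-prefixed version should come out as several
# pieces built from more common tokens (e.g. ' r' / 'ieden' / 'heit').
```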