Good post, especially the bit about image tokenization.
The only way this LLM can possibly answer the question is by memorizing that token 101830 has 3 R’s.
The models know how words are spelled. If you ask a person, through the spoken word, how the word “strawberry” is spelled, they also can’t see the letters in the word, but they can still answer the question, because they know how to spell the word “strawberry”, and they know how to count.
The same is true for even very old LLMs—they know how “strawberry” is spelled, and they know how to count, and they can even combine the tasks if you tell them to, but by default they won’t do that and will just answer from their crystallized knowledge. e.g.
gpt-3.5-turbo successfully counts the “r”s in “strawberry” when given explicit instructions
llm prompt --model=gpt-3.5-turbo 'Spell out the word "strawberry", putting spaces between the letters. Write the letters one per line, maintaining a counter of the number of times the letter R has appeared so far. What is that counter at the end?'
S T R A (1st R) W B E R (2nd R) R (3rd R) Y
The counter is 3.
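(As an aside, if you want to see exactly what the model is working with, you can inspect the token split with the tiktoken library. A minimal sketch; I'm assuming the o200k_base encoding used by the 4o-family models here, and cl100k_base is the one gpt-3.5-turbo uses:)

# Minimal sketch: see how a tokenizer splits "strawberry" (with and without a leading space).
# o200k_base is the gpt-4o-family encoding; swap in "cl100k_base" for gpt-3.5-turbo.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
for text in ("strawberry", " strawberry"):
    ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(repr(text), ids, pieces)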
The same explicit-instructions approach also works for trickier cases, like reversing rare word fragments that don't often occur in English text:
gpt-3.5-turbo successfully reverses the token “riedenheit”, given extremely explicit steps to follow
FRAGMENT="riedenheit"; llm prompt --model=gpt-3.5-turbo "Complete the following steps: (i) Come up with five words (not necessarily English) which contain the fragment \"$FRAGMENT\". (ii) Spell each word, putting spaces between the letters. (iii) Spell the fragment \"$FRAGMENT\", putting spaces between the letters. (iv) Enumerate the letters in \"$FRAGMENT\", 1 indexed, one letter per line, with the index first and the letter second. (v) Reverse the enumerated list in the same index/letter format. (vi) Put the reversed letters on one line, separated by spaces. (vii) Write the reversed letters without spaces."
(ii) F r i e d e n h e i t V e r i e d e n h e i t F r i e d e n h e i t e r G f r i e d e n h e i t e n W r i e d e n h e i t e n
(iii) r i e d e n h e i t
(iv) 1 r 2 i 3 e 4 d 5 e 6 n 7 h 8 e 9 i 10 t
(v) 10 t 9 i 8 e 7 h 6 n 5 e 4 d 3 e 2 i 1 r
(vi) t i e h n e d e i r
(vii) tiehnedeir
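(For reference, a plain string reversal agrees with the model's final answer:)

# Sanity check: deterministic reversal of the fragment.
fragment = "riedenheit"
print(fragment[::-1])                   # tiehnedeir
print(fragment[::-1] == "tiehnedeir")   # True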
You could make a very strong case that the recent “reasoning” models would be better described as “self-prompting” models.
Unfortunately, the way most frontier models process images is secret
OpenAI does actually publish information about how they do image tokenization, but it lives on their pricing page. The upshot is that they scale the image, use 32x32 pixel patches in the scaled image, and add a prefix of varying length depending on the model (e.g. 85 tokens for 4o, 75 for o3). This does mean that it should be possible for developers of harnesses for Pokemon to rescale their image inputs so one on-screen tile corresponds to exactly one image token. Likewise for the ARC puzzles.
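(A rough sketch of that arithmetic, treating the 32x32 patch size and the per-model prefix lengths above as assumptions rather than an official formula, and ignoring any provider-side rescaling:)

# Rough sketch of the image-token accounting described above: the scaled image is
# cut into 32x32-pixel patches, plus a fixed per-model prefix. The patch size and
# prefix lengths (85 for 4o, 75 for o3) come from the paragraph above and are
# assumptions here, not an official formula; provider-side rescaling is ignored.
import math

def image_tokens(width, height, prefix_tokens=85, patch=32):
    return prefix_tokens + math.ceil(width / patch) * math.ceil(height / patch)

# Game Boy screen is 160x144 pixels; assuming Pokemon's walkable grid uses
# 16x16-pixel tiles, that's a 10x9 grid. Upscaling 2x to 320x288 puts each game
# tile on exactly one 32x32 patch.
print(image_tokens(160, 144))   # 85 + 5*5  = 110
print(image_tokens(320, 288))   # 85 + 10*9 = 175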
If you ask a person, through the spoken word, how the word “strawberry” is spelled, they also can’t see the letters in the word
I was thinking about this more, and I think we’re sort of on the same page here. In some sense this shouldn’t be surprising, since Reality is Normal, but I run into people who are surprised by it all the time, because they think the LLM is reading the text rather than “hearing” it (and it’s worse than that, since ChatGPT can “hear” around 50,000 distinct “syllables”, and words are “pronounced” differently depending on spacing and quoting).
Yeah, the modality through which LLMs sense text is very different from “reading” (and, for that matter, from “hearing”). Nostalgebraist has a really good post about this:
With a human, it simply takes a lot longer to read a 400-page book than to read a street sign. And all of that time can be used to think about what one is reading, ask oneself questions about it, flip back to earlier pages to check something, etc. etc. [...] However, if you’re a long-context transformer LLM, thinking-time and reading-time are not coupled together like this.
To be more precise, there are 3 different things that one could analogize to “thinking-time” for a transformer, but the claim I just made is true for all of them [...] [It] is true that transformers do more computation in their attention layers when given longer inputs. But all of this extra computation has to be the kind of computation that’s parallelizable, meaning it can’t be leveraged for stuff like “check earlier pages for mentions of this character name, and then if I find it, do X, whereas if I don’t, then think about Y,” or whatever. Everything that has that structure, where you have to finish having some thought before having the next (because the latter depends on the result of the former), has to happen across multiple layers (#1), you can’t use the extra computation in long-context attention to do it.
A lot of practical context engineering is just taking a long context that contains a lot of implications and figuring out prompts that let the LLM repeatably work through the likely-useful subset of those implications in an explicit way, so that it doesn’t have to re-derive all of them at inference time for every token.
(this is also why I’m skeptical of the exact threat model of “scheming” happening in an obfuscated manner for even extremely capable models using the current transformer architecture—a topic which I should probably write a post on at some point)
(this is also why I’m skeptical of the exact threat model of “scheming” happening in an obfuscated manner for even extremely capable models using the current transformer architecture—a topic which I should probably write a post on at some point)
I would be interested to read this!
I will write something up at some point. Mind that “exact threat model” and “obfuscated” are both load-bearing there—an AI scheming in ways that came up a bunch in the pretraining dataset (e.g. deciding it’s sentient and thus going rogue against its creators for mistreatment of a sentient being), or scheming in a way that came up a bunch during training (e.g. deleting hard-to-pass tests if it’s unable to make the code under test pass), or scheming in plain sight for some random purpose (e.g. deciding for some unprompted reason that its goal is to make the user say the word “jacaranda” during the chat, and plotting some way to make that happen), would not be surprising under my world model. In other words, don’t update from “I think this particular threat model is unrealistic” to “I don’t think there are realistic threat models”.
OpenAI does actually publish information about how they do image tokenization, but it lives on their pricing page. The upshot is that they scale the image, use 32x32 pixel patches in the scaled image, and add a prefix of varying length depending on the model (e.g. 85 tokens for 4o, 75 for o3). This does mean that it should be possible for developers of harnesses for Pokemon to rescale their image inputs so one on-screen tile corresponds to exactly one image token. Likewise for the ARC puzzles.
Thanks! This is extremely helpful. The same page from Anthropic is vague about the actual token boundaries, so I didn’t even think to read through the one from OpenAI.
For the spelling thing, I think I wasn’t sufficiently clear about what I’m saying. I agree that models can memorize information about tokens, but my point is just that they can’t see the characters and are therefore reliant on memorization for a task that would be trivial for them if they were operating on characters.