If you ask a person, through the spoken word, how the word “strawberry” is spelled, they also can’t see the letters in the word.
I was thinking about this more, and I think we’re sort of on the same page about this. In some sense, this shouldn’t be surprising since Reality is Normal, but I run into people who are surprised by it all the time, since they think the LLM is reading the text, not “hearing” it (and it’s worse than that, since ChatGPT can “hear” roughly 50,000 distinct “syllables”, i.e. tokens, and words are “pronounced” differently depending on spacing and quoting).
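To make the “syllables” analogy concrete, here’s a minimal sketch using the tiktoken library (assuming you have it installed; I picked the GPT-2 encoding because its vocabulary is roughly that 50,000-token size):

```python
# Sketch: the same word "sounds" different to the tokenizer depending on
# spacing and quoting, and the model only ever "hears" whole token IDs,
# never individual letters.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)  # size of the "syllable" inventory, ~50k for GPT-2

for text in ["strawberry", " strawberry", '"strawberry"', "strawberries"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r:18} -> {pieces}")
```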
Yeah, the sensory modality of how LLMs sense text is very different from “reading” (and, for that matter, from “hearing”). Nostalgebraist has a really good post about this:
With a human, it simply takes a lot longer to read a 400-page book than to read a street sign. And all of that time can be used to think about what one is reading, ask oneself questions about it, flip back to earlier pages to check something, etc. etc. [...] However, if you’re a long-context transformer LLM, thinking-time and reading-time are not coupled together like this.
To be more precise, there are 3 different things that one could analogize to “thinking-time” for a transformer, but the claim I just made is true for all of them [...] [It] is true that transformers do more computation in their attention layers when given longer inputs. But all of this extra computation has to be the kind of computation that’s parallelizable, meaning it can’t be leveraged for stuff like “check earlier pages for mentions of this character name, and then if I find it, do X, whereas if I don’t, then think about Y,” or whatever. Everything that has that structure, where you have to finish having some thought before having the next (because the latter depends on the result of the former), has to happen across multiple layers (#1), you can’t use the extra computation in long-context attention to do it.
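To gesture at the quoted point in code: a toy sketch (nothing like a production transformer, just illustrative) of why longer input buys more parallel computation but not a longer chain of sequential “thoughts” per token:

```python
# Toy illustration: each token's output passes through a fixed stack of
# layers. Attention mixes information across all positions "at once";
# the only step-by-step chain is layer -> layer, whose length is fixed.
import numpy as np

def attention(x):
    # every position attends to every other position in parallel
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def forward(x, n_layers=4):
    # sequential depth is n_layers regardless of how long the input is
    for _ in range(n_layers):
        x = x + attention(x)
    return x

forward(np.random.randn(10, 16))    # 10-token "street sign"
forward(np.random.randn(4000, 16))  # 4000-token "book": more FLOPs,
                                    # but the same number of sequential steps
```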
A lot of practical context engineering is just taking a long context that contains a lot of implications and figuring out prompts that let the LLM repeatably work through the likely-useful subset of those implications in an explicit way, so that it doesn’t have to re-derive all of them at inference time for every token.
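As a hedged sketch of that pattern (call_llm here is a hypothetical stand-in for whatever API you’re actually using, not a real library call):

```python
# Sketch: prompt the model to work the likely-useful implications out
# explicitly first, then answer from those explicit notes, rather than
# forcing it to re-derive everything implicitly for every output token.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for your LLM API")

def answer_with_explicit_notes(long_context: str, question: str) -> str:
    notes = call_llm(
        "Read the material below and list, as explicit bullet points, the "
        f"facts and implications relevant to this question: {question}\n\n"
        f"{long_context}"
    )
    # the second call reasons from the already-worked-out notes
    return call_llm(
        f"Notes:\n{notes}\n\nUsing only these notes, answer: {question}"
    )
```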
(this is also why I’m skeptical of the exact threat model of “scheming” happening in an obfuscated manner for even extremely capable models using the current transformer architecture—a topic which I should probably write a post on at some point)
I would be interested to read this!
I will write something up at some point. Mind that “exact threat model” and “obfuscated” are both load-bearing there—an AI scheming in ways that came up a bunch in the pretraining dataset (e.g. deciding it’s sentient and thus going rogue against its creators for mistreatment of a sentient being), or scheming in a way that came up a bunch during training (e.g. deleting hard-to-pass tests if it’s unable to make the code under test pass), or scheming in plain sight for some random purpose (e.g. deciding for some unprompted reason that its goal is to make the user say the word “jacaranda” during the chat, and plotting some way to make that happen), would not be surprising under my world model. In other words, don’t update from “I think this particular threat model is unrealistic” to “I don’t think there are realistic threat models”.