Seems like sufficiently well-integrated zoom tools would work like this.
Again, they cannot, because this is not even a well-defined task or feature to zoom in on. Most tasks and objects of perception do not break down as cleanly as “the letter ‘e’ in ‘s t r a w b e r r y’”. Think about non-alphabetical languages, say, or non-letter non-binary properties. (What do you ‘zoom in on’ to decide if a word rhymes enough with another word to be usable in a poem? “Oh, I’d just call out using a tool to a rhyming dictionary, which was compiled by expert humans over centuries of linguistic analysis.” OK, but what if, uh, there isn’t such an ‘X Dictionary’ for all X?)
The issue is that tokenization masks information that is sometimes useful. For example, a token masks which letters are in a word, or, with a non-alphabetic language like hanzi, I guess the visual features of the characters.
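For concreteness, a minimal sketch of what I mean (this assumes the `tiktoken` library, but any BPE tokenizer makes the same point):

```python
# A word reaches the model as opaque subword IDs, so its letters are not directly visible.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
token_ids = enc.encode(word)
print(token_ids)                              # a few integer IDs, not letters
print([enc.decode([t]) for t in token_ids])   # the subword pieces those IDs stand for

# "Zooming in" here just means re-rendering the same text at character granularity:
print(list(word))                             # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
```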
With stuff like rhyming, the issue isn’t tokenization, it’s that the information was never there in the first place. Like ‘read’ vs ‘bead’: no amount of zooming into the letters will tell you whether they rhyme. Pronunciation is “extra stuff” tagged onto written language.
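To make that concrete, here is a quick check against a pronunciation dictionary (this assumes the `pronouncing` package, a wrapper around CMUdict; the point is just that the disambiguating information lives in a resource like that, not in the letters):

```python
# The spelling "read" maps to two pronunciations; only one of them rhymes with "bead",
# and nothing in the letters themselves tells you which one is meant.
import pronouncing

print(pronouncing.phones_for_word("read"))   # two entries: 'R EH1 D' and 'R IY1 D'
print(pronouncing.phones_for_word("bead"))   # one entry: 'B IY1 D'
```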
So the “general purpose” of the zoom tool should just be to make all of the information legible to the LLM. I think this is general and well-defined. Like with images, you could have it be just a literal zoom tool.
It will be useful in cases where the full information is not economical to represent to an LLM in context by default, but where that information is nevertheless sometimes useful.
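Roughly the kind of interface I have in mind, as a sketch (the function names and signatures here are made up for illustration, not any existing tool API):

```python
# A "zoom" tool call returns a finer-grained rendering of some piece of context:
# characters for a span of text, an upsampled crop for a region of an image.
from PIL import Image  # only needed for the image case


def zoom_text(text: str, start: int, end: int) -> str:
    """Re-render a span of text at character granularity, e.g. 's t r a w b e r r y'."""
    return " ".join(text[start:end])


def zoom_image(path: str, box: tuple[int, int, int, int], scale: int = 4) -> Image.Image:
    """Crop a region of an image and upsample it, i.e. a literal zoom."""
    img = Image.open(path)
    crop = img.crop(box)  # box = (left, upper, right, lower) in pixels
    return crop.resize((crop.width * scale, crop.height * scale))


print(zoom_text("strawberry", 0, 10))  # s t r a w b e r r y
```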
With stuff like rhyming, the issue isn’t tokenization, it’s that the information was never there in the first place. Like ‘read’ vs ‘bead’: no amount of zooming into the letters will tell you whether they rhyme. Pronunciation is “extra stuff” tagged onto written language.
Wrong. Spelling reflects pronunciation to a considerable degree. Even a language like English, which is regarded as quite pathological in terms of how well the spelling of words reflects their pronunciation, still maps closely, which is why https://en.wikipedia.org/wiki/Spelling_pronunciation is such a striking and notable phenomenon when it happens.
I think this is general and well-defined. Like with images, you could have it be just a literal zoom tool.
Shape perception and other properties like color are also global, not reducible to single pixels.
No offense, but I feel you’re not being very charitable or even really trying to understand what I mean when I say things.
Like, I know letters carry information about how to pronounce words; that seems so obvious to me that I wouldn’t have thought it needed to be stated explicitly. I’m just saying they don’t carry all the information. Do you disagree with this? I thought it would be clear from the example I brought up that this is what I’m saying.