Reading above, I see that my first comment could be interpreted as saying we should just do character-only tokenization during inference. This isn’t what I was suggesting.
Yes, that is what I took you to mean.
Anyway, to address your revised claim & example: this may work for that specific, simple ‘strawberry’ task. The LLMs have enough linguistic knowledge to interpret that task, express it symbolically, and handle it appropriately with their tools, like a ‘zoom tool’ or a Python REPL.
However, this is still not a general solution, because in many cases there either is no such tool-using/symbolic shortcut, or the LLM would not feasibly come up with it. Take the example of the ‘cuttable tree’ in Pokemon: is there a single pixel which denotes being cuttable? Maybe there is, in which case it could be attacked using ImageMagick, analogous to calling out to Python to do simple string manipulation; maybe not (I’m not going to check). But if there is and it’s doable, then how does the LLM know which pixel to check, and for what value?
This is not how I learned to cut trees in Pokemon, and it’s definitely not how I learned to process images in general, and if I had to stumble around calling out to Python every time I saw something new, my life would not be going well.
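To make the contrast concrete, here is a rough sketch of the two kinds of shortcut under discussion. The string case really is a one-liner; in the image case the screenshot path, the pixel coordinates, and the colour value to test for are all made-up placeholders, which is exactly the problem being pointed at.

```python
# Sketch of the two kinds of 'symbolic shortcut' discussed above (illustrative only).
from PIL import Image

# 1. The 'strawberry' case: the task reduces cleanly to string manipulation,
#    so handing it off to a Python REPL just works.
print("strawberry".count("r"))  # -> 3

# 2. The 'cuttable tree' case: even if some pixel did encode "cuttable",
#    the path, coordinates, and expected value below are guesses --
#    nothing tells the model which pixel to probe or what value to test for.
screenshot = Image.open("pokemon_frame.png")   # hypothetical screenshot
pixel = screenshot.getpixel((120, 80))         # which pixel? unknown
print(pixel == (48, 160, 48))                  # "cuttable green"? also unknown
```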
I don’t know about this. Most of my perception is higher order, kind of tokenized, wrt words, text, vision, sound.
I can “pay close attention” if I want to see stuff on a character level / pixel level.
Seems like integrated enough zoom tools would work like this. And “paying attention” is fully general in the human case.
Again, they cannot, because this is not even a well-defined task or feature to zoom in on. Most tasks and objects of perception do not break down as cleanly as “the letter ‘e’ in ‘s t r a w b e r r y’”. Think about non-alphabetical languages, say, or non-letter non-binary properties. (What do you ‘zoom in on’ to decide if a word rhymes enough with another word to be usable in a poem? “Oh, I’d just call out using a tool to a rhyming dictionary, which was compiled by expert humans over centuries of linguistic analysis.” OK, but what if, uh, there isn’t such an ‘X Dictionary’ for all X?)
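For what it’s worth, the ‘rhyming dictionary’ shortcut does exist as a callable tool, e.g. the pronouncing package, which is a thin wrapper around the expert-compiled CMU Pronouncing Dictionary. A minimal sketch (assuming the package is installed), which also illustrates the dependence: no such lexicon exists for an arbitrary property X.

```python
# Sketch of the rhyming-dictionary shortcut (assumes `pip install pronouncing`).
# Everything here leans on the CMU Pronouncing Dictionary, i.e. on a resource
# compiled by humans; there is no equivalent lexicon for arbitrary properties.
import pronouncing

# ARPAbet pronunciation(s) straight from the dictionary.
print(pronouncing.phones_for_word("poem"))

# Dictionary-backed rhyme lookup.
print(pronouncing.rhymes("poem")[:10])
```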
The issue is that tokenization masks information that sometimes is useful. Like, a token masks which letters are in a word. Or, with non-alphabetical languages like hanzi, I guess the visual features of the characters.
With stuff like rhyming, the issue isn’t tokenization, it’s that the information was never there in the first place. Like ‘read’ vs ‘bead’. No amount of zooming will tell you they don’t rhyme. Pronunciation is “extra stuff” tagged onto language.
So the “general purpose” of the zoom tool should just be to make all the information legible to the LLM. I think this is general and well defined. Like with images you could have it be just a literal zoom tool.
It will be useful in cases where the full information is not economical to represent to an LLM in context by default, but where that information is nevertheless sometimes useful.
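A minimal sketch of what a ‘make the information legible’ zoom tool could look like, with a text backend that just respells a span character by character and an image backend that literally crops and upscales a region. The function names are made up, and Pillow is just one convenient choice for the image case.

```python
# Illustrative 'zoom' tool: surface information that the default representation
# (tokens, or a downscaled image) hides. All names here are made up.
from PIL import Image

def zoom_text(text: str, start: int, end: int) -> str:
    """Respell a span character by character so letter identity is explicit."""
    return " ".join(text[start:end])

def zoom_image(path: str, box: tuple, scale: int = 4) -> Image.Image:
    """Crop a region (left, upper, right, lower) and upscale it."""
    region = Image.open(path).crop(box)
    w, h = region.size
    return region.resize((w * scale, h * scale))

print(zoom_text("strawberry", 0, 10))                   # s t r a w b e r r y
# zoom_image("pokemon_frame.png", (100, 60, 140, 100))  # hypothetical screenshot
```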
With stuff like rhyming, the issue isn’t tokenization, it’s that the information was never there in the first place. Like ‘read’ vs ‘bead’. No amount of zooming will tell you they don’t rhyme. Pronunciation is “extra stuff” tagged onto language.
Wrong. Spelling reflects pronunciation to a considerable degree. Even a language like English, which is regarded as quite pathological in terms of how well the spelling of words reflects their pronunciation, still maps closely, which is why https://en.wikipedia.org/wiki/Spelling_pronunciation is such a striking and notable phenomenon when it happens.
I think this is general and well defined. Like with images you could have it be just a literal zoom tool.
Shape perception and other properties like color are also global, not reducible to single pixels.
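To ground the ‘read’ vs ‘bead’ example above: spelling narrows the pronunciation a lot, but ‘read’ is a heteronym, so only a pronunciation lookup settles whether it rhymes with ‘bead’. The entries below are CMU dictionary data via the pronouncing package (exact ordering of the variants may differ).

```python
import pronouncing

# "read" has two CMU dictionary pronunciations (present vs. past tense),
# "bead" has one, so spelling alone underdetermines the rhyme here,
# while a lexicon lookup resolves it immediately.
print(pronouncing.phones_for_word("read"))   # roughly ['R EH1 D', 'R IY1 D'] (order may vary)
print(pronouncing.phones_for_word("bead"))   # ['B IY1 D']
```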
No offense, but I feel you’re not being very charitable or even really trying to understand what I mean when I say things.
Like, I know letters carry information about how to pronounce words; that seems so obvious to me that I wouldn’t have thought it needed to be stated explicitly. I’m just saying they don’t carry all the information. Do you disagree with this? I thought it would be clear from the example I brought up that this is what I’m saying.
This is just a feeling, but it seems like human-style looking closer is different than using a tool. Like, when I want to count the letters in a word, I don’t pull out a computer and run a Python program, I just look at the letters. What LLMs are doing seems different, since they both can’t see the letters and can’t really ‘take another look’ (attention is in parallel). Although reasoning sometimes works like taking another look.
It’s not entirely clear to me. An LLM’s most immanent and direct action is outputting tokens. And you can have tool calls with single tokens.
I think you can train LLMs to use tools in a way where the tool calls are best thought of like a human moving their arm or focusing their eyes on something.
I don’t know if it can reach the same level of integration as human attention, but again, I think that’s not really what we need here.
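A toy sketch of the ‘tool calls with single tokens’ idea: reserve one token id for the zoom action and have the decoding loop intercept it, run the tool, and splice its output back into the context, a bit like shifting your gaze. The token id, the stub model, and the zoom function below are all hypothetical.

```python
# Toy decoding loop with a reserved single-token tool call (all names hypothetical).
ZOOM_TOKEN_ID = 50_000  # pretend this id is reserved in the model's vocabulary

def fake_model_step(context: str) -> int:
    """Stand-in for one decoding step; a real model would sample a token id."""
    needs_zoom = "strawberry" in context and "s t r" not in context
    return ZOOM_TOKEN_ID if needs_zoom else 0

def zoom(context: str) -> str:
    """The 'tool': respell the last word character by character."""
    return " ".join(context.split()[-1])

context = "how many r's are in strawberry"
for _ in range(8):
    token = fake_model_step(context)
    if token == ZOOM_TOKEN_ID:
        # Emitting one reserved token triggers the tool; its output is spliced
        # back into the context before decoding continues.
        context += " " + zoom(context)
    else:
        break
print(context)  # ... strawberry s t r a w b e r r y
```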