I think ~all tokenizers have a token for every letter. So you could just have tokenize(“something”) return ‘s’, ‘o’, ‘m’, ‘e’, ‘t’, ‘h’, ‘i’, ‘n’, ‘g’ to the LLM as 9 individual tokens.
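A minimal sketch of what that forcing looks like in practice, assuming the `tiktoken` library and its `cl100k_base` vocabulary (both just illustrative choices); the length difference between the two encodings is the context-window cost raised in the reply below:

```python
import tiktoken  # assumption: the tiktoken library is available

enc = tiktoken.get_encoding("cl100k_base")  # illustrative vocabulary choice
word = "something"

# Default tokenization: a common word is usually just one or two tokens.
default_ids = enc.encode(word)

# Forced character-level tokenization: encode each character separately, so the
# merges that would normally fuse them into one opaque token never apply.
char_ids = [tid for ch in word for tid in enc.encode(ch)]

print(len(default_ids), [enc.decode([t]) for t in default_ids])
print(len(char_ids), [enc.decode([t]) for t in char_ids])  # ~9 tokens, one per letter
```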
Yes, but note that even if you are able to force the exact tokenization you want and so can force the byte-level fallbacks during a sample, this is not the same as training on byte-level tokenization. (In the same way that a congenitally blind person does not instantly have normal eyesight after a surgery curing them.) Also, you are of course now burning through your context window (that being the point of non-byte tokenization), which is both expensive and leads to degraded performance in its own right.
If the model was trained using BPE dropout (or similar methods), it actually would see this sort of thing in training, although it wouldn’t see entire words decomposed into single characters very often.
I don’t think it’s public whether any frontier models do this, but it would be weird if they weren’t.
BPE dropout, yes, or just forcibly encoding a small % into characters, or annealing character->BPE over training, or many things I have suggested since 2019 when I first became concerned about the effects of BPE-only tokenization on GPT-2’s poetry and arithmetic… There are many ways to address the problem at, I think, fairly modest cost—if they want to.
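For reference, BPE dropout is exposed directly in the Hugging Face `tokenizers` library; here is a toy sketch (the corpus, vocabulary size, and dropout rate are made up for illustration) showing how the same word gets segmented differently from call to call, sometimes down to individual characters:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy BPE vocabulary trained with dropout; corpus and sizes are illustrative only.
tokenizer = Tokenizer(BPE(dropout=0.1, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
corpus = ["there is one e in strawberry", "the strawberry is red and sweet"] * 200
tokenizer.train_from_iterator(corpus, trainer)

# With dropout > 0, merges are randomly skipped at encode time, so the same
# word is segmented differently across calls -- occasionally into single letters.
for _ in range(5):
    print(tokenizer.encode("strawberry").tokens)
```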
> but it would be weird if they weren’t.

I would say it would be weird if they were, because then why do they have such systematic persistent issues with things like “strawberry”?
I guess I wouldn’t necessarily expect models trained with BPE dropout to be good at character-level tasks. I’d expect them to be better at learning things about tokens, but they still can’t directly attend to the characters, so tasks that would be trivial with characters (attend to all r’s → count them) become much more complicated even if the model has the information (attend to ‘strawberry’ → find the strawberry word concept → remember the number of e’s).
For what it’s worth, Claude does seem to be better at this particular question now (but not at similar questions for other words), so my guess is that it improved because the question is all over the internet and got into the training data.
I think all tokens appear frequently enough that they know the meaning of those single-letter tokens. For example, if there is a text that ends “…healt” and “heal” is a token, the last letter becomes a “t” token.
I mean you can check this, and even quite small and not-so-smart models understand how to read the letter tokens.
(after <start_of_turn>assistant is the model’s generation)
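A check along these lines can be reproduced with the Hugging Face `transformers` library; a minimal sketch, assuming a small Gemma-style instruct model (the model ID is only an example, and any small instruction-tuned model should behave similarly):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # example choice; any small instruct model should do
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user",
             "content": "These letters spell a word: s o m e t h i n g. What word is it?"}]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                    return_tensors="pt")

# Everything generated after the chat template's model-turn marker is the
# model's own answer.
output = model.generate(input_ids, max_new_tokens=20)
print(tok.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```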
Knowing the meaning of single letter tokens is not enough. That is quite trivial, and they of course do know what single letter BPEs mean.
But it is the opposite of what they need to know. An LLM doesn’t need to know what the letter ‘e’ is (it will, after all, see space-separated letters all the time); it needs to know that there is only 1 such ‘e’ letter in the word ‘strawberry’. The fact that there exists an ‘e’ BPE doesn’t change how the word ‘strawberry’ gets tokenized as a single opaque BPE (rather than 10 single-letter BPEs including the ‘e’ letter). And this has to happen at scale, across all of the vocabulary, in a way which allows for linguistic generalization and fluency, and learning implicit properties like phonetics etc.
I’m commenting on the “zoom” tool thing. If it needs to know how many ‘e’s are in strawberry, it calls the zoom(strawberry) tool, and it returns “s” “t” “r” “a” “w” “b” “e” “r” “r” “y”. And it knows what an ‘e’ is, so it can count it. The above is just demonstrating that models know how to reason about single-letter tokens.
Edit: reading above I see that my first comment could be interpreted as saying we should just do character-only tokenization during inference. This isn’t what I was suggesting.
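To make the suggestion concrete: the “zoom” tool here could literally be a one-liner that spells the word out, plus (optionally) a counting helper; the names and pairing below are my own hypothetical illustration, not an existing API:

```python
def zoom(word: str) -> str:
    """Hypothetical 'zoom' tool: return the word spelled out character by character,
    so each letter lands in its own token instead of hiding inside one BPE."""
    return " ".join(word)

def count_letter(word: str, letter: str) -> int:
    """Helper the model (or the tool runtime) could call instead of counting in-context."""
    return word.lower().count(letter.lower())

print(zoom("strawberry"))               # s t r a w b e r r y
print(count_letter("strawberry", "e"))  # 1
```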
> reading above I see that my first comment could be interpreted as saying we should just do character-only tokenization during inference. This isn’t what I was suggesting.

Yes, that is what I took you to mean.
Anyway, to address your revised claim & example: this may work in that specific simple ‘strawberry’ task. The LLM has enough linguistic knowledge to interpret that task, express it symbolically, and handle it appropriately with its tools, like a ‘zoom tool’ or a Python REPL.
However, this is still not a general solution, because in many cases, there either is no such tool-using/symbolic shortcut or the LLM would not feasibly come up with it. Take the example of the ‘cuttable tree’ in Pokemon: is there a single pixel which denotes being cuttable? Maybe there is, in which case it could be attacked using ImageMagick, analogous to calling out to Python to do simple string manipulation; maybe not (I’m not going to check). But if there is and it’s doable, then how does the LLM know which pixel to check for what value?
This is not how I learned to cut trees in Pokemon, and it’s definitely not how I learned to process images in general, and if I had to stumble around calling out to Python every time I saw something new, my life would not be going well.
I don’t know about this. Most of my perception is higher order, kind of tokenized, wrt words, text, vision, sound.
I can “pay close attention” if I want to see stuff on a character level / pixel level.
Seems like integrated enough zoom tools would work like this. And “paying attention” is fully general in the human case.
Again, they cannot, because this is not even a well-defined task or feature to zoom in on. Most tasks and objects of perception do not break down as cleanly as “the letter ‘e’ in ‘s t r a w b e r r y’”. Think about non-alphabetical languages, say, or non-letter non-binary properties. (What do you ‘zoom in on’ to decide if a word rhymes enough with another word to be usable in a poem? “Oh, I’d just call out using a tool to a rhyming dictionary, which was compiled by expert humans over centuries of linguistic analysis.” OK, but what if, uh, there isn’t such an ‘X Dictionary’ for all X?)
The issue is that tokenization masks information that is sometimes useful. Like how a token masks which letters are in a word. Or, with non-alphabetic languages like hanzi, I guess the visual features of the characters.
With stuff like rhyming, the issue isn’t tokenization, it’s that the information was never there in the first place. Like “read” vs. “bead”: no amount of zooming will tell you they don’t rhyme. Pronunciation is “extra stuff” tagged onto language.
So the “general purpose” of the zoom tool should just be to make all the information legible to the LLM. I think this is general and well defined. Like with images you could have it be just a literal zoom tool.
It will be useful in cases where the full information is not economical to represent to an LLM in context by default, but where that information is nevertheless sometimes useful.
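For the image case, the “literal zoom tool” could be as simple as crop-and-upsample; a hypothetical sketch using Pillow, where the function name, region, and scale factor are all made up for illustration:

```python
from PIL import Image

def zoom(image: Image.Image, box: tuple[int, int, int, int], scale: int = 4) -> Image.Image:
    """Hypothetical literal zoom tool: crop a region of interest and upsample it,
    so detail that a lossy image encoding would blur away stays legible."""
    left, top, right, bottom = box
    return image.crop(box).resize(((right - left) * scale, (bottom - top) * scale),
                                  Image.Resampling.NEAREST)

# Usage sketch: hand the model a 4x blow-up of the tile it asked about.
# screenshot = Image.open("screenshot.png")      # hypothetical input
# detail = zoom(screenshot, (96, 64, 128, 96))   # hypothetical coordinates
```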
> With stuff like rhyming, the issue isn’t tokenization, it’s that the information was never there in the first place.

Wrong. Spelling reflects pronunciation to a considerable degree. Even a language like English, which is regarded as quite pathological in terms of how well the spelling of words reflects their pronunciation, still maps closely, which is why https://en.wikipedia.org/wiki/Spelling_pronunciation is such a striking and notable phenomenon when it happens.
Shape perception and other properties like color are also global, not reducible to single pixels.
No offense, but I feel you’re not being very charitable or even really trying to understand what I mean when I say things.
Like, I know letters carry information about how to pronounce words; that seems so obvious to me that I wouldn’t have thought it needed to be stated explicitly. I’m just saying they don’t carry all the information. Do you disagree with this? I thought it would be clear from the example I brought up that this is what I’m saying.
This is just a feeling, but it seems like human-style looking closer is different from using a tool. Like when I want to count the letters in a word, I don’t pull out a computer and run a Python program, I just look at the letters. What LLMs are doing seems different, since they can’t see the letters and can’t really ‘take another look’ (attention runs in parallel). Although reasoning sometimes works like taking another look.
It’s not entirely clear to me. An LLM’s most immanent and direct action is outputting tokens. And you can have tool calls with single tokens.
I think you can train LLMs to use tools in a way where they’re best thought of as a human moving their arm or focusing their eyes somewhere.
I don’t know if it can reach the same level of integration as human attention, but again, I think that’s not really what we need here.
Generally it would not have to go through the hassle of invoking a tool in well-formed JSON; rather, an inference pipeline could catch it (a special token plus a command, perhaps?) and then replicate the next word or so character by character, so that the model’s output looks like this (| marks token boundaries):
> |Wait, |%INLINE_COMMAND_TOKEN%|zoom(|”s|trawberry|”)| is |”|s|t|r|a|w|b|e|r|r|y|”|, so|...
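A rough text-level sketch of that kind of inference-pipeline hook (in a real decoder it would act token by token on the stream rather than on finished text, and the command syntax here is invented to match the example above):

```python
import re

# Hypothetical inline-command syntax; in practice this would be a reserved special token.
ZOOM_CALL = re.compile(r'%INLINE_COMMAND_TOKEN%zoom\("([^"]+)"\)')

def expand_zoom_commands(text: str) -> str:
    """Splice the character-by-character expansion into the stream wherever the
    model emitted the inline zoom command, then hand control back to the model."""
    def _expand(match: re.Match) -> str:
        word = match.group(1)
        spelled = " ".join(word)
        return f'zoom("{word}") is "{spelled}"'
    return ZOOM_CALL.sub(_expand, text)

print(expand_zoom_commands('Wait, %INLINE_COMMAND_TOKEN%zoom("strawberry")'))
# -> Wait, zoom("strawberry") is "s t r a w b e r r y"
```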