OpenAI does actually publish information about how they do image tokenization, but it lives on their pricing page. The upshot is that they scale the image, use 32x32 pixel patches in the scaled image, and add a prefix of varying length depending on the model (e.g. 85 tokens for 4o, 75 for o3). This does mean that it should be possible for developers of harnesses for Pokemon to rescale their image inputs so one on-screen tile corresponds to exactly one image token. Likewise for the ARC puzzles.
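To make that concrete, here's a minimal sketch of the arithmetic, assuming the scheme described above (32x32 patches after scaling, plus a fixed per-model prefix); the model-name strings and the ceiling rounding are my assumptions, not from the pricing page:

```python
from math import ceil

PATCH = 32                        # patch size in pixels, per the pricing page
PREFIX = {"gpt-4o": 85, "o3": 75}  # per-model prefix tokens quoted above

def image_tokens(width: int, height: int, model: str) -> int:
    """Rough token estimate for an image at its post-scaling size."""
    patches = ceil(width / PATCH) * ceil(height / PATCH)  # rounding is an assumption
    return PREFIX[model] + patches

# Game Boy screen: 160x144 pixels, built from 8x8 tiles (20x18 tiles).
# Upscaling by 4x makes each on-screen tile exactly one 32x32 patch.
gb_w, gb_h = 160 * 4, 144 * 4                 # 640 x 576
print(image_tokens(gb_w, gb_h, "gpt-4o"))     # 20*18 = 360 patches + 85 = 445
```

So a 4x upscale of the Game Boy framebuffer would give the model one image token per tile, which is the alignment a Pokemon harness would want.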
Thanks! This is extremely helpful. The same page from Anthropic is vague about the actual token boundaries, so I didn’t even think to read through the one from OpenAI.
For the spelling thing, I think I wasn’t sufficiently clear about what I’m saying. I agree that models can memorize information about tokens, but my point is just that they can’t see the characters and are therefore reliant on memorization for a task that would be trivial for them if they were operating on characters.