plex comments on The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs’ undecoded outputs

plex 19 Oct 2025 8:35 UTC
2 points
0
The post shows 32 words being tested. All 16 positive-valence words have higher weight with normal tokenization, and all 16 low-valence words have higher weight with unusual tokenization. This is extremely improbable by raw chance.
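To put a rough number on "extremely improbable", here is a minimal sketch. It assumes (this is my assumption, not something the post states) that each word is an independent coin flip, with a 50% chance of getting the higher weight under normal tokenization.

```python
from fractions import Fraction

N_WORDS = 32  # 16 positive-valence + 16 low-valence words, per the comment

# Assumption (not from the post): each word independently has a 50% chance
# of getting the higher weight under normal tokenization.
p_specific_split = Fraction(1, 2) ** N_WORDS   # exactly the observed pattern
p_either_direction = 2 * p_specific_split      # observed pattern or its mirror image

print(f"specific split:   1 in {p_specific_split.denominator:,}")    # 1 in 4,294,967,296
print(f"either direction: 1 in {p_either_direction.denominator:,}")  # 1 in 2,147,483,648
```

The figure only means something if the independence assumption holds; if the 32 outcomes share a common cause, it can be far too small.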
- gwern 19 Oct 2025 17:00 UTC
  2 points
  0
  There’s no reason whatsoever to think that they are independent of each other. The very fact that you can classify them systematically as ‘positive’ or ‘negative’ valence indicates they are not, and you don’t know what ‘raw chance’ yields here. It might be quite probable.
  - plex 19 Oct 2025 18:15 UTC
    2 points
    0
    Right, I think that’s the hypothesis I was asking whether you had when I said:
    ‘unless you mean the randomness was in an underlying valence factor?’
    If so, yeah, this is compatible. I’d still put notably higher odds on the original thing I suggested, but this is the other main hypothesis.
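
The "underlying valence factor" hypothesis from the thread can be illustrated with a toy model. This is a sketch under invented assumptions, not anything measured in the post: a single shared factor picks a direction at random, and each word independently deviates from that direction with probability `flip_prob` (a hypothetical parameter).

```python
def p_perfect_split(n_words: int, flip_prob: float) -> float:
    """Probability that every word lines up with its valence, in either direction.

    Hypothetical model: one shared factor picks a direction at random, and each
    word independently deviates from that direction with probability flip_prob.
    """
    agree = 1.0 - flip_prob
    # Either every word follows the shared direction, or every word deviates
    # from it (which produces the mirror-image perfect split).
    return agree ** n_words + flip_prob ** n_words


# flip_prob = 0.5 reproduces the fully independent coin-flip case (2**-31).
for flip_prob in (0.5, 0.1, 0.01, 0.0):
    print(f"flip_prob={flip_prob}: P(perfect split) = {p_perfect_split(32, flip_prob):.3g}")
```

With `flip_prob = 0.5` the words are effectively independent and a perfect split is about a one-in-two-billion event; with `flip_prob = 0.01` it happens roughly 72% of the time. So how surprising the 32-for-32 pattern is depends on how strongly the words are coupled, which the data in the thread does not pin down.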