CstineSublime comments on koanchuk’s Shortform

CstineSublime 26 Jan 2026 2:24 UTC
I don’t think the Cyrillic text would map to any common tokens, since the output is essentially the result of a substitution cipher, the key being the keyboard mappings.
I don’t understand. Surely it has been exposed to training data that contains, say, Serbian, which is written in both Latin and Cyrillic. More relevant still are news articles that transliterate Anglophone celebrity names and place names:

Дэвід Бекхэм (David Beckham)
Стенлі Кубрик (Stanley Kubrick)
Лінкольншир (Lincolnshire)

Why wouldn’t these map to common tokens?
- koanchuk 26 Jan 2026 2:45 UTC
  The examples you gave are indeed transliterations. The Cyrillic text I’m talking about is actually nonsensical. Consider the reverse: if I mistakenly typed “істина” (“truth”) with the keyboard set to a QWERTY layout, the result would be “scnbyf”.
  - CstineSublime 26 Jan 2026 7:14 UTC
    Interesting, it would be fun to try it with the Claude tokenizer.
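
For anyone who wants to generate such strings before feeding them to a tokenizer, the layout mix-up koanchuk describes can be sketched as a character-for-character substitution. This is only a minimal sketch: it assumes the standard Ukrainian ЙЦУКЕН layout, covers just the lowercase letter keys that sit on the 26 QWERTY letter positions, and the names `to_qwerty` and `to_cyrillic` are made up for illustration.

```python
# Assumption: standard Ukrainian layout. Each Cyrillic letter is paired with
# the QWERTY letter found on the same physical key (three letter rows only).
UKRAINIAN = "йцукенгшщз" + "фівапролд" + "ячсмить"
QWERTY = "qwertyuiop" + "asdfghjkl" + "zxcvbnm"

# Translation tables in both directions; characters outside the tables
# (uppercase, punctuation, letters such as х, ї, ж, є, б, ю) pass through unchanged.
_TO_QWERTY = str.maketrans(UKRAINIAN, QWERTY)
_TO_CYRILLIC = str.maketrans(QWERTY, UKRAINIAN)


def to_qwerty(text: str) -> str:
    """Return what appears if Ukrainian text is typed with a QWERTY layout active."""
    return text.translate(_TO_QWERTY)


def to_cyrillic(text: str) -> str:
    """Return what appears if English text is typed with the Ukrainian layout active."""
    return text.translate(_TO_CYRILLIC)


if __name__ == "__main__":
    print(to_qwerty("істина"))    # the example from the thread
    print(to_cyrillic("scnbyf"))  # the reverse direction
```

Because the mapping is one-to-one on these keys, applying `to_cyrillic` to English text yields exactly the kind of nonsensical Cyrillic under discussion, as opposed to a transliteration.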