I don’t think the Cyrillic text would map to any common tokens, since the output is essentially a substitution cipher whose key is the keyboard layout mapping.
I don’t understand. Surely it has been exposed to training resources that contain, say, Serbian, which is written in both Latin and Cyrillic. And more relevant: news articles that have transliterations of Anglophone celebrity names and places:
Дэвід Бекхэм (David Beckham)
Стенлі Кубрик (Stanley Kubrick)
Лінкольншир (Lincolnshire)
Why wouldn’t these map to common tokens?
The examples you gave are indeed transliterations. The Cyrillic text I’m talking about is actually nonsensical. Consider the reverse: if I mistakenly tried typing “істина” (truth) on a QWERTY keyboard, the result is “scnbyf”.
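To make the substitution-cipher point concrete, here is a rough Python sketch of the key-for-key mapping. It assumes the standard Ukrainian ЙЦУКЕН layout against US QWERTY and only covers lowercase letter keys; the names `to_qwerty` and `to_cyrillic` are made up for illustration.

```python
# Keys in the same physical positions on the two layouts (assumed:
# US QWERTY and the standard Ukrainian layout, lowercase only).
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
UKRAINIAN = "йцукенгшщзхїфівапролджєячсмитьбю"

# Translation tables in both directions, one character per key.
UKR_TO_QWERTY = str.maketrans(UKRAINIAN, QWERTY)
QWERTY_TO_UKR = str.maketrans(QWERTY, UKRAINIAN)


def to_qwerty(text: str) -> str:
    """What you get if you type Ukrainian text with QWERTY active."""
    return text.lower().translate(UKR_TO_QWERTY)


def to_cyrillic(text: str) -> str:
    """What you get if you type Latin text with the Ukrainian layout active."""
    return text.lower().translate(QWERTY_TO_UKR)


if __name__ == "__main__":
    print(to_qwerty("істина"))   # scnbyf
    print(to_cyrillic("truth"))  # екгер
```

Unlike a transliteration, nothing here depends on how a letter sounds, only on where its key sits, which is why the output looks like noise in either direction.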
Interesting, it would be fun to try it with the Claude tokenizer.