Writing Russian and Ukrainian words in Latin script

I wanted to figure out how to properly write Russian and Ukrainian words using the Latin alphabet, but it turns out there are a dozen different standards for Russian and half a dozen for Ukrainian.

My original plan was to post two short algorithms in pseudocode, along with some commentary. I guess it will be only the commentary instead. I hope some of you may find some information here interesting anyway.

*

Disclaimer: I am not a native speaker of either of these languages. This article may contain mistakes, and I would be thankful if you pointed them out in the comments. I learned Russian at school decades ago; only a rusty passive knowledge remains. My knowledge of Ukrainian is limited to the first two lessons on Duolingo. (My native language is Slovak, which is also a Slavic language, but written in Latin script.)

*

The basic strategic choice for conversion between different scripts is the following: would you rather preserve the minutiae of the written version (and sacrifice some information about pronunciation, if necessary), or would you rather preserve the pronunciation (and mostly throw away the original written form)?

As a specific example, consider the Russian word for milk: “молоко”. In written form, all three of its vowels are the same. When pronounced, the stress is on the last syllable, and a stressed “о” is generally pronounced differently from an unstressed “о”. The entire word sounds kinda like “mah luck core”; the first two vowel sounds are the same, and the third one is different.

A system trying to preserve the written form might write this word as “moloko”, while a system trying to preserve pronunciation might write it as “malako” instead.
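To make the difference concrete, here is a rough sketch in Python of the one pronunciation rule mentioned above: an unstressed “о” sounds like “а”. The function name and the `stressed_vowel` parameter are made up for this illustration, the rule is a crude simplification of real Russian vowel reduction, and it only handles lowercase input. The important part is the second argument: the spelling does not tell you where the stress is, so the caller has to know it.

```python
RUSSIAN_VOWELS = "аеёиоуыэюя"

def reduce_unstressed_o(word, stressed_vowel):
    """Respell unstressed 'о' as 'а' (a crude approximation of the pronunciation).

    stressed_vowel is the 1-based position of the stressed vowel among the
    vowels of the word; the caller must supply it, the spelling does not.
    """
    out = []
    vowel_no = 0
    for ch in word:
        if ch in RUSSIAN_VOWELS:
            vowel_no += 1
            if ch == "о" and vowel_no != stressed_vowel:
                ch = "а"
        out.append(ch)
    return "".join(out)

print(reduce_unstressed_o("молоко", 3))  # малако
```

The result “малако” is what a pronunciation-preserving system would then write as “malako”; a spelling-preserving system skips this step entirely.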

(This is not unique to Cyrillic, by the way. You face a similar dilemma e.g. when trying to write Japanese syllables “た ち つ て と”. Would you rather preserve the sound, and write them as “ta, chi, tsu, te, to”, or follow the internal logic of hiragana which insists that this is the same consonant, and write them as “ta, ti, tu, te, to”?)

Both options have their advantages and disadvantages. It also depends on the audience. People who do not speak the language and do not plan to learn it have no reason to care about its orthography; they just want to pronounce the weird scribbles. On the other hand, people who are already used to reading that language may feel very uncomfortable seeing it written in a way that violates everything they learned about its orthography.

(To explain this feeling to a native English speaker, consider how you feel about various proposals to simplify English orthography. But from the technical perspective, those are merely conversions from Latin, to sound, to Latin again. You are doing an analogous thing when you convert from some other script, to sound, to Latin.)

*

The strategy of preserving the spelling is called transliteration. You ignore the pronunciation completely and mostly just create a table of how the characters from the source script are converted to characters (or sequences of characters) in the target script; bonus points if you can afterwards unambiguously revert them back.

The Latin alphabet was designed for the Latin language of Ancient Rome, and is a less perfect fit for some other languages that use it, which often need to express more sounds than Latin had. In general, there are two approaches: the extra sounds can be expressed by sequences of characters (such as “sh” in English), or the alphabet can be extended using accent marks (such as “š”, “ś”, “ş”, or “ș” in different languages). When transliterating Cyrillic, the latter approach is closer to the 1:1 ideal. (Note that there are Slavic languages which already use Latin characters, so you can use their conventions instead of designing a new one. They often disagree with each other, though.)

Advantages of transliteration:

  • can be implemented by a very simple algorithm;

  • you can convert words without knowing their pronunciation;

  • people who spent years learning the orthography of the language will not feel like you are disrespecting their great sacrifice. (Probably the most important, socially!)

Disadvantages of transliteration:

  • you need to learn separately how the words are actually pronounced, otherwise you may not recognize them when watching TV.

This still provides a lot of space for bikeshedding, so we have several competing standards. (Relevant Wikipedia pages: 1, 2, 3.) Here are Russian and Ukrainian alphabets, side by side, with a Latin character where the transliteration is straightforward, or an asterisk where it requires further commentary:

ru:    А Б В Г   Д Е Ё   Ж З И     Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
       а б в г   д е ё   ж з и     й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
uk:    А Б В Г Ґ Д Е   Є Ж З И І Ї Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ     Ь   Ю Я ʼ
       а б в г ґ д е   є ж з и і ї й к л м н о п р с т у ф х ц ч ш щ     ь   ю я ʼ
Latin: a b v * g d e * * * z * * * * k l m n o p r s t u f * * * * * * y * * * * *

(Yes, there is a controversy even about whether the apostrophe should be transliterated to Latin as an apostrophe, or as a quote mark! Nothing is ever simple.)

Well, half of the transliteration is unambiguous, that is a good start.
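Here is what that unambiguous half looks like as a minimal Python sketch. The mapping contains only the letters that have a Latin character in the table above; every letter marked with an asterisk is deliberately passed through unchanged. The names `UNAMBIGUOUS` and `transliterate_unambiguous` are made up for this example.

```python
# Only the letters the table above marks as straightforward.
UNAMBIGUOUS = dict(zip("абвґдезклмнопрстуфы", "abvgdezklmnoprstufy"))

def transliterate_unambiguous(text):
    """Convert the straightforward letters; leave everything else as it is."""
    out = []
    for ch in text:
        latin = UNAMBIGUOUS.get(ch.lower())
        if latin is None:
            out.append(ch)             # asterisk letters, spaces, punctuation
        elif ch.isupper():
            out.append(latin.upper())  # keep capitalization
        else:
            out.append(latin)
    return "".join(out)

print(transliterate_unambiguous("молоко"))  # moloko
print(transliterate_unambiguous("Москва"))  # Moskva
print(transliterate_unambiguous("щука"))    # щuka
```

The last line shows the problem: as soon as a word contains one of the asterisk letters, you have to pick a standard.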

A part of the complication is that different languages use Cyrillic slightly differently. Thus, tables for Russian only, or Ukrainian only, would be simpler.

  • Russian “г” = Ukrainian “ґ” = Latin “g”.

  • Ukrainian “г” (a sound that does not exist in Russian) = Latin “h”.

  • Russian “е” = Ukrainian “є”. Ukrainian “е” = Russian “э”.

  • Russian “и” = Ukrainian “і”. Ukrainian “и” = Russian “ы”.

  • Russian “ъ” is written as an apostrophe in Ukrainian.

The table would be slightly simpler if we rearranged it accordingly.
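In code, the simplest way to handle this is to keep one table per language instead of one table per script. The following Python sketch covers only the letters discussed in the list above plus the shared straightforward ones. The Latin “i” for Russian “и” and Ukrainian “і” is my assumption (the table above leaves them as asterisks), and all names are made up for illustration.

```python
# Letters that mean the same thing in both languages (a subset).
COMMON = dict(zip("абвдзклмнопрстуф", "abvdzklmnoprstuf"))

# The same Cyrillic letter can need a different Latin letter per language.
RUSSIAN = {**COMMON, "г": "g", "ы": "y", "и": "i"}
UKRAINIAN = {**COMMON, "ґ": "g", "г": "h", "и": "y", "і": "i"}

def transliterate(word, table):
    """Look up each lowercase letter; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in word)

print(transliterate("гора", RUSSIAN))    # gora
print(transliterate("гора", UKRAINIAN))  # hora
print(transliterate("рыба", RUSSIAN))    # ryba
print(transliterate("риба", UKRAINIAN))  # ryba
```

The word “гора” is spelled identically in both languages but comes out differently, while “рыба” and “риба” are spelled differently but come out the same. So the converter needs to be told which language it is reading.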

Another part is that languages that use Latin disagree about the proper way to write the sound “y” (as in “yes”). Many languages that use Latin would write it “j”, but many other languages use “j” for very different sounds. English uses “y”; and other languages at least use “y” for similar sounds, so good arguments can be made in support of either. One could also argue that “y” is just a shorter version of “i” (as in “hit”). Thus, for a completely unambiguous Cyrillic letter “й” we now have three proposed equivalents in Latin.

This complicates transliteration more than you might expect, because the “й” sound has a special role in some Slavic languages. It can result in palatalization of a preceding consonant in the word.

I am not sure I can explain in text what palatalization is to a native English speaker; you would need to hear actual examples. The idea is that in Slavic languages some consonants have two pronunciations: the normal one (called “hard”), and one with your tongue moved closer to the roof of your mouth (called “soft”). I am not a linguist, but I suspect that the “soft” form has historically evolved from the “hard” form followed by the “й” sound. This is reflected in Cyrillic by some vowels having an alternate form where they are preceded by “й”, which optionally becomes a softener of the previous consonant. For example, “я” = “й” + “а” (“ya”), “ю” = “й” + “у” (“you”), “ё” = “й” + “о”, Russian “е” = “й” + “э”, Ukrainian “є” = “й” + “е”. The point of this digression is that if you have a disagreement about how to transliterate “й”, you automatically also have a disagreement on how to transliterate “є”, “ё”, “ю”, “я”.
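The way one decision propagates can be shown in a few lines of Python. This is only a sketch: the function name `iotated_table` is made up, and it assumes the plain vowels are written “a”, “u”, “o”, “e” as in the table above.

```python
def iotated_table(glide):
    """Derive the Latin spelling of the 'й + vowel' letters from the choice for 'й'."""
    return {
        "й": glide,
        "я": glide + "a",  # й + а
        "ю": glide + "u",  # й + у
        "ё": glide + "o",  # й + о
        "є": glide + "e",  # й + е (Ukrainian)
    }

for glide in ("j", "y", "i"):
    print(glide, iotated_table(glide))
# j {'й': 'j', 'я': 'ja', 'ю': 'ju', 'ё': 'jo', 'є': 'je'}
# y {'й': 'y', 'я': 'ya', 'ю': 'yu', 'ё': 'yo', 'є': 'ye'}
# i {'й': 'i', 'я': 'ia', 'ю': 'iu', 'ё': 'io', 'є': 'ie'}
```

One choice for “й” changes five rows of the table at once.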

Should we transliterate the consonants “ц”, “ч”, “ш”, “щ” following the English rules as “ts”, “ch”, “sh”, “shch”, or following the Czech rules as “c”, “č”, “š”, “šč”? (Note that even the Czechs cannot express “щ” as a single symbol.)
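The choice matters for the “bonus points” of being able to revert the conversion. A small Python sketch, using only the two conventions just mentioned plus the plain letters “т” and “с” (all names are made up for this example):

```python
ENGLISH_STYLE = {"ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch", "т": "t", "с": "s"}
CZECH_STYLE = {"ц": "c", "ч": "č", "ш": "š", "щ": "šč", "т": "t", "с": "s"}

def transliterate(text, table):
    """Look up each lowercase letter; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

# "ц" versus the two-letter sequence "тс":
print(transliterate("ц", ENGLISH_STYLE), transliterate("тс", ENGLISH_STYLE))  # ts ts
print(transliterate("ц", CZECH_STYLE), transliterate("тс", CZECH_STYLE))      # c ts

# "щ" versus the two-letter sequence "шч":
print(transliterate("щ", ENGLISH_STYLE), transliterate("шч", ENGLISH_STYLE))  # shch shch
print(transliterate("щ", CZECH_STYLE), transliterate("шч", CZECH_STYLE))      # šč šč
```

With the English-style digraphs, “ц” and “тс” become indistinguishable once written in Latin; the Czech-style letters keep them apart. For “щ” neither convention helps, because both spell it as a sequence of two other letters.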

Then you have the “soft symbol” and “hard symbol”, which have no direct equivalent in Latin, and their role is to regulate palatalization of the previous consonant. If you want to palatalize a consonant, you normally do it by following it with “я”, “ю”, etc. instead of the usual “а”, “у”, etc. But what if it is followed by another consonant, or is at the end of the word? In that case, you put the “soft symbol” after the consonant. On the other hand, if you want to prevent accidental palatalization where a word stem ending with a consonant is joined with a word stem beginning with “я”, “ю”, etc., you put the “hard symbol” between them.

Here the problem is that the soft symbol in Russian was traditionally transliterated to Latin as an apostrophe. But Ukrainian uses the apostrophe (in Cyrillic) as a hard symbol. :(

*

The strategy of preserving the pronunciation is called transcription.

Advantages of transcription:

  • it matches how people actually say the words, the way you hear them on TV.

Disadvantages of transcription:

  • you need to actually speak the language in order to transcribe it correctly;

  • you will have different transcriptions to English, to German, etc.; a different system for each combination of source language and target language;

  • unless everyone uses the English transcription, which simply does not make sense for speakers of other languages (they need to mentally revert the English-specific rules, which existed neither in the original language, nor in their language).

*

Summary: It’s complicated; I am giving up. If anyone criticizes you for writing something incorrectly, you may send them a link to this article.