2018-06-21

Converting pinyin tone marks

I don't think I have ever seen written rules for where to place a tone mark in pinyin. They are not difficult, but I had to look through a lot of transliterations to make sure I understood them.

Rule 0: A single vowel takes the tone mark. Examples: ǎ, ǎn, ǎng, é, én, éng, ér, ò, òng, ī, īn, īng, ǔ, ǔn, ǜ, ǜn
Rule 1: 'a' and 'e' are dominant vowels. They always take the tone mark. Examples: ái, áo, iá, ián, iáng, iáo, uá, uái, uán, uáng, üán; èi, iè, uè
Rule 2: 'o' is a medium vowel. It will take a tone mark unless paired with 'a'. Examples: ǒu, iǒ, iǒng, uǒ
Rule 3: If weak vowels 'i' and 'u' are paired, the second vowel takes the tone mark. Examples: iú, uí
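
As a rough illustration, the four rules can be written as a short Python function that returns the position of the vowel to mark. The function name `tone_vowel_index` is made up for this sketch, and the input is assumed to be one valid lowercase syllable without its tone digit.

```python
def tone_vowel_index(syllable: str) -> int:
    """Return the index of the vowel that takes the tone mark (Rules 0-3)."""
    vowels = [i for i, ch in enumerate(syllable) if ch in "aeiouü"]
    if not vowels:
        raise ValueError(f"no vowel in {syllable!r}")
    # Rule 0: a single vowel takes the mark.
    if len(vowels) == 1:
        return vowels[0]
    # Rule 1: 'a' and 'e' are dominant (assumed never to occur together).
    for i in vowels:
        if syllable[i] in "ae":
            return i
    # Rule 2: 'o' takes the mark when there is no 'a' (or 'e').
    for i in vowels:
        if syllable[i] == "o":
            return i
    # Rule 3: only weak vowels are left, so the second one takes the mark.
    return vowels[-1]


print(tone_vowel_index("xiao"))  # 2, the 'a'
print(tone_vowel_index("liu"))   # 2, the 'u'
print(tone_vowel_index("gui"))   # 2, the 'i'
```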

I prefer to see tone marks as vowel accents. If you want to convert pinyin syllables with trailing digits (from CEDICT, for example), or you find it easier to type them with digits, it would be nice to have code that converts them automatically. I wrote a Vim macro to do this. While writing the macro, I went through all the valid combinations in Mandarin and realized that the rules above can be restated in an even simpler form:

For syllables ending in { -ao, -ai, -ei, -ou }, the next-to-last vowel takes the tone mark. Examples: ái, áo, éi, óu, iáo, uái
For anything else, the last vowel takes the tone mark. Examples: ǎ, ǎn, ǎng, é, én, éng, ér,
ī, īn, īng, iá, ián, iáng, iè, iǒ, iǒng,
ò, òng, ǔ, ǔn, ǜ, ǜn, üán,
uè, uǒ, uá, uán, uáng, iú, uí
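
Here is a minimal Python sketch of the two-case rule, not the Vim macro itself. It assumes one syllable at a time with a trailing digit 1-5, and it builds the accented letter from a Unicode combining mark instead of a lookup table. The names `mark_syllable`, `COMBINING` and `PENULTIMATE_ENDINGS` are invented for this example.

```python
import unicodedata

# Combining diacritics for tones 1-4; tone 5 (neutral) adds nothing.
COMBINING = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300", "5": ""}
PENULTIMATE_ENDINGS = ("ao", "ai", "ei", "ou")


def mark_syllable(numbered: str) -> str:
    """Convert a numbered syllable such as 'xiao3' to 'xiǎo'."""
    body, tone = numbered[:-1], numbered[-1]
    if tone not in COMBINING:
        raise ValueError(f"expected a trailing tone digit 1-5 in {numbered!r}")
    # Accept the two alternate spellings of ü.
    body = body.replace("u:", "ü").replace("v", "ü")
    vowels = [i for i, ch in enumerate(body) if ch in "aeiouü"]
    if not vowels:
        raise ValueError(f"no vowel in {numbered!r}")
    # Case 1: -ao, -ai, -ei, -ou mark the next-to-last vowel.
    # Case 2: everything else marks the last vowel.
    target = vowels[-2] if body.endswith(PENULTIMATE_ENDINGS) else vowels[-1]
    marked = body[: target + 1] + COMBINING[tone] + body[target + 1 :]
    # NFC folds the letter and the combining mark into one character.
    return unicodedata.normalize("NFC", marked)


print(mark_syllable("xiao3"))  # xiǎo
print(mark_syllable("guai4"))  # guài
print(mark_syllable("lu:4"))   # lǜ
```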

In code this can be expressed as 19 endings × 5 tones = 95 text substitutions. I have added a couple of variations below to handle alternate ways of entering the ü vowel.
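
The substitution list does not have to be typed by hand. The Python sketch below generates the same pairs and applies them in order with plain string replacement. It is only an approximation of what a macro might do; `build_table` and `convert` are names made up for this example, and the input is assumed to contain nothing but numbered pinyin.

```python
TONED = {
    "a": "āáǎàa",
    "e": "ēéěèe",
    "i": "īíǐìi",
    "o": "ōóǒòo",
    "u": "ūúǔùu",
    "ü": "ǖǘǚǜü",
}
# The four penultimate-vowel endings are listed first so that "ao1" is
# replaced before the shorter pattern "o1" gets a chance to match.
ENDINGS = [
    "ao", "ai", "ei", "ou",
    "ang", "eng", "ing", "ong", "an", "en", "in", "un", "er",
    "a", "e", "i", "o", "u", "ü",
]
ALIASES = {"u:": "ü", "v": "ü"}  # alternate ways of typing ü


def build_table() -> list[tuple[str, str]]:
    """Return (pattern, replacement) pairs in the order they should run."""
    table = []
    for ending in ENDINGS:
        for tone in range(1, 6):
            # In every ending, the marked vowel is its first letter.
            marked = TONED[ending[0]][tone - 1] + ending[1:]
            table.append((f"{ending}{tone}", marked))
    for alias, vowel in ALIASES.items():
        for tone in range(1, 6):
            table.append((f"{alias}{tone}", TONED[vowel][tone - 1]))
    return table


def convert(text: str) -> str:
    """Apply every substitution in table order."""
    for pattern, replacement in build_table():
        text = text.replace(pattern, replacement)
    return text


print(len(build_table()))            # 105: the 95 pairs plus 10 for "u:" and "v"
print(convert("ni3 hao3"))           # nǐ hǎo
print(convert("zhong1 guo2 ren2"))   # zhōng guó rén
```

One limitation of this sketch: a "u:" or "v" that is followed by another vowel, as in "lu:e4", is left untouched because the digit is consumed by the "e4" pattern. A final pass that replaces the remaining aliases with ü would cover that case.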

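In four cases, mark the penultimate vowel. Apply these first, because each one also ends in a pattern from the second group ("ao1" contains "o1", for example).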
"ao1" → "āo" "ao2" → "áo" "ao3" → "ǎo" "ao4" → "ào" "ao5" → "ao"
"ai1" → "āi" "ai2" → "ái" "ai3" → "ǎi" "ai4" → "ài" "ai5" → "ai"
"ei1" → "ēi" "ei2" → "éi" "ei3" → "ěi" "ei4" → "èi" "ei5" → "ei"
"ou1" → "ōu" "ou2" → "óu" "ou3" → "ǒu" "ou4" → "òu" "ou5" → "ou"
For anything else, mark the last vowel.
"ang1" → "āng" "ang2" → "áng" "ang3" → "ǎng" "ang4" → "àng" "ang5" → "ang"
"eng1" → "ēng" "eng2" → "éng" "eng3" → "ěng" "eng4" → "èng" "eng5" → "eng"
"ing1" → "īng" "ing2" → "íng" "ing3" → "ǐng" "ing4" → "ìng" "ing5" → "ing"
"ong1" → "ōng" "ong2" → "óng" "ong3" → "ǒng" "ong4" → "òng" "ong5" → "ong"
"an1" → "ān" "an2" → "án" "an3" → "ǎn" "an4" → "àn" "an5" → "an"
"en1" → "ēn" "en2" → "én" "en3" → "ěn" "en4" → "èn" "en5" → "en"
"in1" → "īn" "in2" → "ín" "in3" → "ǐn" "in4" → "ìn" "in5" → "in"
"un1" → "ūn" "un2" → "ún" "un3" → "ǔn" "un4" → "ùn" "un5" → "un"
"er1" → "ēr" "er2" → "ér" "er3" → "ěr" "er4" → "èr" "er5" → "er"
"a1" → "ā" "a2" → "á" "a3" → "ǎ" "a4" → "à" "a5" → "a"
"e1" → "ē" "e2" → "é" "e3" → "ě" "e4" → "è" "e5" → "e"
"i1" → "ī" "i2" → "í" "i3" → "ǐ" "i4" → "ì" "i5" → "i"
"o1" → "ō" "o2" → "ó" "o3" → "ǒ" "o4" → "ò" "o5" → "o"
"u1" → "ū" "u2" → "ú" "u3" → "ǔ" "u4" → "ù" "u5" → "u"
"ü1" → "ǖ" "ü2" → "ǘ" "ü3" → "ǚ" "ü4" → "ǜ" "ü5" → "ü"
"u:1" → "ǖ" "u:2" → "ǘ" "u:3" → "ǚ" "u:4" → "ǜ" "u:5" → "ü"
"v1" → "ǖ" "v2" → "ǘ" "v3" → "ǚ" "v4" → "ǜ" "v5" → "ü"