When a word is borrowed from one language into another, both its written form and its pronunciation change. When the two languages have very different phonological and writing systems, the closest available sounds and characters are substituted for the original ones.
In Korean, the form of a borrowed word depends on both the spelling and the pronunciation of the original word. In particular, vowels are often rendered on the basis of spelling rather than pronunciation. For example, the Korean form of shop, syob, is pronounced with a long o (as in ghost) rather than with the ah sound of the English pronunciation. By contrast, hot follows its English pronunciation and is pronounced with an ah sound: has.
This demo tries to guess the most likely way that an English word would be written in Korean on the basis of its English spelling and pronunciation. There are 5 basic steps, each of which is explained below.
- Try to find the word in a loanword dictionary. If it's not there, continue with the remaining steps.
- Get English pronunciation.
- Line up English spelling and pronunciation.
- Produce the most likely Korean character for each English letter-sound pair.
- Syllabify the Korean and convert to hangeul syllables.
The first step is to try to find the English word in an English-Korean loanword dictionary. The rationale for this step is simple: if it's in the dictionary, it's probably correct and we don't have to worry about guessing anything. This demo uses a list of 14,000 English-Korean loanwords derived from a list of foreign words from the National Academy of the Korean Language. The problem with relying on a dictionary, especially for borrowed words, is that new words enter the language faster than the dictionary can keep up. So if a word is not in the dictionary, the next steps try to generate its Korean spelling.
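The dictionary-first logic can be sketched as follows. This is only an illustration, not the demo's code: the function name `transliterate`, the two-entry toy dictionary (hangeul forms inferred from the romanizations syob and has used above), and the placeholder generator are all assumptions.

```python
from typing import Callable, Dict, Tuple


def transliterate(word: str,
                  loanwords: Dict[str, str],
                  generate: Callable[[str], str]) -> Tuple[str, str]:
    """Return (korean, source): the dictionary entry if there is one,
    otherwise whatever the fallback generator produces."""
    key = word.strip().lower()
    if key in loanwords:
        return loanwords[key], "dictionary"
    return generate(key), "generated"


# Toy stand-in for the 14,000-entry loanword list (hypothetical data).
toy_loanwords = {"shop": "숍", "hot": "핫"}


def placeholder_generate(word: str) -> str:
    # Hypothetical stub standing in for the later generation steps.
    return "<generated:" + word + ">"


print(transliterate("Shop", toy_loanwords, placeholder_generate))
print(transliterate("blog", toy_loanwords, placeholder_generate))
```

Returning the source alongside the result makes it easy to tell a dictionary hit, which is probably correct, from a generated guess.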
Getting the pronunciation of an English word follows the same rationale as above: look for it first in a dictionary, and if it's not there, make something up. This demo uses a version of the CMU Pronouncing Dictionary, which contains around 116,000 English words and their pronunciations. If a word is not in the CMU Dictionary, we generate its pronunciation automatically using probabilistic letter-to-sound rules. [Details of the model one day? In the meantime, here's the general idea.] This is the first point where things can really start to go wrong: if we generate the wrong English pronunciation here, all of the following steps will be affected, possibly badly.
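A minimal sketch of the pronunciation lookup, assuming the plain-text layout of the standard CMU Pronouncing Dictionary (one `WORD  PH1 PH2 ...` entry per line, comment lines starting with `;;;`). The version used by this demo evidently uses a different phoneme notation, and `spell_out` below is a hypothetical placeholder, not a letter-to-sound model.

```python
def parse_cmudict(lines):
    """Parse CMUdict-style lines into {word: [phonemes]}."""
    pron = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        # Alternate pronunciations look like WORD(1); keep only the first one seen.
        word = head.split("(")[0].lower()
        pron.setdefault(word, phones)
    return pron


def pronounce(word, pron_dict, guess):
    """Return (phonemes, from_dictionary)."""
    key = word.lower()
    if key in pron_dict:
        return pron_dict[key], True
    return guess(key), False


def spell_out(word):
    # Placeholder only: returns the letters themselves as fake "phonemes".
    return list(word.upper())


sample = [
    ";;; a comment line",
    "HOT  HH AA1 T",
    "SHOP  SH AA1 P",
    "THIRSTY  TH ER1 S T IY0",
]
cmu = parse_cmudict(sample)
print(pronounce("thirsty", cmu, spell_out))
print(pronounce("zzz", cmu, spell_out))
```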
Aligning English Spelling and Pronunciation
Once we have a word and its pronunciation, we have to line up the letters with their sounds. This is a little tricky, because the link between English spelling and pronunciation is not always straightforward. For this task, we basically want orthographic vowels and consonants to line up with phonemic vowels and consonants, respectively. This part of the process uses a heuristic edit-distance algorithm with empirically estimated weights. [Details? Inspiration from papers like this one.] For example, we would want the word thirsty and its pronunciation TXsti to line up so that th corresponds to T, ir to X, s to s, t to t, and y to i.
This step also provides a chance for things to mess up: the alignments are probabilistically generated, which means they will sometimes be wrong (word accuracy is 99% on a 10,000-word development set). These alignments are used in the next step to generate the corresponding Korean segments.
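To make the idea concrete, here is a small edit-distance aligner. It is a sketch under stated assumptions, not the demo's algorithm: the costs are hand-picked rather than empirically estimated, the vowel sets are guesses about the one-character phoneme notation used above, and the rule that a letter matching its phoneme symbol is cheapest is purely a toy heuristic.

```python
LETTER_VOWELS = set("aeiouy")
PHONE_VOWELS = set("aeiouAEIOUX")  # assumption about the phoneme notation
SKIP = 1.0                         # cost of leaving a letter or phoneme unpaired


def pair_cost(letter, phone):
    if letter == phone.lower():
        return 0.0                 # letter looks like its phoneme symbol
    if (letter in LETTER_VOWELS) == (phone in PHONE_VOWELS):
        return 0.5                 # vowel with vowel, consonant with consonant
    return 2.0                     # vowel paired with consonant: discouraged


def align(word, phones):
    """Weighted edit-distance alignment; returns (letter, phoneme) pairs."""
    n, m = len(word), len(phones)
    cost = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            options = []
            if i > 0 and j > 0:
                options.append((cost[i - 1][j - 1] + pair_cost(word[i - 1], phones[j - 1]), "pair"))
            if i > 0:
                options.append((cost[i - 1][j] + SKIP, "letter"))
            if j > 0:
                options.append((cost[i][j - 1] + SKIP, "phone"))
            if options:
                cost[i][j], back[i][j] = min(options, key=lambda o: o[0])
    pairs, i, j = [], n, m
    while i > 0 or j > 0:          # trace the cheapest path back to the start
        move = back[i][j]
        if move == "pair":
            pairs.append((word[i - 1], phones[j - 1])); i -= 1; j -= 1
        elif move == "letter":
            pairs.append((word[i - 1], "")); i -= 1
        else:
            pairs.append(("", phones[j - 1])); j -= 1
    return pairs[::-1]


def group(pairs):
    """Attach unpaired letters to the preceding letter-sound pair."""
    groups = []
    for letter, phone in pairs:
        if phone == "" and groups:
            groups[-1] = (groups[-1][0] + letter, groups[-1][1])
        else:
            groups.append((letter, phone))
    return groups


print(group(align("thirsty", "TXsti")))
```

With these toy costs, thirsty and TXsti come out as th/T, ir/X, s/s, t/t, y/i. Real weights would be estimated from data, which is where the occasional wrong alignment comes from.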
Generating Korean Segments
The most likely Korean segment given an English letter-sound alignment is estimated on the basis of a set of 10,000 English-Korean loanwords that are 3-way aligned between English spelling, pronunciation, and Korean spelling, e.g.,
From examples like this, we learn the probabilities of each Korean segment given its English context. For example, the model should learn that Korean t is more likely when the English spelling and pronunciation are t at the beginning of a word, and Korean tU is more likely when there's an English t at the end of a word. Then when it sees a new English word and its pronunciation, it applies these probabilities to generate the most likely Korean segment. [Details forthcoming.]
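The counting behind such probabilities can be sketched like this. The training triples below are invented toy data in an ad hoc romanization, and the context is reduced to word position only; the actual model and its features are not described here, so treat `train` and `most_likely` as hypothetical names for illustration.

```python
from collections import Counter, defaultdict


def train(aligned_words):
    """Count Korean segments per (letter, phoneme, position) context."""
    counts = defaultdict(Counter)
    for word in aligned_words:
        last = len(word) - 1
        for idx, (letter, phone, korean) in enumerate(word):
            position = "initial" if idx == 0 else "final" if idx == last else "medial"
            counts[(letter, phone, position)][korean] += 1
    return counts


def most_likely(counts, letter, phone, position):
    """Return (korean_segment, relative_frequency), or None for an unseen context."""
    seen = counts.get((letter, phone, position))
    if not seen:
        return None
    korean, n = seen.most_common(1)[0]
    return korean, n / sum(seen.values())


# Invented 3-way aligned examples: (English letter, English phoneme, Korean segment).
toy_words = [
    [("t", "t", "t"), ("o", "a", "a"), ("p", "p", "b")],
    [("t", "t", "t"), ("e", "E", "e"), ("n", "n", "n")],
    [("h", "h", "h"), ("i", "I", "i"), ("t", "t", "tU")],
    [("n", "n", "n"), ("e", "E", "e"), ("t", "t", "tU")],
    [("h", "h", "h"), ("o", "a", "a"), ("t", "t", "s")],
]
model = train(toy_words)
print(most_likely(model, "t", "t", "initial"))
print(most_likely(model, "t", "t", "final"))
```

In this toy data, word-initial t always maps to t, while word-final t maps to tU in two of three cases, so tU is chosen there.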
The output of the preceding steps is a sequence of Korean letters. These have to be converted into orthographic hangeul syllables in order to be a valid Korean word. For the most part, syllabification is unambiguous, but not entirely.
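A minimal sketch of this last step, assuming the input is a string of hangeul compatibility letters (jamo) rather than the romanized segments used above. Composition uses the standard Unicode arithmetic for precomposed syllables; the syllabifier is a simplification that allows at most one final consonant and always treats a consonant before a vowel as the start of the next syllable, which is exactly the kind of choice that is not always unambiguous.

```python
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"       # 21 vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals


def compose(lead, vowel, tail=""):
    """Build one precomposed syllable: U+AC00 + (lead * 21 + vowel) * 28 + tail."""
    index = (LEADS.index(lead) * 21 + VOWELS.index(vowel)) * 28 + TAILS.index(tail)
    return chr(0xAC00 + index)


def syllabify(jamo):
    """Greedy syllabification of a jamo string (simplified, see lead-in)."""
    out, i, n = [], 0, len(jamo)
    while i < n:
        if jamo[i] in VOWELS:
            lead = "ㅇ"            # a syllable starting with a vowel takes silent ㅇ
        else:
            lead = jamo[i]
            i += 1
            if i >= n or jamo[i] not in VOWELS:
                raise ValueError("consonant without a following vowel")
        vowel = jamo[i]
        i += 1
        tail = ""
        # Take a final consonant only if it is not needed to start the next syllable.
        if i < n and jamo[i] not in VOWELS and (i + 1 >= n or jamo[i + 1] not in VOWELS):
            tail = jamo[i]
            i += 1
        out.append(compose(lead, vowel, tail))
    return "".join(out)


print(syllabify("ㅅㅛㅂ"))
print(syllabify("ㅎㅣㅌㅡ"))
```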
Average per-word transliteration accuracy on unseen words from the development set is around 68%. Nearly all of the errors involve vowels.