Chinese speech synthesis

Chinese speech synthesis is the application of speech synthesis to the Chinese language (usually Standard Mandarin). It poses additional difficulties because Chinese characters frequently have different pronunciations in different contexts, because complex prosody is essential to conveying the meaning of words, because unexpected or unusual combinations of syllables occur more frequently,^[citation needed] and because native speakers sometimes disagree about the correct pronunciation of certain phonemes.

Approaches taken

Corpus-based (iflytek and SinoSonic)

iflytek (formerly Ifly Info Tech) published a W3C paper in which they adapted Speech Synthesis Markup Language to produce a dialect called Chinese Speech Synthesis Markup Language (CSSML), which can include additional markup to clarify the pronunciation of characters and to add some prosody information.^[1] Their synthesiser takes a "corpus-based" approach, which means it can sound very natural in most cases but can fail on awkward or unusual phrases that cannot be matched with the corpus. iflytek does not disclose the amount of data involved, but it can be gauged from the commercial products that license iflytek's technology; for example, Bider's SpeechPlus is a 1.3 gigabyte download, 1.2 gigabytes of which is highly compressed data for a single Chinese voice. iflytek's synthesiser can also synthesise mixed Chinese and English text with the same voice (e.g. Chinese sentences containing some English words); they claim their English synthesis to be "average".

The iflytek corpus appears to be heavily dependent on Chinese characters, and it is not possible to synthesize from pinyin alone. CSSML can sometimes be used to add pinyin to the characters to disambiguate between multiple possible pronunciations, but this does not always work. The spaced-interval repetition language-practice program Gradint includes a utility that attempts to turn arbitrary pinyin into CSSML that SpeechPlus will speak correctly, by choosing the Chinese characters most likely to be given the required pronunciation by SpeechPlus (taking its quirks into account). Because the resulting sound can still contain syllables that are totally different from the pinyin input, the utility recommends that every synthesized phrase be systematically tested and that a backup synthesizer be used for phrases that are not spoken correctly. Gradint now recommends the use of Lily (see below) instead of SpeechPlus.

A corpus-based approach is also taken by Tsinghua University's SinoSonic, whose Harbin voice data takes up 800 megabytes. As of 2007, the download link for SinoSonic had not yet been activated.

Concatenation (KeyTip)

A less complex approach is taken by cjkware.com's KeyTip Putonghua Reader, which contains 120 Megabytes of sound recordings (GSM-compressed to 40 Megabytes in the evaluation version), comprising 10,000 multi-syllable dictionary words plus single-syllable recordings in 6 different prosodies (4 tones, neutral tone, and an extra third-tone recording for use at the end of a phrase). These recordings can be concatenated in any desired combination, but the joins sound forced (as is usual for simple concatenation-based speech synthesis) and this can severely affect prosody; the synthesizer is also inflexible in terms of speed and expression. However, because this synthesizer does not rely on a corpus, there is no noticeable degradation in performance when it is given more unusual or awkward phrases.
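The selection of recordings described above can be illustrated with a minimal sketch. It is not KeyTip's actual algorithm, which this article does not document: the greedy longest-match strategy, the function name `plan_units`, the `_final` suffix for the phrase-final third-tone recording, and the recording names are all assumptions made for illustration. Syllables are written as numbered pinyin, with tone 5 standing for the neutral tone.

```python
from typing import Dict, List, Sequence, Tuple


def plan_units(
    syllables: Sequence[str],
    word_recordings: Dict[Tuple[str, ...], str],
    max_word_len: int = 4,
) -> List[str]:
    """Return the names of the recordings to play, in order, for one phrase.

    `word_recordings` maps a tuple of syllables to the (hypothetical) name
    of a multi-syllable dictionary-word recording.
    """
    units: List[str] = []
    i = 0
    n = len(syllables)
    while i < n:
        # Prefer the longest multi-syllable word recording that starts here.
        for length in range(min(max_word_len, n - i), 1, -1):
            key = tuple(syllables[i:i + length])
            if key in word_recordings:
                units.append(word_recordings[key])
                i += length
                break
        else:
            # No word matched: fall back to a single-syllable recording.
            syllable = syllables[i]
            if i == n - 1 and syllable.endswith("3"):
                # Extra third-tone recording for use at the end of a phrase.
                units.append(syllable + "_final")
            else:
                units.append(syllable)
            i += 1
    return units


words = {("ni3", "hao3"): "word_ni3hao3"}
print(plan_units(["ni3", "hao3", "ma5"], words))
print(plan_units(["wo3", "hen3", "hao3"], {}))
```

Because every syllable has a single-syllable fallback, a scheme of this kind never fails outright on an unusual phrase, which is consistent with the behaviour described above; the cost is that the joins between independently recorded units are audible.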

NeoSpeech and Nuance voices

Concatenation with a larger amount of recorded data (about 500 megabytes), along with other undisclosed methods, is apparently used by NeoSpeech's SAPI 5 voices "Lily" and "Wang",^[2] which can, in most cases, reliably synthesize awkward phrases provided they are added to the dictionary properly,^[3] and which do not suffer from the severe inflexibility and forced joins of simpler concatenation-based synthesis.

The Nuance (formerly ScanSoft) RealSpeak MeiLing voice (available from NextUp, although it will not install without a purchased version of TextAloud) has similar properties, but the download size is much smaller (42.7 MB). Due to bugs in the program, it is very difficult to get MeiLing to speak reliably from pinyin or zhuyin input.^[4]

Of these voices, the most reliable for synthesizing awkward or unusual phrases from pronunciation input appears to be Lily. However, even Lily is not perfect. A few phrases are synthesized incorrectly when entered as pinyin but correctly when entered as Chinese characters, for example "yong4chu5lai5" (incorrectly read as the more common "yong4chu1lai5", but characters 用出来 are read correctly), and "zhuan3lai2zhuan3qu4" (the first "zhuan" is incorrectly read as "zhuai", but the characters 转来转去 are read correctly). This is reminiscent of some commercial English speech synthesizers which yield lower quality speech when fed pronunciation data than when fed original text, suggesting that the pronunciation data they accept is not the internal format they use.^[5] Nevertheless it is not always desirable to enter characters only, because often it is necessary to specify a different pronunciation.
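The pinyin examples above use tone numbers after each syllable, with 5 marking the neutral tone. As a minimal sketch of how such input can be split into syllable and tone pairs before being passed to a synthesizer, the following uses a regular expression. The function name `split_numbered_pinyin` is hypothetical, and the sketch does not check that each syllable is a valid Mandarin syllable; it only checks the letters-then-digit shape.

```python
import re
from typing import List, Tuple

# One syllable: letters (including "ü" or the "u:" spelling) then a tone digit.
_SYLLABLE = re.compile(r"([a-zü:]+)([1-5])")
# A whole phrase: one or more syllables, optionally separated by spaces.
_PHRASE = re.compile(r"(?:[a-zü:]+[1-5]\s*)+")


def split_numbered_pinyin(text: str) -> List[Tuple[str, int]]:
    """Split numbered pinyin such as "yong4chu5lai5" into (syllable, tone) pairs."""
    cleaned = text.strip().lower()
    if not _PHRASE.fullmatch(cleaned):
        raise ValueError(f"not numbered pinyin: {text!r}")
    return [(syllable, int(tone)) for syllable, tone in _SYLLABLE.findall(cleaned)]


print(split_numbered_pinyin("yong4chu5lai5"))
print(split_numbered_pinyin("zhuan3lai2zhuan3qu4"))
```

Requiring a tone digit after every syllable keeps the split unambiguous, since unnumbered pinyin such as "xian" can be read as one syllable or two.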

These voices can also fail in ways that cannot be explained by the input format. For example, NeoSpeech Lily and Nuance MeiLing both make the following mistakes (which could indicate that they share unpublished techniques, despite the significant difference in data size): in 首都 (shou3du1), the "du1" is too low in pitch; in 邮编 (you2bian1), the "bian1" is too low in pitch; in 天真 (tian1zhen1), the two syllables are said with a drop of a musical third, like a doorbell, whereas they should be at the same pitch; and in 糖尿病 (tang2 niao4 bing4), the "n" is very unclear. This is true whether the input is characters or (in Lily's case) pinyin. The first three of these mistakes do not occur when the word is part of a longer phrase, but do occur when it is in isolation, which is often the case in a language-learning scenario.^[6]

Sometimes, pinyin phrases that are synthesized incorrectly by Lily can be corrected by breaking long words into separate words, but not in the above examples.

There does not appear to be any method of sending feedback to the developers about these bugs.

eSpeak

The lightweight open-source speech project eSpeak, which has its own approach to synthesis, has started experimenting with Chinese synthesis.

Ekho

Ekho is another open-source Chinese text-to-speech system, which simply concatenates the waveforms of recorded syllables. It supports both Cantonese and Mandarin.
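Joining syllable recordings end to end can be sketched with Python's standard `wave` module. This is an illustration of plain waveform concatenation only, not Ekho's actual code; the helper `silent_wav` is a hypothetical stand-in that produces silence in place of real syllable recordings, and `concatenate_wavs` is an assumed name.

```python
import io
import wave
from typing import List, Optional, Tuple


def silent_wav(n_frames: int, framerate: int = 16000) -> bytes:
    """Build a mono 16-bit WAV of silence (stand-in for a syllable recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(framerate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def concatenate_wavs(recordings: List[bytes]) -> bytes:
    """Join WAV recordings end to end, with no smoothing at the joins."""
    if not recordings:
        raise ValueError("need at least one recording")
    out_buf = io.BytesIO()
    expected: Optional[Tuple[int, int, int]] = None
    with wave.open(out_buf, "wb") as out:
        for data in recordings:
            with wave.open(io.BytesIO(data), "rb") as src:
                fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if expected is None:
                    # The first recording decides the output format.
                    expected = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError("all recordings must share one audio format")
                out.writeframes(src.readframes(src.getnframes()))
    return out_buf.getvalue()


joined = concatenate_wavs([silent_wav(100), silent_wav(50)])
with wave.open(io.BytesIO(joined), "rb") as result:
    print(result.getnframes())
```

Since nothing is done to match pitch or amplitude across the boundaries, real recordings joined this way tend to show the forced joins described for simple concatenation elsewhere in this article.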

Online demos and Bell Labs

There is an online interactive demonstration for NeoSpeech voices,^[7] but it is not possible to customize the Chinese pronunciation by entering pinyin. iflytek also has an online demonstration,^[8] but it is frequently non-functional, with no replies from the contact email address, and in practice it does not appear to accept CSSML pronunciation overrides. (Update: There is now a more reliable demo at iflylanguage.com [server in USA] and ecl.iflytek.com [server in China], which allows CSSML pronunciation correction through a visualization mode named "Advanced Reading Mode Settings". Note, however, that the JavaScript interface is slightly confusing for blind users, as there is no submit button on the form; you have to click on the link that says "Woman's voice" or "Man's voice" after typing text in the box.)

Bell Labs have an online Mandarin text-to-speech demo^[9] dated 1997, but it is now non-functional (the server to which the query is submitted does not exist in the DNS) and the contact email is no longer valid. However, their approach was described in the monograph "Multilingual Text-to-Speech Synthesis: The Bell Labs Approach" (Springer, October 31, 1997, ISBN-13: 978-0792380276), and the former employee who was responsible for the project, Chilin Shih (who now works at the University of Illinois), has some notes about her methods on her website.^[10]

Non-Windows systems

The above-mentioned Chinese speech synthesis systems (apart from the online demos) are available only for Windows. However, the spaced-interval repetition language-practice program Gradint includes code and instructions for using KeyTip and SpeechPlus data on other operating systems, either by reading the data directly or by using the WINE emulator.

There are some reports^[11] that SAPI 5-based speech synthesizers can be run on recent versions of the WINE emulator.

Mac OS had Chinese speech synthesizers available up to version 9; they were removed in Mac OS X but, according to Apple's website, are scheduled to return in version 10.5.

Notable approaches not yet taken

As of 2007, it appears that there have been no projects to synthesize Chinese by simulating the human vocal tract, as GNU Speech is doing for English. Chinese is also notably missing from the extensively multilingual MBROLA project.