With globalization, China is becoming more closely connected to the world. Foreign languages (particularly English) are used more frequently, especially among younger generations in cities, and it is very common for English words to be embedded in Chinese sentences during conversation. This makes Mandarin-English bilingual speech recognition a necessity for many speech recognition applications.
ZHANG Qingqing, PAN Jielin and YAN Yonghong of the ThinkIT Speech Laboratory, Institute of Acoustics, Chinese Academy of Sciences, carried out a series of studies aimed at developing a Mandarin-English bilingual speech recognition system for real-world applications.
They developed a grammar-constrained Mandarin-English bilingual Speech Recognition System (MESRS) for real-world music retrieval. In the study, they tackled two of the main difficulties in bilingual speech recognition for real-world applications: one is to balance the performance and the complexity of the bilingual speech recognition system; the other is to effectively handle matrix-language (Mandarin) accents in the embedded language (English). To address the first issue, they developed a unified bilingual acoustic model derived with a novel Two-pass phone-clustering method based on the Confusion Matrix (TCM). To address the second, they investigated several nonnative model modification approaches applied to the unified acoustic model.
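The paper itself details the TCM procedure; as an illustrative sketch only, the Python snippet below shows one plausible way to cluster phones from a phone confusion matrix by sharing an acoustic model between each English phone and the Mandarin phone it is most often confused with. The phone inventory, confusion counts, and threshold are hypothetical and are not taken from the study.

```python
# Illustrative sketch: confusion-matrix-driven phone clustering.
# The phone sets, counts, and min_ratio threshold below are hypothetical.
import numpy as np

# Toy confusion matrix: rows = English phones, columns = Mandarin phones.
# Entry [i, j] counts how often English phone i was recognized as Mandarin
# phone j by a baseline recognizer (made-up numbers for illustration).
english_phones = ["AA", "IY", "SH", "N"]
mandarin_phones = ["a", "i", "sh", "n"]
confusion = np.array([
    [40,  2,  1,  0],
    [ 3, 55,  0,  1],
    [ 0,  1, 48,  2],
    [ 1,  0,  3, 60],
], dtype=float)

def cluster_phones(confusion, eng, man, min_ratio=0.5):
    """Pair each English phone with its most-confused Mandarin phone when
    that confusion accounts for at least min_ratio of its occurrences;
    otherwise keep a language-specific model."""
    clusters = []
    for i, e in enumerate(eng):
        total = confusion[i].sum()
        j = int(confusion[i].argmax())
        if total > 0 and confusion[i, j] / total >= min_ratio:
            clusters.append((e, man[j]))   # share one acoustic model
        else:
            clusters.append((e, None))     # keep a separate model
    return clusters

for eng_phone, man_phone in cluster_phones(confusion, english_phones, mandarin_phones):
    if man_phone is not None:
        print(f"share model: EN/{eng_phone} <-> ZH/{man_phone}")
    else:
        print(f"keep separate model: EN/{eng_phone}")
```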
Compared with the existing log-likelihood phone-clustering method, the proposed TCM method, combined with a limited amount of nonnative adaptation data and adaptive modification, reduces the Phrase Error Rate (PER) on nonnative English phrases by 10.9% relative, while the PER on Mandarin phrases also decreases favorably. In addition, recognition of bilingual code-mixing phrases achieves an 8.9% relative PER reduction.
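For context, a relative PER reduction is measured against the baseline PER. The short sketch below illustrates the arithmetic with hypothetical baseline and improved PER values, since the paper's absolute error rates are not quoted here.

```python
# Hypothetical illustration of a relative PER reduction; the absolute PER
# values below are examples only, not figures reported in the study.
baseline_per = 30.0   # baseline Phrase Error Rate (%)
new_per = 26.73       # PER after TCM clustering and nonnative adaptation (%)

relative_reduction = (baseline_per - new_per) / baseline_per * 100
print(f"relative PER reduction: {relative_reduction:.1f}%")  # ~10.9%
```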