With globalization, nonnative speech recognition has become an increasingly important problem in automatic speech recognition (ASR). Current recognizers are known to perform considerably worse on nonnative speech than on native speech, with word error rates two to three times the native error rates.
To address this problem, researchers at the Institute of Acoustics, Chinese Academy of Sciences, carried out a series of experiments and developed a state-based bilingual model modification approach that improves nonnative speech recognition accuracy when accented pronunciations vary widely.
Their experiments are restricted to nonnative English spoken by native speakers of Mandarin. All training and testing data were recorded and digitized at an 8 kHz sampling rate with 16-bit resolution. The speech feature vector consists of 36 components (12 PLP parameters and their first- and second-order time derivatives), computed every 10 ms over a 25 ms window. Cepstral Mean Subtraction (CMS) is applied.
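The framing arithmetic and the CMS step can be illustrated with a minimal Python sketch. It assumes CMS is applied per utterance and that frames are counted without padding; the source specifies neither, and the function names are hypothetical. PLP analysis and the derivative computation are omitted because the source gives no details for them.

```python
# Front-end parameters stated in the text.
SAMPLE_RATE_HZ = 8000
WINDOW_MS = 25
SHIFT_MS = 10
FEATURE_DIM = 12 * 3  # 12 PLP parameters plus first and second derivatives


def frame_params(sample_rate_hz=SAMPLE_RATE_HZ, window_ms=WINDOW_MS,
                 shift_ms=SHIFT_MS):
    """Return (window length, frame shift) in samples."""
    window = sample_rate_hz * window_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    return window, shift


def num_frames(num_samples, window, shift):
    """Count full frames in a signal (assumption: no padding at the end)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean over the utterance from every frame.

    Assumption: the mean is estimated over the whole utterance.
    """
    if not frames:
        return []
    count = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]
```

With these settings, each window spans 200 samples and successive windows start 80 samples apart, so one second of audio yields 98 full frames.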
In the experiments, each state of the baseline nonnative acoustic model is modified with several candidate states from an auxiliary acoustic model trained on the speakers' mother tongue. The state mapping criterion and the number of n-best candidates are investigated, and auxiliary acoustic models with different numbers of Gaussian mixtures are compared on a grammar-constrained speech recognition system.
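The general idea of state-level modification with n-best candidates can be sketched as follows. This is only an illustration under loud assumptions, not the researchers' method: the source does not state the mapping criterion or how candidate states are combined. Here the criterion is assumed to be the Euclidean distance between weighted state means, and modification is assumed to mean borrowing the candidates' Gaussian components with a hypothetical interpolation weight `aux_weight`. Covariances are left out for brevity, and all names are hypothetical.

```python
import math

# A state is a list of (mixture weight, mean vector) pairs.


def state_mean(state):
    """Weighted mean of a state's Gaussian component means."""
    dim = len(state[0][1])
    return [sum(w * mu[d] for w, mu in state) for d in range(dim)]


def n_best_candidates(target, auxiliary_states, n):
    """Names of the n auxiliary states closest to the target state.

    Assumption: closeness is Euclidean distance between state means;
    the criterion actually used in the experiments is not given.
    """
    target_mean = state_mean(target)
    ranked = sorted(
        auxiliary_states,
        key=lambda name: math.dist(target_mean,
                                   state_mean(auxiliary_states[name])),
    )
    return ranked[:n]


def modify_state(target, auxiliary_states, n, aux_weight=0.3):
    """Merge the components of the n-best auxiliary states into the target.

    Assumption: borrowed components share a total weight of aux_weight,
    and the original components are scaled by (1 - aux_weight).
    """
    names = n_best_candidates(target, auxiliary_states, n)
    borrowed = [comp for name in names for comp in auxiliary_states[name]]
    if not borrowed:
        return list(target)
    borrowed_total = sum(w for w, _ in borrowed)
    merged = [((1.0 - aux_weight) * w, mu) for w, mu in target]
    merged += [(aux_weight * w / borrowed_total, mu) for w, mu in borrowed]
    return merged
```

In this sketch, a larger n or more Gaussian mixtures per auxiliary state adds more components to every modified state, which is one plausible reason such a method would raise decoding cost.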
Compared with a nonnative acoustic model that has already been well trained with MAP adaptation, the bilingual model modification approach achieves a further 11.7% relative reduction in Phrase Error Rate, with only a small relative increase in Real Time Factor.
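For readers unfamiliar with the two metrics, the short Python sketch below shows how a relative error reduction and a Real Time Factor are conventionally computed. The function names are hypothetical, and the numbers in the usage note are purely illustrative. The source reports only the 11.7% relative figure, not the underlying error rates or timings.

```python
def relative_reduction(baseline_error, new_error):
    """Fraction by which the error rate drops relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration (standard RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

As a purely hypothetical example, a drop from a 10.00% to an 8.83% Phrase Error Rate would correspond to an 11.7% relative reduction, since (10.00 - 8.83) / 10.00 = 0.117.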