One of the major challenges of using automatic speech recognition (ASR) in real-world applications is handling the mismatch between the training corpus and the testing data, that is, the variations that occur naturally in speech. However, most technologies are developed to deal with only one specific type of mismatch, and may run into problems in real-world use.
To address this, SUN Yanqing, ZHOU Yu, ZHAO Weiqing and YAN Yonghong of the Institute of Acoustics, Chinese Academy of Sciences, carried out a series of studies focused on the problem of performance degradation in mismatched speech recognition.
The researchers use the F-Ratio analysis method to measure how significant different frequency bands are for speech unit classification. They find that frequencies around 1 kHz and 3 kHz, which are the upper bounds of the first and second formants for most vowels, should be emphasized more strongly than they are in the Mel-frequency cepstral coefficients (MFCC). This result is further observed to be stable in several typical mismatched situations. By analogy with the Mel-frequency scale, the researchers therefore propose another frequency scale, called the F-Ratio scale, to optimize the filter bank design for the MFCC features so that each subband carries equal significance for speech unit classification. Under comparable conditions, the modified features give a relative decrease in sentence error rate, compared with the MFCC, of 43.20% for emotion-affected speech recognition, 35.54% and 23.03% for noisy speech recognition at 15 dB and 0 dB SNR (signal-to-noise ratio) respectively, and 64.50% for three years of 863 test data. Applying the F-Ratio analysis to the clean training set of the Aurora2 database demonstrates its robustness across languages, texts and sampling rates.
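As a rough illustration of the two ideas described above, the following Python sketch computes an F-Ratio for each frequency band and then places filter-bank edges so that each filter covers an equal share of the total F-Ratio. It assumes the common textbook definition of the F-Ratio (variance of the class means divided by the average within-class variance); the function names, the input layout and the edge-placement rule are hypothetical choices made for this example and are not taken from the published work.

```python
import numpy as np


def f_ratio_per_band(energies, labels, eps=1e-12):
    """F-Ratio of each band: between-class variance / within-class variance.

    energies: array of shape (n_frames, n_bands), e.g. log sub-band energies.
    labels:   array of shape (n_frames,), the speech unit of each frame.
    Assumption: every class is weighted equally, whatever its frame count.
    """
    energies = np.asarray(energies, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Per-class mean and variance of every band.
    class_means = np.stack([energies[labels == c].mean(axis=0) for c in classes])
    class_vars = np.stack([energies[labels == c].var(axis=0) for c in classes])
    between = class_means.var(axis=0)   # spread of the class means
    within = class_vars.mean(axis=0)    # average spread inside a class
    # eps only guards against division by zero for constant bands.
    return between / np.maximum(within, eps)


def equal_significance_edges(band_edges_hz, f_ratio, n_filters):
    """Place n_filters + 1 edges so each filter spans an equal F-Ratio share.

    band_edges_hz: edges of the fine analysis bands, length n_bands + 1.
    f_ratio:       F-Ratio of each fine band, length n_bands.
    Assumption (hypothetical rule): significance is spread uniformly inside
    each fine band, so edges are found by linear interpolation.
    """
    band_edges_hz = np.asarray(band_edges_hz, dtype=float)
    f_ratio = np.asarray(f_ratio, dtype=float)
    # Cumulative significance at every fine-band edge, normalized to [0, 1].
    cumulative = np.concatenate([[0.0], np.cumsum(f_ratio)])
    cumulative /= cumulative[-1]
    targets = np.linspace(0.0, 1.0, n_filters + 1)
    return np.interp(targets, cumulative, band_edges_hz)


# Toy data: band 0 separates the two classes, band 1 does not.
toy_energies = np.array([[0.0, 0.0], [2.0, 4.0], [10.0, 0.0], [12.0, 4.0]])
toy_labels = np.array(["a", "a", "b", "b"])
toy_f_ratio = f_ratio_per_band(toy_energies, toy_labels)

# A band with a higher F-Ratio receives narrower, more densely packed filters.
toy_edges = equal_significance_edges([0.0, 1000.0, 2000.0], [3.0, 1.0], 2)
```

In this toy setup the discriminative band gets a much larger F-Ratio than the uninformative one, and the edge placement narrows the filters where the F-Ratio is high. A real analysis would use sub-band energies from labeled speech rather than hand-made numbers.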
The results were published in a recent issue of IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS (E93D, P. 2417-2430, 2010).