As one of the most natural and important means of human-human interaction, speech communication consists of two channels: an explicit channel carrying linguistic information and an implicit channel carrying non-linguistic information. Both play crucial roles in human speech communication. Linguistic information processing has been studied extensively over the past several decades; research on non-linguistic cues, however, has gained momentum only recently.
Non-linguistic cues include, among others, gender, age, emotion, stress and nervousness, and dialect. Among these, emotion plays a key role in many applications. Two main approaches have been proposed to model emotions: one defines a set of discrete basic emotions, such as anger, disgust, fear, happiness, sadness, and surprise; the other places emotions in continuous emotional dimensions, for instance the three-dimensional emotional space of arousal (activation), potency, and valence.
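To make the contrast between the two modeling views concrete, the following minimal Python illustration represents the same emotional state first as a discrete label and then as a point in the three-dimensional space. The class and field names here are assumptions chosen for illustration, not constructs from the paper:

```python
from dataclasses import dataclass

# Discrete view: emotion is one of a fixed set of basic categories.
BASIC_EMOTIONS = {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}

@dataclass
class EmotionDimensions:
    """Continuous view: a point in the three-dimensional emotional space."""
    arousal: float   # activation level: calm ... excited
    potency: float   # sense of control: weak ... dominant
    valence: float   # pleasantness: negative ... positive

label = "anger"                                                     # discrete
point = EmotionDimensions(arousal=0.9, potency=0.7, valence=-0.8)   # continuous
```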
Against this background, ZHOU Yu, SUN Yanqing, ZHANG Jianping, and YAN Yonghong of the ThinkIT Speech Lab, CAS Institute of Acoustics, in collaboration with LI Junfeng and Akagi Masato of the School of Information Science, Japan Advanced Institute of Science and Technology, carried out a series of studies and put forward a hybrid speech emotion recognition system that exploits both spectral and prosodic features in speech.
To capture emotional information in the spectral domain, the researchers proposed a new spectral feature extraction method that applies a novel non-uniform subband processing in place of the mel-frequency subbands used in Mel-Frequency Cepstral Coefficients (MFCC). For the prosodic domain, they selected a set of features that correlate closely with speech emotional states. Because the two kinds of features have inherently different characteristics (e.g., data size), the hybrid system models them separately: the newly extracted spectral features with a Gaussian Mixture Model (GMM) and the selected prosodic features with a Support Vector Machine (SVM). The final decision of the proposed emotion recognition system is obtained by combining the results of these two subsystems.

Experimental results show that (1) the proposed non-uniform spectral features are more effective than the traditional MFCC features for emotion recognition, and (2) the proposed hybrid system using both spectral and prosodic features achieves a relative recognition error reduction of 17.0% over a traditional system using only the spectral features, and of 62.3% over a system using only the prosodic features.
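The fusion idea behind the hybrid system lends itself to a short sketch. The Python snippet below, assuming scikit-learn and synthetic stand-in features, trains one GMM per emotion on frame-level spectral features, a probabilistic SVM on utterance-level prosodic vectors, and combines the two subsystems' posteriors. Note that the article does not give the non-uniform subband layout, the label set, the mixture size, or the fusion weight, so all of those are illustrative assumptions rather than the authors' actual configuration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]    # assumed label set
rng = np.random.default_rng(0)

# --- Spectral subsystem: one GMM per emotion over frame-level features ---
gmms = {}
for i, emotion in enumerate(EMOTIONS):
    frames = rng.normal(loc=i, size=(200, 13))             # stand-in spectral frames
    gmms[emotion] = GaussianMixture(n_components=4,
                                    covariance_type="diag").fit(frames)

def gmm_posteriors(frames):
    """Sum frame log-likelihoods per class, then softmax into posteriors."""
    ll = np.array([gmms[e].score_samples(frames).sum() for e in EMOTIONS])
    ll -= ll.max()                                         # numerical stability
    p = np.exp(ll)
    return p / p.sum()

# --- Prosodic subsystem: one vector per utterance, probabilistic SVM ---
X_pros = rng.normal(size=(80, 6))                          # stand-in prosodic vectors
y = np.repeat(np.arange(len(EMOTIONS)), 20)
X_pros += y[:, None]                                       # make classes separable
svm = SVC(probability=True).fit(X_pros, y)

# --- Fusion: weighted combination of the two subsystems' posteriors ---
def classify(spec_frames, pros_vec, w=0.6):                # w is an assumed weight
    p_gmm = gmm_posteriors(spec_frames)
    p_svm = svm.predict_proba(pros_vec[None, :])[0]        # columns follow svm.classes_
    return EMOTIONS[int(np.argmax(w * p_gmm + (1.0 - w) * p_svm))]

print(classify(rng.normal(loc=2, size=(120, 13)), rng.normal(size=6) + 2))
```

Modeling the two feature streams separately follows the paper's rationale: frame-level spectral features yield many low-dimensional samples per utterance, which suits a generative GMM, while the prosodic stream produces one vector per utterance, which suits a discriminative SVM; the weighted score combination is one common way to merge such subsystem outputs.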