Applications in today's multimedia domain require that a Speech/Music classifier offer merits beyond accuracy alone, such as short delay and low computational complexity. To this end, WU Qiong, DENG Haojiang and WANG Jinlin of the Institute of Acoustics, Chinese Academy of Sciences, in collaboration with YAN Qin of the Information Engineering College, Hohai University, set out to build a novel Speech/Music classifier using data mining methods.
They built the system by analyzing the inherent discriminative properties of diverse features extracted from the audio, constructing a hierarchical structure of oblique decision trees (HODT) to maintain optimal performance, and applying a novel context-based state transform (ST) strategy to refine the classification results. The proposed algorithm was evaluated on a set of 702 audio files of 5-11 minutes each, generated from 54 speech or music files at different Signal-to-Noise Ratio (SNR) levels and with diverse noise types. Experimental results show that the proposed classifier outperforms AMR-WB+, achieving classification rates of 97.9% and 95.9% at the 10 ms frame level in clean and high-SNR (>= 20 dB) environments, respectively. The post-processing ST strategy further enhances system performance, particularly in low-SNR conditions (10 dB), raising the accuracy rate by 5.6%. In addition, the complexity of the proposed system is below 1 WMOPS, which makes it easily adaptable to many scenarios.
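The published paper details the features, the tree construction and the ST rules; as a rough illustration of the two named ideas, the sketch below is a minimal Python example with hypothetical weights, threshold and window length rather than the authors' actual parameters. It shows an oblique decision split, a test on a linear combination of frame features instead of a single feature as in axis-aligned trees, followed by a simple majority-vote smoothing of per-frame labels standing in for the context-based ST refinement.

```python
import numpy as np

# Illustrative weights and threshold -- NOT parameters from the paper.
W = np.array([1.8, -0.6, 0.9])  # one weight per frame feature
B = 0.25                        # split threshold

def classify_frame(features):
    """Oblique split: test a linear combination of the frame's features
    (w . x > b) rather than one feature at a time."""
    return 'speech' if np.dot(W, features) > B else 'music'

def smooth_labels(labels, window=5):
    """Context-based refinement (a stand-in for the paper's state-transform
    strategy): relabel each frame with the majority label of the
    surrounding window, removing isolated misclassified frames."""
    half = window // 2
    out = list(labels)
    for i in range(half, len(labels) - half):
        context = labels[i - half:i + half + 1]
        out[i] = max(set(context), key=context.count)
    return out

# Usage on randomly generated 3-dimensional features for 100 frames.
frames = np.random.randn(100, 3)
raw = [classify_frame(f) for f in frames]
refined = smooth_labels(raw)
```

A real HODT arranges such nodes hierarchically, with each level operating on different features, and the actual ST rules exploit richer context than a majority vote; the sketch only conveys the flavor of the two components.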
The research results were published in the journal Computer Speech and Language (Volume 24, Issue 2, April 2010, Pages 257-272).