Real-time Methods Localize Multiple Speech Sources


 

Real-time speech source localization using microphone arrays is important in applications such as speech separation, speech enhancement, and speaker tracking. Timely knowledge of the speakers’ locations is an essential prerequisite for these systems.

In particular, when speech sources are present only briefly or their locations vary over time, localization must be performed on a short-term segment, such as a word-length utterance, rather than on a long-term segment, such as a sentence-length utterance. Because a short-term segment provides only a few frames, short-term source localization requires a robust method.

Although signal subspace methods are robust for short-term source localization, most of them cannot count the number of sources and do not exploit the sparsity of speech in the frequency domain.

Recently, researchers YING Dongwen, ZHOU Ruohua, LI Junfeng, and YAN Yonghong presented two window-dominant signal subspace (WDSS) methods, a grid-search WDSS (GS-WDSS) method and a closed-form WDSS (CF-WDSS) method, to localize multiple speech sources using short-term segments.

The proposed methods are valuable for real-time speech source localization because of their low latency. The generalized sparsity assumption underlying them, described below, promises to improve speech source localization in adverse environments.

The performance of the two localization methods was evaluated using short-term speech segments with a duration of 0.24 seconds. The proposed methods are not only robust but can also count the number of speech sources. Although CF-WDSS does not perform as well as GS-WDSS, it is very computationally efficient.

This research has been published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: PP, Issue: 99).

The WDSS methods are based on the generalized sparsity assumption that each window of several time-adjacent bins is dominated by one source, as opposed to the conventional assumption that each individual time-frequency bin is dominated by one source. Under the generalized assumption, the principal eigenvector of the spatial correlation matrix computed on each window spans the signal subspace of the window-dominant source.
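The window-level step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, the array layout (frames by microphones), and the use of a plain sample average for the spatial correlation matrix are assumptions made for illustration only.

```python
import numpy as np

def principal_eigenvector(window_bins):
    """Principal eigenvector of the spatial correlation matrix of one window.

    window_bins: complex array of shape (n_frames, n_mics) holding the
    short-time Fourier coefficients of time-adjacent bins at one frequency
    (hypothetical layout, chosen for this sketch).
    Returns (eigenvector, eigenvalues in ascending order).
    """
    X = np.asarray(window_bins, dtype=complex)
    # Sample spatial correlation matrix R = (1/T) * sum_t x_t x_t^H
    R = X.T @ X.conj() / X.shape[0]
    # R is Hermitian, so eigh applies; eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(R)
    return eigvecs[:, -1], eigvals
```

If one source dominates the window, the largest eigenvalue should stand well above the others, and the returned eigenvector should be nearly parallel to that source's steering vector.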

The direction of arrival (DOA) of the dominant source is estimated from the principal eigenvector. The DOAs and the number of sources are then obtained from the histogram of the DOA estimates of all window-dominant sources. The conventional assumption is a special case of the generalized assumption, namely a window containing a single bin. Using the generalized assumption significantly improves the DOA estimates of the window-dominant sources, at the cost of an acceptable masking effect.
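As a rough illustration of these two steps, the sketch below estimates a DOA from an eigenvector by a grid search and then reads the DOAs and the source count off a histogram. The article does not give the details of GS-WDSS or of the histogram analysis, so everything here is an assumption: a far-field uniform linear array, a simple steering-vector matching score, and a peak-picking rule with a hypothetical `min_fraction` threshold.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed

def steering_vector(theta_deg, freq_hz, n_mics, spacing_m):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    delays = (np.arange(n_mics) * spacing_m
              * np.sin(np.deg2rad(theta_deg)) / SPEED_OF_SOUND)
    return np.exp(-2j * np.pi * freq_hz * delays)

def grid_search_doa(eigvec, freq_hz, spacing_m, grid_deg):
    """Return the grid angle whose steering vector best matches eigvec."""
    n_mics = len(eigvec)
    scores = [abs(np.vdot(steering_vector(t, freq_hz, n_mics, spacing_m), eigvec))
              for t in grid_deg]
    return float(grid_deg[int(np.argmax(scores))])

def summarize_histogram(doas_deg, bin_edges, min_fraction=0.15):
    """Return the centers of histogram peaks holding at least min_fraction
    of all window-level DOA estimates; their count estimates the number
    of sources (illustrative peak-picking rule)."""
    counts, edges = np.histogram(doas_deg, bins=bin_edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    padded = np.concatenate(([0], counts, [0]))  # zero-pad for edge bins
    total = counts.sum()
    peaks = []
    for i in range(len(counts)):
        left, mid, right = padded[i], padded[i + 1], padded[i + 2]
        if mid >= min_fraction * total and mid > left and mid >= right:
            peaks.append(float(centers[i]))
    return peaks

# Usage: window-level estimates clustered around two directions plus outliers
estimates = [-30.0] * 60 + [20.0] * 40 + [70.0] * 3
doas = summarize_histogram(estimates, np.arange(-95, 96, 10))
print(doas, len(doas))
```

The threshold keeps isolated outliers from being counted as sources; how the published methods handle this is not stated in the article.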

Besides time-adjacent bins, frequency-adjacent bins could also be considered under the generalized assumption; this will be addressed in future research.

Funding for this research came from the National Program on Key Basic Research Project (2013CB329302), the National Natural Science Foundation of China (Nos. 61671442, 61271426, 11461141004, 91120001), and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Nos. XDA06030100, XDA06030500).

Reference:

YING Dongwen, ZHOU Ruohua, LI Junfeng, and YAN Yonghong. Window-Dominant Signal Subspace Methods for Multiple Short-Term Speech Source Localization. IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: PP, Issue: 99, Page(s): 1-1, November 2016). DOI: 10.1109/TASLP.2016.2625458

Contact:

YING Dongwen

Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, 100190 Beijing, China

E-mail: yingdongwen@hccl.ioa.ac.cn and yyan@hccl.ioa.ac.cn

Appendix: