In the speech processing community, the task of automatically synchronizing spoken utterances with their transcripts (textual content) is mainly motivated by the closed-captioning of broadcast news and by speech/video browsing in multimedia retrieval applications. Specifically, the task requires that when a recorded spoken utterance is played back, or live speech is broadcast, the corresponding textual transcription be displayed simultaneously on the screen of a television or a similar media player.
GAO Jie, ZHAO Qingwei and YAN Yonghong of the Thinkit Speech Lab, Institute of Acoustics, Chinese Academy of Sciences carried out a series of studies and present their work on automatically synchronizing spoken utterances with their transcripts (ASUT), where the speech is a live stream and its corresponding transcript is known in advance.
The task is first reduced to the problem of detecting, online, the end times of spoken utterances, and a solution based on a novel frame-synchronous likelihood ratio test (FSLRT) procedure is then proposed. The authors detail the formulation and implementation of the FSLRT procedure under the hidden Markov model (HMM) framework and study its properties and parameter settings empirically. The resulting FSLRT-based news subtitling system correctly subtitles about 90% of the sentences with an average time deviation of about 100 ms, running at 0.37 times real time (RT).
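To make the idea concrete, the sketch below shows how a frame-synchronous likelihood ratio test for online end-of-utterance detection might be organized: a forward pass over a left-to-right HMM of the known transcript is updated once per frame, and the score of its final state is compared against a running background-model score. This is a minimal illustration, not the authors' implementation; the function name fslrt_end_detector, the HMM inputs, the background scores and the threshold value are all assumptions introduced for the example.

```python
import numpy as np

def fslrt_end_detector(frame_ll_utt, frame_ll_bg, trans, threshold=5.0):
    """Illustrative frame-synchronous likelihood ratio test (NOT the paper's code).

    frame_ll_utt : (T, S) per-frame log-likelihoods of each state of the
                   left-to-right HMM built from the known transcript
    frame_ll_bg  : (T,) per-frame log-likelihoods under a background/filler model
    trans        : (S, S) log transition matrix of the left-to-right HMM
    threshold    : log-likelihood-ratio threshold for declaring the utterance end

    Returns the first frame index at which the test fires, or None.
    """
    T, S = frame_ll_utt.shape
    log_alpha = np.full(S, -np.inf)
    log_alpha[0] = frame_ll_utt[0, 0]   # decoding starts in the first state
    log_bg = frame_ll_bg[0]             # running background log-likelihood
    for t in range(1, T):
        # frame-synchronous forward update over the transcript HMM
        prev = log_alpha[:, None] + trans            # (from-state, to-state)
        log_alpha = np.logaddexp.reduce(prev, axis=0) + frame_ll_utt[t]
        log_bg += frame_ll_bg[t]
        # likelihood ratio: final-state score vs. the background hypothesis
        llr = log_alpha[-1] - log_bg
        if llr > threshold:
            return t                    # utterance end detected at frame t
    return None
```

Because the forward pass advances one frame at a time, such a test can run synchronously with the incoming audio stream, which is what allows a detected end time to trigger the display of the next subtitle with low latency in a live setting.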
The results were published in a recent issue of Speech Communication (Vol. 53, Issue 4, April 2011, pp. 508-523).