Abstract:
A hot topic in speech recognition is the development of technology for the automatic transcription of telephone conversations. The recognizer must contain robust language, pronunciation, and acoustic models that embody the world and topic knowledge, and the command of syntax and pronunciation, that the talkers share and use in decoding each other's acoustic signals. Partly because of this shared knowledge and the casual, unprepared nature of the speech, the signals contain dysfluencies, incomplete and ungrammatical expressions, and ``lazy,'' reduced articulation of words. Conversational speech recognition error rates, measured in the NIST Hub-5 evaluations, are 45% for English and 66% to 75% for Spanish, Mandarin, and Arabic. To improve this performance, the shared knowledge must be represented in a mathematical framework that facilitates efficient search over the sentences of the language when decoding the speech. Recent work, including workshops at Rutgers CAIP and Johns Hopkins CLSP, has investigated, among other techniques, multistream processing, frequency warping, adaptation of pronunciation and acoustic models of phones, pronunciation modeling, syllable-based recognition, dysfluency and discourse-state language models, and link grammar parsing. This talk will review how knowledge is represented in the recognizer architecture, the search procedures used in decoding, and the results of these investigations.
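As a sketch of the mathematical framework alluded to above (the abstract does not spell it out, but this is the standard Bayes decision rule used in such recognizers), the language, pronunciation, and acoustic models jointly score candidate word sequences, and the search procedure seeks the maximizing sequence:
\[
\hat{W} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W} P(A \mid W)\, P(W),
\]
where $A$ is the observed acoustic signal, $P(W)$ is supplied by the language model, and $P(A \mid W)$ is computed from the pronunciation and acoustic models of the words in $W$.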