Abstract:
The acoustic correlates of prosody (energy, duration, and fundamental frequency) are easy to extract from the speech signal. Yet prosody has been virtually ignored in large-vocabulary speech recognition. Many systems use energy and its time derivatives in addition to spectral information, and some systems use phoneme duration models. However, these cannot truly be called prosodic models. There are many reasons why it has been difficult to use models of prosody successfully in speech recognition: (a) speech recognition models are based on phoneme-sized acoustic units, while prosodic information spans several levels, ranging from the subphonetic to the phrase level; (b) speech recognition assumes quasi-stationarity and deals primarily in absolute values, while prosodic models are susceptible to such effects as speaking rate and the speaker's F0 range, requiring analysis in terms of relative values, which can cause problems in one-pass real-time systems; (c) the benefits of prosodic modeling in speech recognition are moderate (for English), and the computational cost can be huge. However, there are potential solutions for most of these problems, and as the speech recognition problem gradually becomes a speech understanding problem, more aspects of prosodic modeling will find their way into speech recognition and understanding systems.