Abstract:
This work investigates the use of prosodic features in modeling disfluencies (filled pauses, repeated words, and self-repairs) in spontaneous speech. The main goal is to automatically detect and correct disfluencies, so that a ``fluent'' version of a disfluent utterance can be used as input for speech understanding and other applications. A second goal is to develop explicit acoustic and language models for disfluencies to improve speech recognition performance for spontaneous speech. The prosodic features examined include duration, fundamental frequency, amplitude, and features correlated with voice quality. Decision trees serve as the acoustic models that relate these prosodic features to disfluency events. To integrate the disfluency model into speech recognition, decision tree probabilities are combined with standard acoustic model scores and probabilities from a ``Cleanup'' language model to rescore N-best hypotheses. The Cleanup language model represents disfluencies as hidden events, and predicts the words following a disfluency from the corresponding fluent word sequence. A linguistically hand-annotated version of the Switchboard corpus is used for model training and evaluation. [Work supported by the National Science Foundation under Grants No. IRI-9314967 and No. IRI-8905249.]
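
The N-best rescoring step described above can be illustrated with a minimal sketch. The abstract does not say how the three knowledge sources are combined, so the weighted sum of log scores below is an assumption, as are all names (`Hypothesis`, `combined_score`, `rescore`), the weights, and the toy scores; none of them come from the original work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One entry of an N-best list (hypothetical structure, for illustration)."""
    words: List[str]
    acoustic_logprob: float   # standard acoustic model score, log domain
    lm_logprob: float         # Cleanup language model log probability
    prosody_logprob: float    # summed log decision-tree probabilities of disfluency events


def combined_score(h: Hypothesis, w_ac: float = 1.0, w_lm: float = 1.0,
                   w_pros: float = 1.0) -> float:
    # Assumed combination: weighted sum of log scores (log-linear interpolation).
    return (w_ac * h.acoustic_logprob
            + w_lm * h.lm_logprob
            + w_pros * h.prosody_logprob)


def rescore(nbest: List[Hypothesis], **weights: float) -> List[Hypothesis]:
    # Return the hypotheses sorted from best to worst combined score.
    return sorted(nbest, key=lambda h: combined_score(h, **weights), reverse=True)


# Toy N-best list with made-up scores: one hypothesis contains a repeated
# word, the other does not.
nbest = [
    Hypothesis(["i", "i", "want"], acoustic_logprob=-10.0,
               lm_logprob=-6.0, prosody_logprob=-1.0),
    Hypothesis(["i", "want"], acoustic_logprob=-9.5,
               lm_logprob=-5.0, prosody_logprob=-4.0),
]

best_with_prosody = rescore(nbest)[0]
best_without_prosody = rescore(nbest, w_pros=0.0)[0]
```

In this toy example the prosodic term changes which hypothesis is ranked first; setting `w_pros=0.0` falls back to acoustic and language model scores only. In practice the weights would have to be tuned on held-out data.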