Abstract:
A mathematical framework based on maximum-likelihood stochastic matching is proposed to perform feature and model compensation for robust speech recognition. Speech recognition is often formulated as a matching problem between the feature vectors extracted from a test utterance and a set of speech models or patterns obtained from some training corpora. It is well known that a speech recognizer often degrades in performance when the testing data are not acoustically similar to the training data. One way to improve robustness is to find features that are invariant under all acoustic conditions and distortions; since such invariance is difficult to achieve in practice, some form of compensation is often required. The proposed stochastic matching approach assumes a structure or form for the feature and/or model transformations. Together with a set of nuisance parameters, the transformations approximate the distortion in the test utterance. To decrease the acoustic mismatch between a test utterance and a given set of speech models, e.g., hidden Markov models, the stochastic matching algorithm estimates the nuisance parameters and then applies the feature/model transformations during speech recognition. Simple channel distortion can be approximated with linear transformations. For more complicated distortions, such as environmental, speaker, and combined mismatches, nonlinear compensating transformations are needed. These compensations significantly improve recognition performance over systems without them when utterances are affected by additive ambient noise and convolutional channel distortion.
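As an illustration only, the simplest case described above (estimating nuisance parameters by maximum likelihood, then applying a feature transformation) can be sketched for an additive feature-space bias. This is a minimal sketch under stated assumptions, not the paper's implementation: a diagonal-covariance Gaussian mixture stands in for the hidden Markov models, the transformation is restricted to a single bias vector, and the names `estimate_bias`, `weights`, `means`, and `variances` are hypothetical.

```python
import numpy as np


def estimate_bias(Y, weights, means, variances, n_iter=20):
    """EM estimate of an additive bias b so that Y - b best fits a fixed GMM.

    Assumptions (not from the source): the speech model is a
    diagonal-covariance GMM and the distortion is y_t = x_t + b.

    Y:         (T, D) observed (distorted) feature vectors
    weights:   (K,)   mixture weights
    means:     (K, D) component means
    variances: (K, D) component variances (diagonal covariances)
    """
    T, D = Y.shape
    b = np.zeros(D)  # start from "no distortion"
    for _ in range(n_iter):
        # E-step: component posteriors given the currently compensated features.
        X = Y - b
        diff = X[:, None, :] - means[None, :, :]              # (T, K, D)
        log_p = -0.5 * np.sum(
            diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2
        )                                                     # (T, K)
        log_p += np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)             # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)             # (T, K)

        # M-step: closed-form, precision-weighted update of the bias.
        resid = Y[:, None, :] - means[None, :, :]             # (T, K, D)
        num = np.sum(gamma[:, :, None] * resid / variances, axis=(0, 1))
        den = np.sum(gamma[:, :, None] / variances, axis=(0, 1))
        b = num / den
    return b


if __name__ == "__main__":
    # Synthetic demo: draw clean features from the model, then shift them.
    rng = np.random.default_rng(0)
    means = np.array([[-3.0, -3.0], [3.0, 3.0]])
    variances = np.ones((2, 2))
    weights = np.array([0.5, 0.5])
    comp = rng.integers(0, 2, size=2000)
    X = means[comp] + rng.standard_normal((2000, 2))
    Y = X + np.array([0.5, -0.3])       # simulated mismatch
    b_hat = estimate_bias(Y, weights, means, variances)
    print("estimated bias:", b_hat)
    compensated = Y - b_hat             # features passed on to recognition
```

In this sketch the bias plays the role of the nuisance parameters: it is estimated from the test data alone, with the trained model held fixed, and the compensated features `Y - b_hat` are what a recognizer would then score. Richer transformations, such as the nonlinear ones mentioned above, would need a different M-step.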