Abstract:
In current speech recognition technology, the matching measure between a hypothesis and the corresponding speech segment is usually defined in terms of the HMM likelihood. As is well known, however, the likelihood is a relative measure, and some form of normalization is necessary when hypotheses corresponding to different speech segments are compared. The aim of this paper is to show that the mutual information, or, equivalently, the likelihood normalized by the probability of the speech segment, is a better acoustic matching measure than the likelihood. An ergodic HMM was used to estimate the speech probability; an all-phone model was also tried as a speech probability estimator for comparison. An HMM-based connected word recognition algorithm was used to generate recognition hypotheses, which were then scored according to each of the above matching measures. An ART 200-sentence English speech database was used for the experiments. Evaluation was conducted from several points of view, including recognition rate and word-end detection power. The results show that the mutual information calculated with an ergodic HMM significantly outperforms the likelihood as an acoustic matching measure.
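
To make the normalization concrete, the following is a minimal sketch in Python, not the implementation used in the paper. It assumes the mutual information is computed in the log domain as log P(X|W) - log P(X), where log P(X|W) is the HMM log-likelihood of a hypothesis W for a segment X and log P(X) is the log-probability of the segment given by a speech probability estimator such as an ergodic HMM. The function name, the dictionary fields, and all numeric scores are hypothetical and chosen only for illustration.

```python
def mutual_information_score(log_likelihood, log_speech_prob):
    """Log-domain normalized score: log P(X|W) - log P(X)."""
    return log_likelihood - log_speech_prob


# Two hypotheses covering speech segments of different lengths.
# All scores are invented for illustration (assumption, not data from the paper):
#   log_lik : log P(X|W), e.g. from a word HMM
#   log_px  : log P(X), e.g. from an ergodic HMM over the same segment
hypotheses = [
    {"word": "A", "frames": 20, "log_lik": -120.0, "log_px": -118.0},
    {"word": "B", "frames": 50, "log_lik": -290.0, "log_px": -300.0},
]

# Raw likelihood favors the shorter segment simply because fewer frames
# contribute negative log-probability terms.
best_by_likelihood = max(hypotheses, key=lambda h: h["log_lik"])

# The normalized score asks how much better the hypothesis explains the
# segment than the general speech model does, independent of segment length.
best_by_mi = max(
    hypotheses,
    key=lambda h: mutual_information_score(h["log_lik"], h["log_px"]),
)

print(best_by_likelihood["word"], best_by_mi["word"])
```

With these invented numbers the raw likelihood prefers hypothesis A, while the normalized score prefers hypothesis B, which illustrates why a relative measure can mislead when the compared segments differ.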