Abstract:
Unsupervised instantaneous adaptation, which uses only the input utterance itself for adaptation, is the ideal speaker adaptation method for speech recognition and is expected to be useful in a wide range of applications. Because voice individuality is phoneme dependent, speaker adaptation must be performed in a model-dependent manner. However, a complete and correct model sequence, that is, what was actually spoken, cannot be obtained for each input utterance, especially for speakers who incur many recognition errors with speaker-independent models. How to perform model-dependent adaptation without knowing the correct model sequence is therefore a crucial issue. Hypothesizing every possible model sequence and using each for adaptation would require an enormous amount of computation. This paper proposes a new adaptation method in which N-best hypotheses are generated by applying speaker-independent phone models to each input utterance, and speaker adaptation based on a constrained MAP estimation technique is then applied to each hypothesis. With this method, the likelihood of a correct hypothesis that is ranked low by the speaker-independent models rises, and recognition accuracy increases as a result. Experimental results on several continuous speech recognition tasks show that the method improves recognition accuracy, even for speakers whose accuracy with speaker-independent models is very low.
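The N-best adaptation loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes scalar features, one Gaussian per phone, hypotheses that are already aligned frame by frame, and a plain MAP mean update with prior weight `tau` standing in for the constrained MAP estimation, whose details the abstract does not give. All function names and the toy data are hypothetical.

```python
import math

def log_likelihood(frames, labels, models):
    """Sum of per-frame Gaussian log-densities for a frame-level phone labeling."""
    total = 0.0
    for x, phone in zip(frames, labels):
        mean, var = models[phone]
        total += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return total

def map_adapt_means(frames, labels, si_models, tau):
    """MAP update of each phone mean; the speaker-independent mean is the prior.

    Assumption: only means are adapted, and tau is a hypothetical prior weight.
    """
    sums, counts = {}, {}
    for x, phone in zip(frames, labels):
        sums[phone] = sums.get(phone, 0.0) + x
        counts[phone] = counts.get(phone, 0) + 1
    adapted = dict(si_models)
    for phone, n in counts.items():
        prior_mean, var = si_models[phone]
        adapted[phone] = ((tau * prior_mean + sums[phone]) / (tau + n), var)
    return adapted

def rescore_nbest(frames, nbest, si_models, tau=2.0):
    """Adapt the models to each hypothesis in turn and return the best-scoring one."""
    scored = []
    for labels in nbest:
        adapted = map_adapt_means(frames, labels, si_models, tau)
        scored.append((log_likelihood(frames, labels, adapted), labels))
    return max(scored, key=lambda pair: pair[0])[1]

# Toy data: a speaker whose phones are shifted away from the SI means.
si_models = {"a": (0.0, 1.0), "b": (4.0, 1.0)}
frames = [2.4, 2.4, 2.4, 6.4, 6.4, 6.4]
nbest = [["b"] * 6, ["a"] * 3 + ["b"] * 3]  # second entry is the intended sequence

si_best = max(nbest, key=lambda h: log_likelihood(frames, h, si_models))
adapted_best = rescore_nbest(frames, nbest, si_models)
```

In this toy case the speaker-independent models prefer the all-"b" hypothesis, while rescoring after per-hypothesis adaptation prefers the intended sequence. This mirrors the effect described above, in which a correct hypothesis ranked low by the speaker-independent models gains likelihood. The prior weight matters: without it, adaptation could fit any hypothesis too closely.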