Abstract:
In statistical speech recognition, speaker-independent models are usually trained on speech samples from a large number of speakers. Such models have wider feature distributions, and hence greater overlap between different phones, than adequately trained speaker-dependent models. To cope with this interspeaker variability, a method of speech feature normalization based on affine transformation has been presented [P. Luo and K. Ozeki, Tech. Rep. of IEICE, SP96-10 (1996)]. Prior to HMM training, the feature vectors of each speaker are mapped to those of a reference speaker by an affine transformation estimated from a small amount of training data. The transformation, which is phone independent and speaker dependent, is also applied to the feature vectors of unknown speakers in the recognition stage. It has been shown experimentally that this method is effective in reducing interspeaker variations in the cepstral domain. In this paper, the performance and limitations of the method are discussed further within the framework of continuous HMM. Practical issues related to the selection of an appropriate reference speaker are also discussed.
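
The mapping described above can be illustrated with a minimal sketch. The abstract does not state how the affine transformation is estimated, so the code below assumes an ordinary least-squares fit over pairs of feature vectors that have already been time-aligned between a speaker and the reference speaker (for example by dynamic time warping); both the criterion and the alignment step are assumptions, and the function names `estimate_affine` and `apply_affine` are hypothetical.

```python
import numpy as np


def estimate_affine(src, ref):
    """Fit A, b so that ref is approximated by src @ A.T + b (least squares).

    src, ref: arrays of shape (n_frames, dim) holding time-aligned feature
    vectors of one speaker and of the reference speaker. The alignment itself
    is assumed to have been done beforehand.
    """
    n_frames = src.shape[0]
    # Append a constant 1 to every frame so the bias b is estimated jointly.
    src_aug = np.hstack([src, np.ones((n_frames, 1))])
    # Solve src_aug @ W = ref in the least-squares sense; W is (dim + 1, dim).
    W, *_ = np.linalg.lstsq(src_aug, ref, rcond=None)
    A = W[:-1].T  # linear part, shape (dim, dim)
    b = W[-1]     # offset, shape (dim,)
    return A, b


def apply_affine(x, A, b):
    """Map feature vectors x of shape (n_frames, dim) toward the reference speaker."""
    return x @ A.T + b
```

In this sketch a single pair (A, b) is estimated per speaker and shared across all phones, which matches the phone-independent, speaker-dependent property stated in the abstract. The same `apply_affine` call would be used on training speakers before HMM training and on an unknown speaker at recognition time, once that speaker's transformation has been estimated.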