Abstract:
A fast acoustic modeling method for speaker-independent speech recognition and a speaker-adaptation method that is effective even with only a small amount of speech data are described. The speaker-independent phoneme models are generated by composing representative speaker-dependent phoneme models, which are selected from among all speaker-dependent models by clustering the models, without Baum--Welch parameter re-estimation. This generation method reduces the computational cost of creating the speaker-independent hidden Markov models (HMMs) to approximately 1/20 to 1/50 of that of the Baum--Welch method. The speaker-adaptation algorithm unifies two conventional techniques: maximum a posteriori (MAP) estimation and transfer vector field smoothing. A priori knowledge from the initial models is statistically combined with a posteriori knowledge derived from the adaptation data to compensate for the sparseness of the adaptation data. Transfer vector field smoothing is used to interpolate the untrained parameters. Furthermore, to obtain suitable a priori knowledge of speaker characteristics, a speaker-clustering model, generated from the speech of a selected speaker cluster, is used as the initial model. The cluster selection is performed with a tree-structured speaker-clustering technique that determines the number of speakers and the members of the cluster on the basis of speaker similarity. [Work supported by ART-ITL.]
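
The combination of MAP estimation and transfer vector smoothing can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the textbook MAP update for a Gaussian mean with a prior weight `tau`, exponential distance weights over the `k` nearest trained models, and the hypothetical names `map_mean` and `adapt`. It is not the authors' implementation.

```python
import math


def map_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean (assumed textbook form).

    The prior mean counts as `tau` pseudo-observations, so a small amount
    of adaptation data moves the estimate only part of the way.
    """
    n = len(frames)
    dim = len(prior_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


def adapt(initial_means, data, tau=10.0, k=2, width=1.0):
    """Adapt every mean: MAP where data exists, interpolation elsewhere.

    initial_means: dict mapping model name -> mean vector (initial model)
    data:          dict mapping model name -> list of adaptation frames
    """
    adapted, transfer = {}, {}

    # Step 1: MAP-update the means that have adaptation data and record
    # their transfer vectors (adapted mean minus initial mean).
    for name, frames in data.items():
        if frames:
            adapted[name] = map_mean(initial_means[name], frames, tau)
            transfer[name] = [a - b for a, b in
                              zip(adapted[name], initial_means[name])]

    # Step 2: for untrained means, interpolate a transfer vector from the
    # k nearest trained means (distance measured in the initial model).
    for name, mu0 in initial_means.items():
        if name in adapted:
            continue
        if not transfer:
            adapted[name] = list(mu0)  # no data at all: keep the initial mean
            continue
        nearest = sorted(
            transfer, key=lambda t: math.dist(mu0, initial_means[t]))[:k]
        weights = [math.exp(-math.dist(mu0, initial_means[t]) / width)
                   for t in nearest]
        total = sum(weights)
        if total == 0.0:  # all weights underflowed: fall back to uniform
            weights = [1.0] * len(nearest)
            total = float(len(nearest))
        shift = [sum(w * transfer[t][d] for w, t in zip(weights, nearest)) / total
                 for d in range(len(mu0))]
        adapted[name] = [m + s for m, s in zip(mu0, shift)]

    return adapted
```

In this sketch, a model with no adaptation frames still moves, because it inherits a weighted average of its neighbours' shifts; this is the role the abstract assigns to transfer vector smoothing. A larger `tau` keeps the trained means closer to the initial model.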