Abstract:
A phonetic hidden Markov model (HMM) labeler was developed for use in an automatic diphone extraction system. Using the TIMIT database for training, the best labeler performance was obtained when the number of states in each phoneme model was proportional to the average duration of that phoneme over the TIMIT database, and the HMM topology allowed both repetition of states and skipping of adjacent states. The acoustic feature set consisted of 30 features per frame: 8 Mel cepstral coefficients, the log rms amplitude, and the zero-crossing rate of the frame, together with the first and second time derivatives of these 10 values. Labeling accuracy was tested on all speech files in the TIMIT test set and was assessed in terms of the separation between labeler-assigned phoneme boundaries and the nominal phoneme boundaries provided in the database. Of the boundaries assigned by the labeler, 97.4% were within 30 ms of the nominal phoneme boundaries, and 86.8% were within 10 ms (i.e., one analysis frame).
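As an illustration only, the model topology and the boundary-tolerance measure described above might be sketched as follows. The function names, the 20 ms-per-state constant, the uniform initial transition probabilities, and the reading of "skipping" as a jump over one state are all assumptions made for this sketch; none of them is taken from the system itself.

```python
def num_states(mean_duration_ms, ms_per_state=20.0, min_states=1):
    """State count proportional to a phoneme's mean duration.

    ms_per_state is a hypothetical proportionality constant.
    """
    return max(min_states, int(round(mean_duration_ms / ms_per_state)))


def left_to_right_skip_transitions(n_states):
    """Initial transition matrix for a left-to-right model.

    Each state may repeat (i -> i), advance (i -> i+1), or skip one
    state (i -> i+2). Allowed moves start with equal probability
    (an assumption); training would re-estimate them. The exit
    transition into the next phoneme model is omitted here.
    """
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        targets = [j for j in (i, i + 1, i + 2) if j < n_states]
        for j in targets:
            matrix[i][j] = 1.0 / len(targets)
    return matrix


def fraction_within(auto_ms, ref_ms, tol_ms):
    """Fraction of labeler boundaries within tol_ms of the reference.

    Assumes the two lists are already paired one-to-one.
    """
    if len(auto_ms) != len(ref_ms) or not ref_ms:
        raise ValueError("boundary lists must be non-empty and equal in length")
    hits = sum(1 for a, r in zip(auto_ms, ref_ms) if abs(a - r) <= tol_ms)
    return hits / len(ref_ms)


if __name__ == "__main__":
    # Made-up boundary times in milliseconds, for demonstration only.
    auto = [100.0, 205.0, 330.0]
    ref = [100.0, 200.0, 300.0]
    print(num_states(80.0))
    print(left_to_right_skip_transitions(4)[0])
    print(fraction_within(auto, ref, 10.0), fraction_within(auto, ref, 30.0))
```

Calling `fraction_within` with tolerances of 10 ms and 30 ms corresponds to the two accuracy figures quoted in the abstract, given paired boundary lists.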