Abstract:
A method has been developed for automatically extracting diphone speech segments with context-dependent boundaries. Speech synthesized from the automatically extracted segments was found to be, overall, slightly less intelligible but slightly more natural sounding than speech synthesized from manually extracted diphone segments [Yarrington et al., ``Robust automatic extraction of diphones with variable boundaries,'' in EUROSPEECH '95, 4th European Conference on Speech Communication and Technology, Vol. 3, pp. 1845--1848, Madrid, Spain (1995)]. The lower intelligibility appeared to be due to a small number of very poor diphone segments. While it is feasible to correct this problem by manually replacing the poor diphones, several changes have been made to the extraction procedure to eliminate, or at least reduce, the occurrence of incorrect diphones. In particular, a different spectral measure is now used to estimate spectral similarity, and F0 and a spectral robustness measure have been added as features used in selecting diphones for extraction. Perceptual data will be presented comparing words generated from a diphone inventory constructed using the new algorithm with words generated from the original manually constructed inventory. [Work supported by NIDRR Grant No. H133E30010, the Nemours Research Programs, and the Microsoft Corporation.]