Abstract:
Prediction of segment duration in text-to-speech (TTS) systems has generally relied on arithmetic approaches such as multiplicative and incompressibility models [D. Klatt, J. Acoust. Soc. Am. 54, 1102--1104 (1973); R. Port, J. Acoust. Soc. Am. 69, 262--274 (1981)] and sums-of-products models [J. van Santen, Comput. Speech Lang. 8, 95--128 (1994)]. Other research, however, suggests that speech timing is more complex than such models capture [H. Gopal, J. Phon. 18, 497--518 (1990)]. In this study, a limited domain of vowel duration phenomena is modeled with several designs of simple feedforward networks. The networks' performance is then examined by using their output in our TTS system and evaluating the naturalness of the resulting utterances in a perceptual experiment. Preliminary results indicate that simple two-layer perceptrons are able to learn the basic patterns of environmentally conditioned variation in segment duration, while more sophisticated networks are required to capture the complexities of the interactions among the conditioning factors.
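
For readers unfamiliar with the arithmetic approaches named above, the following is a minimal sketch of how a purely multiplicative model and an incompressibility model differ. It is an illustration only, not the models evaluated in this study: the function names, the 200 ms inherent duration, the 100 ms minimum, and the two shortening factors are all hypothetical values chosen for the example.

```python
def multiplicative_duration(inherent_ms, factors):
    """Purely multiplicative model: every context factor scales the whole duration."""
    duration = inherent_ms
    for k in factors:
        duration *= k
    return duration


def incompressible_duration(inherent_ms, minimum_ms, factors):
    """Incompressibility model: only the portion above a minimum duration is scaled."""
    compressible = inherent_ms - minimum_ms
    for k in factors:
        compressible *= k
    return minimum_ms + compressible


# Hypothetical values: a 200 ms vowel, a 100 ms floor, two shortening contexts.
factors = [0.5, 0.5]
mult = multiplicative_duration(200.0, factors)
incomp = incompressible_duration(200.0, 100.0, factors)
print(mult, incomp)
```

With these assumed values, the multiplicative model lets combined shortening drive the vowel below the floor, whereas the incompressibility model keeps it above the minimum. Neither form can represent arbitrary interactions among factors, which is the motivation for trying learned models.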