Abstract:
In a text-to-speech system, a well-constructed prosodic grammar is of primary importance. Extensive prosodic components have thus been incorporated into French speech synthesis systems. However, relatively little work has concentrated on the factors that control speech fluency [B. Zellner, ``Pauses and the temporal structure of speech,'' in Fundamentals of Speech Synthesis and Speech Recognition, edited by E. Keller (Wiley, Chichester, 1994), pp. 41--62]. It was hypothesized that the lack of fluency observed in most French synthesizers was mainly due to insufficiencies in their temporal structure. This hypothesis originated in the observation that the timing of an utterance was most often inferred from its accentual structure, just as in English. However, French accent structure is known to differ markedly from English accent structure, which renders this assumption suspect. A temporal model was thus developed, based on psycholinguistic evidence. The model required no initial accentual information, although such information could be added in later processing to produce local modifications of segment durations. Initial assessments of the model are promising: a high correlation was observed between synthesized and natural speech with respect to measures of syllabic duration.