Abstract:
Text-to-speech synthesizers commonly have three components: text analysis, prosodic modules (timing, intonation), and synthesis. The focus here is on the research underlying the approach that the Bell Labs text-to-speech system takes to the prosodic modules, in particular the timing module. In this approach, greedy algorithms select the text to be recorded so as to maximize coverage of a prosodic feature space; segmental durations are measured in the recorded speech; and exploratory data analysis and robust statistical estimation are applied to find the best-fitting models within a class of multiple-regression-like models (``sums-of-products models'') and to estimate their parameters. This approach raises several issues. (1) Why not use syllabic (or other larger-unit) durations? (2) Why not use general-purpose learning engines, such as classification and regression trees, that require no human intervention? (3) Should one model subsegmental timing, and if so, how can this be done? (4) How can this approach be used for intonation modeling?
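
As an illustration of the text-selection step, the following is a minimal sketch of greedy coverage selection, assuming each candidate sentence has already been reduced to the set of prosodic feature combinations it contains. The function name `greedy_select`, the `max_items` parameter, and the tuple encoding of feature combinations are hypothetical and chosen only for this example; the abstract does not specify the actual feature space or selection criterion used in the Bell Labs system.

```python
from typing import Dict, Hashable, List, Optional, Set


def greedy_select(
    candidates: Dict[str, Set[Hashable]],
    max_items: Optional[int] = None,
) -> List[str]:
    """Pick sentences one at a time, each time taking the sentence that
    covers the most not-yet-covered feature combinations.

    `candidates` maps a sentence id to the set of feature combinations
    (hypothetical encoding) that occur in that sentence.
    """
    # Everything that at least one candidate can cover.
    uncovered: Set[Hashable] = set().union(*candidates.values())
    remaining = dict(candidates)
    chosen: List[str] = []

    while uncovered and remaining:
        if max_items is not None and len(chosen) >= max_items:
            break
        # Sentence with the largest gain in coverage; ties go to the
        # sentence that appears first in `candidates`.
        best = max(remaining, key=lambda s: len(remaining[s] & uncovered))
        gain = remaining[best] & uncovered
        if not gain:
            break  # no remaining sentence adds coverage
        chosen.append(best)
        uncovered -= gain
        del remaining[best]

    return chosen


# Toy data: (segment, stress, position) triples, invented for illustration.
example = {
    "s1": {("a", "stressed", "final"), ("t", "unstressed", "medial")},
    "s2": {("a", "stressed", "final")},
    "s3": {
        ("i", "stressed", "medial"),
        ("t", "unstressed", "medial"),
        ("n", "unstressed", "final"),
    },
}
selected = greedy_select(example)
```

On this toy input the sketch selects two of the three sentences and still covers every feature combination that occurs in the candidate pool, which is the sense in which a greedy procedure keeps the amount of text to be recorded small.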