Abstract:
A TTS voice quality experiment was conducted to select a speaker and to evaluate synthesis techniques. Small-scale TTS diphone inventories were recorded from six professional female speakers who had been pre-selected in an audition. Two types of inventories were recorded for each speaker: a series of nonsense words and a series of English sentences. Using these 12 inventories, two synthesis methods were compared: PSOLA [Charpentier and Moulines, Eurospeech '89] and the Harmonic plus Noise Model (HNM) [Stylianou et al., Eurospeech '97]. Synthetic prosody closely modeled naturally spoken versions of the target utterances. Three fully synthetic (TTS) and two hybrid (i.e., partly recorded from the human speaker and partly synthesized) sentences formed the experimental stimuli for subjective testing. As references, two MNRU versions of the naturally spoken sentences were used: (a) Q10 (resembling low-end commercial 16-kbps encoded speech) and (b) Q35 (resembling high-quality telephone speech). Forty-one subjects rated intelligibility [I], naturalness [N], and pleasantness [P] on five-point MOS scales. A total of 936 ratings were collected from each subject. Repeated-measures analyses of variance (ANOVAs) were performed on the data. There were significant main effects of speaker, synthesis method, and inventory, plus interactions. It was found that (1) the best speaker consistently outperformed the others on all three rating scales; for the optimal combination of parameters, TTS ratings ranged (across speakers) as follows: [I] 3.64--2.94, [N] 3.36--2.7, [P] 3.34--2.53; (2) HNM outperformed PSOLA (consistently 0.25 points higher for [I], [N], and [P] scores); and (3) the diphone inventory extracted from sentences was preferred over that extracted from nonsense words (with a significantly smaller difference for HNM, 0.10, than for PSOLA, 0.19).
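
The MNRU reference conditions can be illustrated with a minimal sketch. It assumes the basic multiplicative-noise form described in ITU-T Recommendation P.810, y[n] = x[n] * (1 + 10^(-Q/20) * g[n]) with unit-variance Gaussian noise g, and it omits the band-limiting filters of a full MNRU. The function names and the test tone are hypothetical and are not taken from the study.

```python
import math
import random


def mnru(samples, q_db, seed=0):
    """Add signal-correlated (multiplicative) noise at a ratio of q_db decibels.

    Simplified form: y[n] = x[n] * (1 + 10**(-q_db / 20) * g[n]), g[n] ~ N(0, 1).
    Assumption: no band-pass filtering, unlike a full P.810 implementation.
    """
    rng = random.Random(seed)            # fixed seed keeps the sketch reproducible
    gain = 10.0 ** (-q_db / 20.0)        # noise gain relative to the signal
    return [x * (1.0 + gain * rng.gauss(0.0, 1.0)) for x in samples]


def snr_db(clean, degraded):
    """Ratio of signal energy to added-noise energy, in decibels."""
    signal = sum(x * x for x in clean)
    noise = sum((y - x) ** 2 for x, y in zip(clean, degraded))
    return 10.0 * math.log10(signal / noise)


# Hypothetical stand-in for a natural recording: 1 s of a 200 Hz tone at 8 kHz.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(8000)]
q10 = mnru(tone, 10)   # heavily degraded reference
q35 = mnru(tone, 35)   # lightly degraded reference
```

Because the noise is scaled by the signal itself, the measured signal-to-noise ratio of each output stays close to the chosen Q value regardless of the input level, which is what makes Q a convenient anchor for listening tests.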