Abstract:
This paper describes a method for producing high-quality speech synthesis without signal processing. It indexes and resequences phone-sized segments from a prerecorded speech corpus to create novel utterances that reproduce the voice characteristics and speaking style of the original speaker. It describes procedures for indexing and retrieval using pointers into an external speech corpus, which enable the synthesizer to be both language- and speaker-independent. The prosody-based unit selection process does not itself produce speech sounds; instead, it yields an index for a ``random-access'' retrieval sequence into the original speech, producing the closest approximation to a desired utterance from the segments available in a given speech corpus. To find the optimal sequence of segments for concatenation, the synthesizer first creates an inventory of phones and their acoustic and prosodic characteristics, and then selects from among these using a weighted combination of features, giving an index of the segment sequence that best matches the target prosody while minimizing intersegment acoustic discontinuities. To date, it has been tested on English, Japanese, Korean, and German using more than 20 speech corpora.
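The selection step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Unit`, `Target`, `target_cost`, `join_cost`, and `select_units`, the choice of features (mean F0, duration, a single boundary value), the normalizing constants, and the weights are all assumptions made for illustration. The abstract does not state the search procedure; a dynamic-programming (Viterbi-style) search is assumed here. The result is a list of pointers into the corpus rather than any audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One phone-sized segment in the inventory (hypothetical layout)."""
    corpus_index: int   # pointer into the external speech corpus
    phone: str
    f0: float           # mean pitch in Hz
    duration: float     # seconds
    edge_in: float      # assumed 1-D acoustic feature at the left boundary
    edge_out: float     # assumed 1-D acoustic feature at the right boundary


@dataclass(frozen=True)
class Target:
    """Desired phone and prosody for one position in the utterance."""
    phone: str
    f0: float
    duration: float


def target_cost(t, u, w_f0=1.0, w_dur=1.0):
    # Weighted prosodic distance; 100 Hz and 0.1 s are assumed scale factors.
    return w_f0 * abs(t.f0 - u.f0) / 100.0 + w_dur * abs(t.duration - u.duration) / 0.1


def join_cost(prev, nxt):
    # Segments adjacent in the corpus join without any discontinuity.
    if nxt.corpus_index == prev.corpus_index + 1:
        return 0.0
    return abs(prev.edge_out - nxt.edge_in)


def select_units(targets, inventory, w_join=1.0):
    """Return corpus indices of the lowest-cost segment sequence."""
    candidates = [[u for u in inventory if u.phone == t.phone] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("no candidate segment for at least one target phone")

    # best[j]: cost of the cheapest path ending in candidates[i][j]
    best = [target_cost(targets[0], u) for u in candidates[0]]
    back = []  # back[i-1][j]: best predecessor of candidates[i][j]
    for i in range(1, len(targets)):
        new_best, pointers = [], []
        for u in candidates[i]:
            costs = [best[j] + w_join * join_cost(p, u)
                     for j, p in enumerate(candidates[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + target_cost(targets[i], u))
            pointers.append(j_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path back from the final position.
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j].corpus_index for i, j in enumerate(path)]
```

In this sketch the returned indices play the role of the retrieval sequence: a separate playback step would fetch the corresponding stretches of the original recordings in that order. Because contiguous corpus segments have zero join cost, the search tends to prefer longer runs of naturally adjacent speech when their prosody is close enough to the target.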