Abstract:
Today, synthetic speech is often based on the concatenation of units of natural speech, such as polyphones. So far, there have mainly been two ways of adding a visual modality to such synthesis: morphing between single images or concatenating video sequences. In this study, a new method is presented in which recorded natural movements of points on the face are used to control an animated face. While the polyphones to be used for synthesis are recorded, the movements of a set of markers placed on the speaker's face are simultaneously registered in three dimensions. Each acoustic polyphone is thus paired with a perfectly synchronized set of dynamic movement patterns: a visual polyphone. These visual polyphones are concatenated in the same manner as the acoustic polyphones and used to control the movements of an animated face. In the presented system, demisyllables serve as units for both the acoustic and the visual polyphones, and these units appear to capture well the coarticulation process in natural speech.
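The concatenation step described above can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: it assumes each visual polyphone is stored as a list of frames, each frame a list of 3D marker positions recorded in sync with its acoustic demisyllable, and it smooths each join with a short linear cross-fade (the overlap length and blending scheme are assumptions for illustration only).

```python
# Hypothetical sketch of visual-polyphone concatenation.
# A "unit" is a list of frames; each frame is a list of (x, y, z)
# marker positions captured in sync with one acoustic demisyllable.

def crossfade(a, b, n):
    """Linearly blend the last n frames of unit a with the first n of b."""
    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight given to the incoming unit b
        frame = [
            tuple((1 - w) * pa + w * pb for pa, pb in zip(ma, mb))
            for ma, mb in zip(a[len(a) - n + i], b[i])
        ]
        blended.append(frame)
    return blended

def concatenate_visual_polyphones(units, overlap=2):
    """Join per-demisyllable marker trajectories into one sequence,
    cross-fading over `overlap` frames at each unit boundary."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        out = out[:len(out) - n] + crossfade(out, unit, n) + unit[n:]
    return out
```

The resulting trajectory could then drive the control points of an animated face frame by frame, in step with the concatenated acoustic signal.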