Abstract:
A novel pitch-synchronous analysis--synthesis method has been developed in which a Rosenberg--Klatt (RK) model simulates the glottal waveform of voiced speech and an autoregressive model with exogenous input (ARX) represents the speech production process. Formant frequencies and bandwidths are estimated automatically from the ARX parameters, while the RK parameter values are obtained by simulated annealing. All estimated parameter values can be used to resynthesize speech from the ARX equation, and by modifying the values estimated from an utterance, speech of various voice qualities can be synthesized. Three adult male Japanese speakers uttered the same sentence, |aoiueoie| (``Say blue top'' in English). The formant frequency and bandwidth contours and the voicing source parameters of each speaker were systematically modified to simulate the voice quality of the other speakers, and the perceptual contribution of each parameter to speaker individuality was studied quantitatively. The dynamic characteristics of the first two formant frequencies were found to be the most important, and the glottal source parameters the least significant, in describing the individuality of the male speakers.
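
As an illustration of how formant frequencies and bandwidths relate to autoregressive coefficients, the sketch below uses the textbook mapping between a complex pole pair and a formant. It is not the estimation procedure of this work, which the abstract does not detail. The function names, the sampling rate, and the restriction to a single second-order section 1 + a1 z^-1 + a2 z^-2 are assumptions made for the example.

```python
import math

def formant_to_section(freq_hz, bw_hz, fs):
    """Return (a1, a2) of the section 1 + a1*z^-1 + a2*z^-2 for one formant.

    Assumes the standard mapping: pole radius r = exp(-pi*B/fs),
    pole angle theta = 2*pi*F/fs.
    """
    r = math.exp(-math.pi * bw_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    return -2.0 * r * math.cos(theta), r * r

def section_to_formant(a1, a2, fs):
    """Recover (frequency, bandwidth) in Hz from a complex-conjugate pole pair."""
    if a1 * a1 >= 4.0 * a2:
        raise ValueError("poles are real; the section does not describe a formant")
    r = math.sqrt(a2)                      # pole radius
    cos_theta = max(-1.0, min(1.0, -a1 / (2.0 * r)))
    theta = math.acos(cos_theta)           # pole angle in radians
    freq_hz = theta * fs / (2.0 * math.pi)
    bw_hz = -math.log(r) * fs / math.pi
    return freq_hz, bw_hz

if __name__ == "__main__":
    fs = 10000.0  # hypothetical sampling rate in Hz
    a1, a2 = formant_to_section(700.0, 80.0, fs)
    print(section_to_formant(a1, a2, fs))
```

For a full autoregressive polynomial, the same mapping would be applied to each complex root with positive imaginary part, so modifying a formant contour amounts to moving the corresponding pole pair frame by frame.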