Abstract:
Human speech production is usually modeled as an AR process driven by an impulse or white Gaussian noise input, as in LPC analysis. It is well known, however, that the vocal tract produces antiformants as well as formants, and that the voiced source of human speech contains a glottal source component as well as noise. To represent speech production efficiently, a time-varying ARMAX (ARMA with eXogenous input) model with a glottal source model is introduced. The model includes two MA processes, one for the white Gaussian input and the other for the exogenous input, which is generated by the glottal source model. Based on this speech production model, a new speech analysis method is proposed to extract speech characteristics accurately. In the proposed method, the Rosenberg--Klatt (RK) model [Klatt et al., J. Acoust. Soc. Am. 87, 820--857 (1990)] is adopted to generate the glottal wave. The glottal source and ARMAX parameters are estimated pitch synchronously and simultaneously, so that a single error criterion is optimized jointly, using an adaptive MIS method [Miyanaga et al., IEEE Trans. Acoust. Speech Signal Process. ASSP-34, 423--433] for ARMAX identification. Experiments were conducted using synthetic speech generated by an ARMAX process with the RK-model glottal source and a white Gaussian input. The experimental results show that the proposed method can accurately estimate vocal tract and voice source characteristics (such as the open quotient) from the speech signal.
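
As an illustration of the kind of synthetic test signal described above, the following is a minimal sketch, not the authors' implementation. It assumes a time-invariant ARMAX difference equation y[n] = -sum_i a_i y[n-i] + sum_j b_j u[n-j] + sum_k c_k e[n-k] (the abstract's model is time-varying) and a cubic-polynomial glottal flow pulse of the Rosenberg-Klatt type with the spectral-tilt filter omitted. The function names, coefficient values, pitch period, and open quotient are all hypothetical choices made for the example, and the parameter estimation step itself is not shown.

```python
import random


def rk_pulse_train(n_samples, period, open_quotient, amplitude=1.0):
    """Assumed RK-type glottal flow: cubic polynomial while open, zero while closed."""
    n_open = max(1, int(round(open_quotient * period)))  # open-phase length in samples
    u = []
    for n in range(n_samples):
        k = n % period                      # position inside the current pitch period
        if k < n_open:
            x = k / n_open                  # normalized open-phase time, 0 <= x < 1
            # x^2 - x^3 peaks at 4/27 (x = 2/3); 27/4 rescales the peak to `amplitude`
            u.append(amplitude * 6.75 * (x * x - x ** 3))
        else:
            u.append(0.0)
    return u


def armax_synthesize(a, b, c, u, e):
    """Run the assumed time-invariant ARMAX difference equation sample by sample."""
    y = []
    for n in range(len(u)):
        acc = 0.0
        for i, ai in enumerate(a):          # AR part: a[i] multiplies y[n-1-i]
            if n - 1 - i >= 0:
                acc -= ai * y[n - 1 - i]
        for j, bj in enumerate(b):          # MA part on the exogenous (glottal) input
            if n - j >= 0:
                acc += bj * u[n - j]
        for k, ck in enumerate(c):          # MA part on the white Gaussian input
            if n - k >= 0:
                acc += ck * e[n - k]
        y.append(acc)
    return y


# Hypothetical settings: 80-sample pitch period, open quotient 0.6.
rng = random.Random(0)
period, oq = 80, 0.6
u = rk_pulse_train(400, period, oq)
e = [0.01 * rng.gauss(0.0, 1.0) for _ in u]
a = [-1.3, 0.8]   # stable complex pole pair (pole radius sqrt(0.8))
b = [1.0, -0.5]   # zeros acting on the glottal input
c = [1.0, 0.3]    # zeros acting on the noise input
y = armax_synthesize(a, b, c, u, e)
```

In this sketch the open quotient sets the fraction of each pitch period during which the glottal input is nonzero, which is the source characteristic an analysis method would try to recover from y.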