Abstract:
Previous examination of perceivers' eye-movement behavior during audiovisual speech tasks has shown that linguistically relevant visual information is distributed over large regions of the face [Vatikiotis-Bateson et al., ICSLP-94 (1994)]. Furthermore, the simultaneous production of facial and vocal tract deformations suggests a single source of control for the acoustic and visual components of speech production. To examine this possibility further, orofacial motion during speech has been correlated with perioral muscle activity, the time-synchronous behavior of vocal tract articulators, and elements of the speech acoustics (e.g., rms amplitude) using both linear and nonlinear modeling techniques. Not surprisingly, since small motions require small forces, linear techniques such as minimum mean-square error estimation and second-order autoregression provide reasonably good estimates of the inherently nonlinear mapping between muscle EMG and orofacial kinematics. This paper assesses the relative merits of such linear models versus nonlinear, neural-network estimates of the orofacial dynamics.
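
To make the contrast between the two modeling approaches concrete, the sketch below illustrates the general idea on synthetic data: a linear second-order autoregressive fit with EMG input (solved by least squares, i.e., a minimum mean-square error estimate) is compared against a small neural network trained on the same regressors. This is not the authors' implementation; the signal names, lag structure, network size, and use of scikit-learn's MLPRegressor are all illustrative assumptions.

```python
# Illustrative sketch only: linear ARX (second-order autoregression with EMG
# input) versus a small neural network, both estimating a kinematic trace
# from EMG-like signals.  All data here are synthetic stand-ins.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic "EMG" channels and one kinematic channel (e.g., a lip-position
# trace) generated through a mildly nonlinear, history-dependent relation.
n_samples, n_emg = 2000, 4
emg = np.abs(rng.standard_normal((n_samples, n_emg)))
weights = rng.standard_normal(n_emg)
kin = np.tanh(emg @ weights)              # static nonlinearity on EMG drive
for t in range(2, n_samples):             # stable second-order dynamics
    kin[t] += 0.5 * kin[t - 1] - 0.2 * kin[t - 2]
kin += 0.05 * rng.standard_normal(n_samples)   # measurement noise

# Regressors: current EMG plus the two most recent kinematic samples.
X = np.hstack([emg[2:], kin[1:-1, None], kin[:-2, None]])
y = kin[2:]

# Linear estimate: ordinary least squares (minimum mean-square error fit).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_lin = X @ coef

# Nonlinear estimate: a small multilayer perceptron on the same regressors.
mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
y_net = mlp.fit(X, y).predict(X)

def r2(y_true, y_hat):
    """Coefficient of determination, used as the goodness-of-fit index."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(f"linear ARX fit R^2   = {r2(y, y_lin):.3f}")
print(f"neural network R^2   = {r2(y, y_net):.3f}")
```

On data like these, the linear fit already captures most of the variance because the nonlinearity is mild, which mirrors the abstract's point that small motions (and hence small forces) keep the EMG-to-kinematics mapping nearly linear; the network's advantage appears mainly where the static nonlinearity is strong.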