Abstract:
A set of automatic algorithms expresses the acoustic vowel signal as the ratio series $\log(f_j/F_0^{2/3})$, where the $f_j$ are the first 32 integer multiples of $F_0$, plus an additional 32 $\log(f_j/F_0^{2/3})$ delta terms that represent vowel trajectories. $F_0$ is measured window by window by an algorithm that eliminates the smearing introduced by conventional windowing. To reduce the dimensionality of this representation, the 64 terms are summarized by 11 + 11 terms of the cosine series. On ten monosyllabic words spoken by 137 men, women, and children, the model classifies vowels to within one percent of human accuracy. Because the model uses none of the circular definitions of formant measurement and makes only one assumption unique to speech, namely the $-\log F_0^{1/3}$ offset, the $\log(f_j/F_0^{2/3})$ contour is superior to a formant representation of voiced speech. Because a Euclidean classifier is as accurate as a quadratic discriminant function that uses more than ten times as many degrees of freedom, it is argued that the $\log(f_j/F_0^{2/3})$ transform may be accomplished by a genetically acquired neural mapping of the acoustic signal, one that facilitates infants' learning of vowel categories.
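The pipeline the abstract describes can be sketched numerically. The Python below is a minimal, hypothetical illustration rather than the authors' implementation: it assumes the harmonic frequencies $f_j$ and window-by-window $F_0$ estimates are already available, reads the "11 + 11 terms from the cosine series" summary as a discrete cosine transform over the time-averaged contours, and approximates the delta terms with a frame-to-frame gradient. The function names and both summarization choices are assumptions.

```python
import numpy as np
from scipy.fft import dct

N_HARMONICS = 32   # first 32 integer multiples of F0 (per the abstract)
N_COEFFS = 11      # cosine-series terms kept for static and delta contours

def log_ratio_terms(f_harmonics, f0):
    """log(f_j / F0^(2/3)) for one window's 32 harmonic frequencies."""
    return np.log(f_harmonics / f0 ** (2.0 / 3.0))

def utterance_features(f_harmonics_t, f0_t):
    """11 + 11 cosine-series coefficients for one utterance.

    f_harmonics_t : (T, 32) harmonic frequencies per analysis window
    f0_t          : (T,)    window-by-window F0 estimates
    """
    ratios = np.array([log_ratio_terms(f, f0)
                       for f, f0 in zip(f_harmonics_t, f0_t)])  # (T, 32)
    deltas = np.gradient(ratios, axis=0)  # assumed stand-in for delta terms
    static = dct(ratios.mean(axis=0), norm='ortho')[:N_COEFFS]
    delta = dct(deltas.mean(axis=0), norm='ortho')[:N_COEFFS]
    return np.concatenate([static, delta])  # 22-term feature vector

def classify(feature, centroids):
    """Euclidean nearest-centroid vowel classification."""
    labels = list(centroids)
    dists = [np.linalg.norm(feature - centroids[v]) for v in labels]
    return labels[int(np.argmin(dists))]
```

The nearest-centroid step mirrors the abstract's point that a plain Euclidean classifier on this 22-term representation matches a quadratic discriminant with far more degrees of freedom.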