Abstract:
Short-term analysis of speech, inherited from speech coding, describes a time-varying speech signal as a sequence of vectors, each reflecting properties of a relatively short (10--20 ms) segment of the signal. Each analysis vector provides a single sample from the underlying time-varying speech process. Typically, the whole feature vector is used as a single entity in subsequent processing. For use in automatic speech recognition (ASR), short-term analysis may have some deficiencies: (1) from a single vector alone, there is no way to differentiate between components with different rates of change; and (2) when one or a few components of the vector are corrupted, the subsequent ASR result may be corrupted as well. This is inconsistent with human speech perception, in which (1) many auditory phenomena seem to span at least the length of a syllable, and (2) decoding of a linguistic message need not be severely impaired by partial degradation of the speech signal. Emerging ASR techniques are discussed that attempt to alleviate these deficiencies by (1) performing temporal processing on trajectories of short-term features of speech, and (2) selectively merging information from several subsets of the short-term representation of speech. [Work supported by DoD (MDA-904-94-C-6169) and NSF/ARPA (IRI-9314959).]
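As a purely illustrative sketch (not the techniques discussed in the talk), the following shows what "a sequence of short-term vectors" and "temporal processing on trajectories" can look like in code. Everything here is an assumption made for the example: the 8 kHz sampling rate, the 20 ms frame with a 10 ms hop, the toy two-component feature vector (log energy and zero-crossing count), the synthetic test signal, and the choice of mean removal as the temporal operation. The point it illustrates is that a constant gain shifts the log energy of every frame by the same amount, so it cannot be identified from any single vector, yet it disappears once each feature is treated as a trajectory over time.

```python
import math

def short_term_features(signal, frame_len, hop):
    """Return one toy feature vector [log energy, zero-crossing count] per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append([math.log(energy + 1e-12), float(crossings)])
    return feats

def remove_trajectory_mean(feats):
    """Temporal processing on trajectories: subtract each component's time average."""
    n, dims = len(feats), len(feats[0])
    means = [sum(f[d] for f in feats) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in feats]

# Hypothetical test signal: a 200 Hz tone with a slow 4 Hz amplitude modulation.
RATE = 8000                      # assumed sampling rate in Hz
FRAME_LEN = RATE * 20 // 1000    # 20 ms frame
HOP = RATE * 10 // 1000          # 10 ms hop
clean = [
    (1.0 + 0.5 * math.sin(2 * math.pi * 4 * n / RATE))
    * math.sin(2 * math.pi * 200 * n / RATE)
    for n in range(RATE)
]
scaled = [3.0 * x for x in clean]   # same signal through a constant gain

feats_clean = short_term_features(clean, FRAME_LEN, HOP)
feats_scaled = short_term_features(scaled, FRAME_LEN, HOP)

# Frame by frame, the log energies differ by a constant offset (log 9).
raw_offset = feats_scaled[0][0] - feats_clean[0][0]

# After mean removal along time, the two log-energy trajectories coincide.
norm_clean = remove_trajectory_mean(feats_clean)
norm_scaled = remove_trajectory_mean(feats_scaled)
max_diff = max(abs(a[0] - b[0]) for a, b in zip(norm_clean, norm_scaled))
```

Mean removal is the simplest possible filter on a feature trajectory; it stands in here for temporal processing in general and is not meant to represent any specific method from the work summarized above.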