Oded Ghitza
M. Mohan Sondhi
Acoust. Res. Dept., AT&T Bell Labs., Rm. 2D-536, Murray Hill, NJ 07974
For many tasks in speech signal processing it is of interest to develop an
objective measure that correlates well with the perceptual distance between
speech segments. (Here, speech segments are pieces of a speech signal with
durations of 50--150 ms.) Such a distance metric would be useful for speech coders at low
bit rates because perturbations introduced by such coders typically last for
several tens of milliseconds. It would also be useful for automatic speech
recognition in adverse conditions. Since human listeners perform well in spite
of gross distortions of the signal (e.g., due to reverberation or noisy
environments), it is reasonable to assume that mimicking human behavior
will improve recognition performance. In this talk, attempts at defining such a
metric will be described. The problem is approached in the framework of the
Diagnostic Rhyme Test (DRT). The errors made by subjects were measured when
judiciously chosen time-frequency ``tiles'' were interchanged between the words
in each DRT word pair [Ghitza,