Re: Speech and bottom-up processes (grant)


Subject: Re: Speech and bottom-up processes
From:    grant  <grant(at)NICOM.COM>
Date:    Wed, 1 Apr 1998 15:03:52 -0500

Bruno H. Repp wrote:

> Repp, B. H., Frost, R., & Zsiga, E. (1992). Lexical mediation between
> sight and sound in speechreading. The Quarterly Journal of Experimental
> Psychology, 45A, 1-20.
>
> There we showed that the detectability of the presence of speech
> in noise was not influenced by simultaneous visual exposure to matching or
> nonmatching articulations (lipreading), which again implies an autonomous
> auditory bottom-up process for speech detection. The lexical status of
> the stimuli (words or nonwords) affected response bias, but not sensitivity.
> A similar finding for printed words as the visual stimuli, rather than
> lipreading, was reported by
>
> Frost, R., Repp, B. H., & Katz, L. (1988). Can speech perception be
> influenced by simultaneous presentation of print? Journal of Memory and
> Language, 27, 741-755.
>
> All three studies provide evidence that the presence of speech
> can be detected on the basis of acoustic cues long before it is even
> partially recognized. The detection threshold (70% correct, with 50%
> being chance) for speech in broadband noise is about -28 dB of S/N ratio,
> and that in amplitude-modulated (signal-correlated) noise is about
> -13 dB (Repp & Frost, 1988). Both thresholds are well below the
> intelligibility thresholds.

With all due respect to Bruno, Frost, and Zsiga, we have begun to replicate these studies on the possible influence of simultaneous lip movements on auditory detection, because the original Repp et al. study had serious problems in several respects. First, let me tell you the outcome before I explain why Repp et al. failed to find the "correct" result.

We are running an adaptive speech detection experiment with three different sentence targets. The sentences are presented in a background of white noise under three conditions: auditory alone, auditory plus simultaneously presented matching lipread information, and auditory plus simultaneously presented mismatched lipread information. The task is a two-interval forced-choice procedure in which the subjects indicate the interval that contains the speech plus noise. We are using a 3-down, 1-up adaptive procedure tracking the 79% point on the psychometric function. The speech is held constant, whereas the noise level is controlled by the adaptive track.

The results show a 1-4 dB release from masking, or bimodal coherence masking protection, depending on the sentence. We are currently looking into the correlation, on a sentence-by-sentence basis, between the time course of lip opening and the rms amplitude fluctuations in the speech signals, both broadband and in selected spectral bands (especially the F2 region), as an explanation for the differences across sentences. These results indicate that cross-modal comodulation between visual and acoustic signals can reduce stimulus uncertainty in auditory detection and thereby lower detection thresholds. These results will be reported at the upcoming ASA meeting in Seattle (June).
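For anyone who wants the adaptive rule spelled out, here is a minimal Python sketch of a 3-down, 1-up two-interval forced-choice track. The simulated listener, starting SNR, step size, and reversal count are illustrative assumptions, not values from our experiment; the point is only that the 3-down, 1-up rule converges on roughly the 79% point of the psychometric function.

    # Minimal sketch of a 3-down/1-up adaptive track for a 2IFC detection task.
    # Three consecutive correct responses lower the SNR (harder); any incorrect
    # response raises it (easier). This rule converges on ~79.4% correct.
    # The simulated listener below is purely illustrative, not real data.

    import math
    import random

    def simulated_listener(snr_db, threshold_db=-20.0, slope=1.0):
        """Probability-correct model for one 2IFC trial (assumed logistic)."""
        p_detect = 1.0 / (1.0 + math.exp(-(snr_db - threshold_db) * slope))
        p_correct = 0.5 + 0.5 * p_detect          # guessing floor at 50%
        return random.random() < p_correct

    def run_staircase(start_snr_db=0.0, step_db=2.0, n_reversals=12):
        snr = start_snr_db
        correct_run = 0
        direction = None
        reversals = []
        while len(reversals) < n_reversals:
            if simulated_listener(snr):
                correct_run += 1
                if correct_run == 3:               # 3-down: make it harder
                    correct_run = 0
                    if direction == 'up':
                        reversals.append(snr)
                    direction = 'down'
                    snr -= step_db
            else:                                   # 1-up: make it easier
                correct_run = 0
                if direction == 'down':
                    reversals.append(snr)
                direction = 'up'
                snr += step_db
        # Threshold estimate: mean SNR over the last few reversals
        return sum(reversals[-8:]) / len(reversals[-8:])

    print("Estimated 79%%-correct SNR: %.1f dB" % run_staircase())

In the real experiment the two intervals contain speech-plus-noise and noise alone; the logistic listener above simply stands in for a subject.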
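The lip-opening analysis mentioned above can be sketched in the same spirit: correlate a frame-by-frame lip-aperture track with the rms amplitude envelope of the speech, broadband and in a nominal F2 band. The band edges (1-2.5 kHz), the 50 Hz frame rate, and the function names are illustrative assumptions, not the values used in our analysis.

    # Sketch of a sentence-by-sentence analysis: correlate the time course of
    # lip opening with the rms amplitude envelope of the speech, broadband and
    # in a nominal F2 region. Band edges and frame rate are assumptions.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def rms_envelope(x, fs, frame_rate=50.0):
        """Frame-by-frame rms amplitude at the (assumed) video frame rate."""
        hop = int(fs / frame_rate)
        n_frames = len(x) // hop
        return np.array([np.sqrt(np.mean(x[i*hop:(i+1)*hop] ** 2))
                         for i in range(n_frames)])

    def band_envelope(x, fs, lo=1000.0, hi=2500.0, frame_rate=50.0):
        """rms envelope after band-passing into the (assumed) F2 region."""
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        return rms_envelope(sosfiltfilt(sos, x), fs, frame_rate)

    def lip_speech_correlation(lip_opening, speech, fs):
        """Pearson r between lip aperture (sampled at the same frame rate)
        and the broadband / F2-band speech envelopes."""
        broad = rms_envelope(speech, fs)
        f2 = band_envelope(speech, fs)
        n = min(len(lip_opening), len(broad), len(f2))
        r_broad = np.corrcoef(lip_opening[:n], broad[:n])[0, 1]
        r_f2 = np.corrcoef(lip_opening[:n], f2[:n])[0, 1]
        return r_broad, r_f2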
Now, why did the Repp et al. study fail to see these results?

First, their equipment was incapable of precise (within 1-3 ms) acoustic-visual alignment and allowed as much as 100 ms of desynchronization across the modalities. If simultaneous comodulation of sensory information across the senses is important for this effect to occur, then a misalignment of the A and V components will weaken the effect.

Second, and perhaps most important, Repp et al. used a speech-modulated noise as the masker. It is well known that lipreading plus speech-modulated noise leads to improved speech intelligibility (over speechreading alone), and that speech-modulated noise carries many speech cues by itself, capable of informing subjects about phonetic features at levels well above chance. Therefore, when the Repp et al. subjects saw a moving face accompanied by a noise-alone trial, they naturally heard speech (the bias effect), because the noise was indeed speech-like in many respects. In our study we use a noise whose modulation properties differ from those of the visual and acoustic signals, whereas the visual and acoustic signals share common modulation properties. This is an essential characteristic of all CMR studies and, more recently, of the coherence masking protection (CMP) described by Peter Gordon.

Third, and finally, Repp et al. used disyllabic words with similar stress patterns, whereas our experiment used sentences. The shorter stimuli create similar temporal expectations as to when in the utterance the detection will occur, whereas the longer, more diverse sentences create greater temporal uncertainty as to when in the listening interval the detection will occur. That uncertainty is alleviated to varying degrees by the visual information, thereby reducing thresholds for detection.

Several variants of this experiment have been proposed in a new grant submitted to the McDonnell-Pew Foundation, in collaboration with brain imaging and modeling studies conducted at UCSF and UC-Berkeley.
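As a footnote to the masker point above, the difference between a flat white noise and a speech-modulated (signal-correlated) noise is easy to make concrete: the latter inherits the temporal envelope of the speech and therefore sounds speech-like even though its fine structure is noise. A minimal sketch, with an illustrative 30 Hz envelope cutoff that is not taken from either study:

    # Contrast the two masker types discussed above: flat white noise versus
    # speech-modulated (signal-correlated) noise, which carries the temporal
    # envelope of the speech. The 30 Hz envelope cutoff is an assumption.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def flat_noise(n_samples, rms=0.1):
        """White noise with no speech-related modulation."""
        noise = np.random.randn(n_samples)
        return rms * noise / np.sqrt(np.mean(noise ** 2))

    def speech_modulated_noise(speech, fs, env_cutoff_hz=30.0):
        """White noise multiplied by the low-pass-smoothed speech envelope,
        then scaled to match the overall rms of the original speech."""
        sos = butter(4, env_cutoff_hz, btype='lowpass', fs=fs, output='sos')
        envelope = sosfiltfilt(sos, np.abs(speech))
        envelope = np.clip(envelope, 0.0, None)   # remove filter ripple below zero
        scn = envelope * np.random.randn(len(speech))
        return scn * np.sqrt(np.mean(speech ** 2) / np.mean(scn ** 2))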

