
Re: Speech and bottom-up processes



Bruno H. Repp wrote:

> Repp, B. H., Frost, R., & Zsiga, E. (1992). Lexical mediation between
> sight and sound in speechreading. The Quarterly Journal of Experimental
> Psychology, 45A, 1-20.
>
>         There we showed that the detectability of the presence of speech
> in noise was not influenced by simultaneous visual exposure to matching or
> nonmatching articulations (lipreading), which again implies an autonomous
> auditory bottom-up process for speech detection. The lexical status of
> the stimuli (words or nonwords) affected response bias, but not sensitivity.
> A similar finding for printed words as the visual stimuli, rather than
> lipreading, was reported by
>
> Frost, R., Repp, B. H., & Katz, L. (1988). Can speech perception be
> influenced by simultaneous presentation of print? Journal of Memory and
> Language, 27, 741-755.
>
>         All three studies provide evidence that the presence of speech
> can be detected on the basis of acoustic cues long before it is even
> partially recognized. The detection threshold (70% correct, with 50%
> being chance) for speech in broadband noise is about -28 dB of S/N ratio,
> and that in amplitude-modulated (signal-correlated) noise is about
> -13 dB (Repp & Frost, 1988). Both thresholds are well below the
> intelligibility thresholds.

With all due respect to Bruno, Frost, and Zsiga, we have begun to replicate
these studies of the possible influence of simultaneous lip movements on
auditory detection, because the original Repp et al study had serious problems
in several respects. First, let me tell you the outcome before I explain why
Repp et al failed to find the "correct" result. We are running an adaptive speech
detection experiment with three different sentence targets. The sentences are
presented in a background of white noise under three conditions: auditory alone,
auditory plus simultaneously presented matching lipread information, and
auditory plus simultaneously presented mismatched lipread information. The task
is a two-interval forced-choice procedure in which the subjects have to
indicate the interval that contains the speech plus noise. We are using a
3-down, 1-up adaptive procedure tracking the 79% point on the psychometric
function (a minimal sketch of such a track appears after this paragraph). The
speech level is held constant, whereas the noise level is controlled by the
adaptive track.
The results show a 1-4 dB release from masking, or bimodal coherence masking
protection, depending on the sentence. We are currently looking into the
correlation, on a sentence-by-sentence basis, between the time course of lip
opening and the rms amplitude fluctuations in the speech signal, both broadband
and in selected spectral bands (especially the F2 region), as an explanation
for the differences across sentences (a rough sketch of that analysis also
follows below). These results indicate that cross-modal comodulation between
visual and acoustic signals can reduce stimulus uncertainty in auditory
detection and thereby lower detection thresholds. The results will be reported
at the upcoming ASA meeting in Seattle (June).
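
For those not familiar with the tracking procedure, here is a minimal sketch
(in Python; this is not our actual experiment code, and the starting S/N, step
size, and number of reversals are arbitrary illustration values) of how a
3-down, 1-up track converging on roughly the 79% point can be run. The
present_trial function stands in for whatever plays the two intervals (noise
alone vs. speech plus noise at the given S/N, with or without the video) and
collects the listener's response.

import random

def run_track(present_trial, start_snr_db=0.0, step_db=2.0, n_reversals=12):
    # Minimal 3-down, 1-up adaptive track for a 2IFC detection task.
    # present_trial(snr_db) runs one two-interval trial at the given S/N
    # and returns True if the subject picked the speech-plus-noise interval.
    # The speech level stays fixed; only the noise level (hence S/N) moves.
    snr_db = start_snr_db
    n_correct = 0
    last_step = None              # 'down' = harder, 'up' = easier
    reversals = []
    while len(reversals) < n_reversals:
        if present_trial(snr_db):
            n_correct += 1
            if n_correct == 3:    # three correct in a row -> lower the S/N
                n_correct = 0
                if last_step == 'up':
                    reversals.append(snr_db)
                last_step = 'down'
                snr_db -= step_db
        else:                     # any error -> raise the S/N
            n_correct = 0
            if last_step == 'down':
                reversals.append(snr_db)
            last_step = 'up'
            snr_db += step_db
    # Average the S/N at the last few reversals as the threshold estimate;
    # a 3-down, 1-up rule converges on about the 79% point.
    return sum(reversals[-8:]) / len(reversals[-8:])

# Toy usage: a simulated listener whose accuracy falls as the S/N drops.
if __name__ == '__main__':
    def fake_listener(snr_db):
        p_correct = min(1.0, max(0.5, 0.79 + 0.02 * (snr_db + 20.0)))
        return random.random() < p_correct
    print(run_track(fake_listener, start_snr_db=-10.0))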
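
And to make concrete what the lip-opening/amplitude analysis involves, here is
a rough sketch (again in Python, using numpy and scipy; the F2 band edges,
filter order, and video frame rate are illustrative assumptions, not the
values from our analysis):

import numpy as np
from scipy.signal import butter, filtfilt

def rms_envelope(x, fs, frame_s=1.0 / 30.0):
    # Frame-by-frame rms amplitude, with the frame rate matched to the video.
    flen = int(round(fs * frame_s))
    nframes = len(x) // flen
    frames = x[:nframes * flen].reshape(nframes, flen)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def band_pass(x, fs, lo, hi, order=4):
    # Zero-phase band-pass filter, e.g. for a nominal F2 region.
    b, a = butter(order, [lo / (fs / 2.0), hi / (fs / 2.0)], btype='band')
    return filtfilt(b, a, x)

def lip_envelope_correlation(speech, fs, lip_opening,
                             f2_lo=900.0, f2_hi=2500.0):
    # Correlate the per-video-frame lip-opening measure with the broadband
    # rms envelope and with the rms envelope in an F2-region band.
    env_broad = rms_envelope(speech, fs)
    env_f2 = rms_envelope(band_pass(speech, fs, f2_lo, f2_hi), fs)
    n = min(len(lip_opening), len(env_broad))
    r_broad = np.corrcoef(lip_opening[:n], env_broad[:n])[0, 1]
    r_f2 = np.corrcoef(lip_opening[:n], env_f2[:n])[0, 1]
    return r_broad, r_f2

On this account, the sentences whose lip-opening traces correlate best with
the acoustic envelope should be the ones that show the larger masking
protection.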

Now, why did the Repp et al study fail to see these results? First, their
equipment was incapable of precise (within 1-3 ms) acoustic-visual alignments
and allowed for as much as 100 ms desynchronization across the modalities. If
simultaneous comodulation of sensory information across the senses is important
for this effect to occur, then a misalignment of the A and V components will
weaken the effect. Second, and perhaps most important, Repp et al used a
speech-modulated noise as the masker. It is well known that lipreading plus
speech-modulated noise leads to improved speech intelligibility (over
speechreading alone) and that speech-modulated noise has many speech cues by
itself, capable of informing subjects about phonetic features at levels well
above chance. Therefore, when the Repp et al subjects saw a moving face
accompanied by a noise-alone trial, they naturally heard speech (the bias
effect), because the noise was indeed speech-like in many respects (one common
recipe for such noise is sketched at the end of this message). In our study we
use a noise whose
modulation properties differ from the visual and acoustic signals, whereas the
visual and acoustic signals share common modulation properties. This is an
essential characteristic of all CMR (comodulation masking release) studies
and, more recently, of the coherence masking protection (CMP) described by
Peter Gordon. And third (and finally),
Repp et al used disyllabic words with similar stress patterns, whereas our
experiment used sentences. The short stimuli create similar temporal
expectations about when in the utterance the detection will occur, whereas the
longer, more diverse sentences create greater temporal uncertainty about when
in the listening interval the target will occur. That temporal uncertainty is
alleviated to varying degrees by the visual information, which reduces the
thresholds for detection.
Several variants of this experiment have been proposed in a new grant submitted
to the McDonnell-Pew Foundation in collaboration with brain imaging and modeling
studies conducted at UCSF and UC-Berkeley.
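
Regarding the masker point above: one common way to make signal-correlated
(speech-modulated) noise (I do not know the exact recipe Repp et al used) is
simply to randomize the sign of every speech sample, which preserves the
amplitude envelope while flattening the spectrum. A sketch in Python:

import numpy as np

def signal_correlated_noise(speech, seed=None):
    # Speech-modulated ("signal-correlated") noise: multiply each speech
    # sample by a random +1 or -1.  The result keeps the speech's amplitude
    # envelope and gross temporal structure but has a noise-like spectrum,
    # so it still carries speech-like cues on its own.
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=len(speech))
    return speech * signs

A masker made this way shares the speech's modulation pattern, which is
exactly what a comodulation-based account says must be avoided if the visual
signal is to provide masking protection.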