letter to ASA workshop people
Dear ASA workshop people
Over the last few years I have been working on the problem of
temporal and rhythmic processing in the central auditory
system. Central to this work is the goal of discovering how
time and temporal events are represented.
(See for example
Todd, N.P. "The auditory primal sketch" Journal of New Music Research
23(1), 25-70. (1994)
Todd, N.P. and Brown, G. "Visualization of rhythm, time and metre" Artificial
Intelligence Review (in press) (1995))
Recently I have applied this work to the problem of Auditory Streaming and
would very much have liked to present it to you at
the forthcoming ASA workshop at Montreal later this month.
Unfortunately, for various reasons, this has not been possible.
However, in order that you might get some flavour of this I
enclose below a "compact" description (without figures) of
the work. If you would like a hard copy (including figures) let
me know.
Have fun at the workshop!
Neil Todd
Lecturer in Psychology
Dept of Psychology
University of Manchester
Manchester M13 9PL
UK
tel. 0161 275 2585
fax. 0161 275 2588
***********************************************************
A MODEL OF AUDITORY STREAMING
A well documented phenomenon in hearing research is that
whereby a sequence of tones is heard to splinter into two or
more "streams" depending on the proximity in time and frequency
of the tones [1]. This and other related phenomena have been
the basis for intense research activity for the last few years
[2] since they provide a key to the workings of the auditory
system. Here a computational model is proposed which is able to
account for streaming. This model differs from other approaches
[3,4] in that it is explicitly based on a theory of
neurophysiological processes in the auditory cortex [5].
Specifically, the first stage of the model simulates cochlear
filtering into a number of frequency channels, each of which is
independently processed by a population of cells which
effectively act as band-pass filters of amplitude modulation
(AM) in the range 2 Hz to 100 Hz. A cross-correlational
gating mechanism groups those channels with coherent modulation
spectra.
It has been recognised for some time that AM coherence over a
range of modulation frequencies is a powerful cue for spectral
grouping [6,7]. Physiological studies have provided evidence
that the neural basis for grouping by AM coherence may lie in
both cortical [5] and sub-cortical [8] populations of cells
which have a band-pass AM characteristic [5]. A study by
Schreiner and Urbas [5] showed that 88% of units in all major
auditory cortical fields including AI, AII, AAF, PAF have
band-pass type responses with a range of best modulation
frequencies (BMFs) from 2 Hz up to about 100 Hz (about 12%
showed a low-pass response). Some fields appeared to show an
approximately orthogonal tonotopic/AM arrangement whilst others
had no apparent tonotopicity or orthogonality. The model
proposed here is an explicit attempt to simulate the behaviour
of these band-pass AM populations in the auditory cortex. In
order to do this it was assumed that for each tonotopically
organised channel there exists an independent subpopulation of
AM band-pass cells which may be described as an approximately
constant-Q, logarithmically spaced AM filter-bank.
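The logarithmic spacing of BMFs can be sketched in a few lines of Python. This is only an illustration: the function name `bmf_grid` and its arguments are hypothetical, and the choice to include the lower edge of the range but not the upper one is an assumption made so that a 2-64 Hz range at 8 cells per octave yields exactly 40 cells, as quoted in the Figure 1 caption.

```python
import numpy as np

def bmf_grid(f_low_hz=2.0, n_octaves=5, per_octave=8):
    """Return logarithmically spaced best modulation frequencies in Hz.

    Assumption: the lower edge is included and the upper edge is excluded,
    so the grid holds n_octaves * per_octave values.
    """
    k = np.arange(n_octaves * per_octave)
    return f_low_hz * 2.0 ** (k / per_octave)

bmfs = bmf_grid()                 # 2-64 Hz at 8 per octave
ratios = bmfs[1:] / bmfs[:-1]     # constant ratio between neighbouring BMFs
```

With a constant damping ratio, each cell's bandwidth scales with its BMF, which is what makes such a grid approximately constant-Q.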
The impulse response $h_k$ of an individual cell is an
exponentially damped cosinusoid
$$
h_k(t) = {\rm exp} [-\zeta\omega_k t] \, {\rm cos} \Bigl( \omega_k \sqrt {1 -
\zeta^2} \, t \Bigr)
$$
where $\omega_k $ is the cell's BMF and $\zeta$ is a damping
ratio, assumed to be constant. The output of each cell $s_k(t)$
may be computed by convolution of the impulse response $h_k$
with the cell input which is assumed to be the cochlear nerve
firing rate $r(t)$. Thus, for the $i$th cochlear channel the
$k$th cell output is given by
$$
s_{ik}(t) = h_{ik}(t) * r_i(t) = \int_{0}^{\infty} h_{ik}(t')
r_i(t-t') dt'
$$
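A minimal discrete-time sketch of the damped-cosine impulse response and the convolution above is given here. It rests on several assumptions that are not part of the model description: the BMF is supplied in Hz and converted to an angular frequency, the integral is approximated by a sampled sum scaled by the sampling interval, and the input is a toy rectangular pulse train rather than the output of a real hair-cell model. The function names are hypothetical.

```python
import numpy as np

def impulse_response(bmf_hz, zeta, fs, n_samples):
    """Sampled exponentially damped cosinusoid h_k(t)."""
    t = np.arange(n_samples) / fs
    omega = 2.0 * np.pi * bmf_hz  # assumption: BMF in Hz, converted to rad/s
    return np.exp(-zeta * omega * t) * np.cos(omega * np.sqrt(1.0 - zeta**2) * t)

def cell_output(rate, bmf_hz, zeta, fs):
    """Causal convolution of h_k with the channel input r_i(t)."""
    rate = np.asarray(rate, dtype=float)
    h = impulse_response(bmf_hz, zeta, fs, len(rate))
    # Keep only the first len(rate) samples; dividing by fs approximates dt'.
    return np.convolve(rate, h)[:len(rate)] / fs

# Toy stand-in for r_i(t): 40 ms bursts every 200 ms for 3 s (not a hair-cell model).
fs = 1000.0
t = np.arange(int(3.0 * fs)) / fs
rate = ((t % 0.2) < 0.04).astype(float)

zeta = 1.0 / 32.0                            # assumed damping ratio
s_5hz = cell_output(rate, 5.0, zeta, fs)     # BMF at the 5 Hz repetition rate
s_7hz = cell_output(rate, 7.0, zeta, fs)     # BMF between the 5 and 10 Hz harmonics
```

In this toy setting the cell tuned to the repetition rate builds up a much larger response than the cell tuned between harmonics, which is the behaviour the AM spectra rely on.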
The inputs $r_i(t)$ to the AM filter-banks were derived from a
model of the auditory periphery implemented by the Sheffield
Auditory Group [9]. This consists of an interpolated look-up
table for the outer-middle ear transfer function, a gammatone
filter-bank [10] to simulate the basilar membrane and an inner
hair-cell model [11] which gives an estimate of the auditory
nerve firing rate probability. The cochlear filters are spaced
on the ERB rate scale [12]. In total therefore, there are $nm$
cells where $n$ is the number of cochlear channels and $m$ is
the number of AM channels. When the system is presented with a
periodic stimulus such as an isochronous sequence of tone
bursts, the cortical cells give rise to a response pattern as
illustrated in Fig 1.
FIG 1 HERE
::::::::::::::::::::::::::::::::::::::(start of caption)
Figure 1. Temporal Integration. The output of cortical cells in
response to a sequence of 40 ms 1 kHz tone bursts with a
repetition period of 200 ms (a) after 1 event (0.04s) (b)
after 2 events (0.24s) (c) 3 events (0.44s) (d) 4 events
(0.64s) (e) 8 events (1.44s) (f) 16 events (3.04s). There are
1000 cells in total where m=40 (range 2-64 Hz at 8 per octave)
and n =25 (range 0.5-2 kHz). The response pattern is a set of
harmonic series with fundamentals at 5 Hz for those cochlear
channels within a critical band of 1 kHz. The AM spectra
develop over time but reach an asymptote depending on the
damping ratio (Q=16 here) of the cell impulse response and thus
may be considered to be a form of temporal integration [13,14].
Note that a harmonic series forms even after just two events,
thus representing a single time interval. The position of the
maxima of the fundamental gives a representation of the time
interval. Consistent with the psychophysics of temporal
judgements [15,16], the cortical-time slightly over-estimates
real-time. Indeed, if presented with a single continuous tone of
the same duration, this over-estimation is considerably larger
and may account for the so-called filled interval illusion.
However, as more events arrive the spectra sharpen-up and the
over-estimation is reduced. If the input stimulus is
discontinued the AM spectra decay with a time-constant
dependent on both the damping and the BMF of the cell. The cell
population as a whole may thus be considered to be one form of
sensory memory. Since the damping ratio is constant the decay
time will be longer for lower modulation frequencies.
:::::::::::::::::::::::::::::::::::::::(end of caption)
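The dependence of decay time on BMF noted in the Figure 1 caption can be made concrete with a short worked relation. This is a sketch under two assumptions not stated in the text: that the BMF is an angular frequency $\omega_k = 2\pi f_k$ with $f_k$ in Hz, and that the quoted Q is related to the damping ratio by the usual second-order convention $Q = 1/(2\zeta)$. The envelope of $h_k$ is ${\rm exp}[-\zeta\omega_k t]$, so the decay time-constant $\tau_k$ of the $k$th cell would be
$$
\tau_k = {1 \over \zeta\omega_k} = {2Q \over 2\pi f_k} = {Q \over \pi f_k}
$$
Under these assumptions, $Q=16$ gives $\tau_k \approx 1.02$ s for a cell at 5 Hz, about 2.55 s at 2 Hz and about 0.08 s at 64 Hz, so the lowest-BMF cells would hold their response roughly thirty times longer than the highest.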
The cross-correlation gating mechanism was simulated by
computing a product moment correlation $\rho_{ij}$ between all
pairs of AM spectra $s_{ik}$ and $s_{jk}$ so that
$$
\rho_{ij} = { m \sum_k s_{ik}(t) s_{jk}(t) - \sum_k s_{ik}(t)
\sum_k s_{jk}(t)
\over
\sqrt{m\sum_k s_{ik}(t)^2-\Bigl(\sum_k s_{ik}(t) \Bigr)^2}
\sqrt{ m\sum_k s_{jk}(t)^2-\Bigl(\sum_k s_{jk}(t) \Bigr)^2} }
$$
which is updated when each new tone burst event is detected.
This has the advantage that the correlation is not explicitly
carried out in time since time is implicit in the spectra. The
event detection is carried out by a parallel cortical mechanism
based on the smaller population of cells which have a low-pass,
assumed approximately Gaussian impulse response. Whilst it is
beyond the scope of this letter to describe this in any detail,
briefly this mechanism is analogous to the Marr-Hildreth theory of
edge detection [17,18,19,20,21,22] except that the multi-scale
analysis is done in time rather than visual displacement.
Since the low-pass responses are causal they have a finite
delay, which is a function of their time-constant, and so the
population as a whole acts as a sensory memory since it has a
distribution of time-constants. Thus both the low-pass and
band-pass type cell distributions have a sensory memory
function. The low-pass mechanism appears, at least
qualitatively, to account for the phenomenon of backward
masking [21] which is central to rhythmic perception.
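The product-moment correlation between channel AM spectra can be sketched with NumPy as follows. The function name `am_correlation` and the array layout (one row per cochlear channel, one column per AM cell, all sampled at the same instant) are assumptions made for illustration; the arithmetic is the standard Pearson form with sums taken over the $m$ AM cells.

```python
import numpy as np

def am_correlation(s):
    """Return the n x n matrix of correlations between rows of s.

    s has shape (n_channels, m_cells): the AM spectrum of each cochlear
    channel at one instant. Rows with zero variance give a zero
    denominator and are not handled in this sketch.
    """
    s = np.asarray(s, dtype=float)
    m = s.shape[1]
    sums = s.sum(axis=1)                               # sum_k s_ik
    cross = m * (s @ s.T) - np.outer(sums, sums)       # numerator for every pair (i, j)
    spread = np.sqrt(m * (s**2).sum(axis=1) - sums**2) # one square-root factor per channel
    return cross / np.outer(spread, spread)

rng = np.random.default_rng(0)
spectra = rng.random((4, 40))   # placeholder data, not model output
rho = am_correlation(spectra)
```

Because only the current spectra enter the calculation, the matrix can be recomputed at each detected event without storing the time history.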
FIGS 2 AND 3 HERE
::::::::::::::::::::::::::(start of captions):::::::::
Figure 2. Grouping of cochlear channels by frequency proximity.
The output of the cortical cells, (a), (c), (e) and (g), and
their corresponding cross-correlation matrices (b), (d), (f)
and (h), in response to a 2 second presentation of an ABA-ABA-
sequence of 40 ms bursts where the AB inter-onset interval is
100 ms. There are 1000 cells in total where m=40 (range 2-32 Hz
at 10 per octave) and n =25 (range 0.5-4 kHz). Four versions
are presented with A fixed at 1 kHz but where B=1 kHz (0
semitones) ((a) and (b)); 1.260 kHz (4 semitones) ((c) and
(d)); 1.587 kHz (8 semitones) ((e) and (f)); 2 kHz (12
semitones) ((g) and (h)). When A and B are the same frequency a
galloping rhythm is heard. However, as the B tone is moved away
from the A tone in frequency the galloping rhythm disappears
and two separate streams are heard: an A-A-A- and a
B---B---B---. The point at which it is not possible to hear the
gallop is the temporal coherence boundary (highly variable as
a function of repetition interval) whilst the point below which it
is not possible to hear streams is the fission boundary (relatively
fixed between 3-4 semitones). In the region between these two
boundaries it is possible to hear either streaming or the
galloping rhythm depending on attentional set. The
cross-correlation matrix has clearly captured the effect of
frequency proximity. Neighbourhoods of high cross-correlation
indicate fusion. When fission takes place two neighbourhoods
are clearly separated. However, in the ambiguous region the two
neighbourhoods are still attached.
Figure 3. Grouping of cochlear channels by temporal proximity.
The output of the cortical cells, (a), (c), (e) and (g), and
their corresponding cross-correlation matrices (b), (d), (f)
and (h), in response to a 5 second presentation of an ABA-ABA-
sequence of 40 ms bursts. There are 1000 cells in total where
m=40 (range 2-32 Hz at 10 per octave) and n =25 (range 0.5-4
kHz). Four versions are presented where both A and B are fixed
at A=1 kHz and B = 1.260 kHz (4 semitones) but where the
repetition interval is varied: (a), (b) 240 ms; (c), (d) 180
ms; (e), (f) 120 ms; (g), (h) 60 ms. A frequency difference
of 4 semitones is very close to the fission boundary, which
varies slightly with temporal proximity. However, the effect
of temporal proximity is clear in the cross-correlation
matrices. The degree of fissure between the two groups of
cochlear channels is increased with increasing repetition rate.
This change in the degree of fissure is a function of two main
factors. First, if the repetition rate is less than the lowest BMF
in the cell distribution then one or both of the fundamentals
will not be represented and only its harmonics will be
available for cross-channel comparison, some of which may
coincide, thus reducing sensitivity. Secondly, at higher
repetition rates the harmonics will be more separated thus also
increasing sensitivity. The exact value of the cut off in the
distribution is not clear though Schreiner's data suggest about
2 Hz.
:::::::::::::::::::(end of captions)::::::::::::::::::::
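The stimulus parameters quoted in the captions for Figures 2 and 3 can be reproduced with a short sketch. The function names are hypothetical, the semitone conversion assumes equal temperament, and the onset layout assumes that the "-" in ABA- is a silent slot of the same length as the A-B inter-onset interval.

```python
def semitones_to_hz(f_ref_hz, semitones):
    """Frequency lying the given number of equal-tempered semitones above f_ref_hz."""
    return f_ref_hz * 2.0 ** (semitones / 12.0)

def aba_onsets(n_triplets, interval_ms):
    """Onset times in ms of the A and B bursts in an ABA-ABA- sequence.

    Assumption: each ABA- cycle spans four equal slots, the fourth silent.
    """
    a_onsets, b_onsets = [], []
    for k in range(n_triplets):
        start = 4 * k * interval_ms
        a_onsets += [start, start + 2 * interval_ms]
        b_onsets.append(start + interval_ms)
    return a_onsets, b_onsets

b_freqs = [semitones_to_hz(1000.0, st) for st in (0, 4, 8, 12)]
a_on, b_on = aba_onsets(2, 100)
```

On this layout the A bursts recur every two slots and the B bursts every four, which gives the A-A-A- and B---B--- streams their different repetition rates.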
In order to test the theory, the system was presented with
sequences of tone bursts arranged in a repetitive ABA-ABA
pattern as had been used in a seminal experiment to
investigate streaming by Van Noorden [1]. Van Noorden's
experiment showed that streaming could be affected by three
main factors: frequency proximity, temporal proximity and the
number of repetitions of the ABA pattern. It is clear from Fig
2 that the cross-correlated AM mechanism provides a good
account of the grouping of cochlear channels by frequency
proximity. If two channels are separated by more than about a
critical band and their rhythms are either incoherent or out of
phase then they will no longer fuse. The cross-correlated AM
mechanism clearly also accounts for the grouping of cochlear
channels by temporal proximity (see Figure 3) which is mainly
determined by the distribution of the cell BMFs. Similarly,
the temporal development of streaming (see Figure 4) may be
accounted for by the degree of damping in the cell impulse
response.
FIG 4 HERE
:::::::::::::::(start of caption):::::::::::::::::::::
Figure 4. Temporal development of streaming. The output of the
cortical cells, (a), (c), (e) and (g), and their
corresponding cross-correlation matrices (b), (d), (f) and (h),
in response to an ABA-ABA- sequence of 40 ms bursts. There are
1000 cells in total where m=40 (range 2-32 Hz at 10 per octave)
and n =25 (range 0.5-4 kHz). Four versions are presented where
both A and B are fixed at A=1 kHz and B = 1.260 kHz (4
semitones), the interval is fixed at 60 ms (cf. Fig 3g,h) but
where the presentation length is varied: (a), (b) one ABA
(0.18 seconds); (c), (d) 2 ABAs (0.42 seconds); (e), (f) 4 ABAs
(0.9 seconds); (g),(h) 8 ABAs (1.66 seconds). The
cross-correlation matrices show a clear increase in the degree
of fissure between the two fused neighbourhoods as a function
of the number of ABA repetitions. This behaviour is a function
of the decay time of the cells and clearly relates to the
phenomenon of temporal integration as in Fig 1.
:::::::::::::(end of caption)::::::::::::::::::::::::::::
In conclusion then, the model proposed here is able to account
for all the main features of auditory streaming and is
consistent with both the psychoacoustical and
neurophysiological data. In a current development of this
theory the AM coherence mechanism is used to distribute or gate
the cochlear nerve inputs to disjoint subpopulations of the
low-pass sensory memory mechanism. In this way it should
account for the disappearance of the galloping ABA rhythm since
backward masking will only take place within a single
subpopulation. There still remains some uncertainty about the
exact values of the parameters and clearly a constant-Q
logarithmic AM cell distribution is an approximation. However,
with further testing of the model it is believed that these
uncertainties may be resolved. Finally, it is believed that a
similar mechanism may also operate with sub-cortical AM cells
(e.g. in the inferior colliculus) whose BMFs are in the
periodicity pitch range.
References
1. Van Noorden, L.P.A.S. PhD Thesis. Eindhoven:IPO. (1975)
2. Bregman, A.S. Auditory Scene Analysis. Camb, MA: MIT Press.
(1990)
3. Beauvois, M.W. and Meddis, R. Q.J.Exp.Psychol. 43, 517-541.
(1991)
4. Brown, G. and Cooke, M. Proc. International Joint
Conference on Computational Auditory Scene Analysis. Montreal.
(1995)
5. Schreiner, C.E. and Urbas, J.V. Hear. Res. 32, 49-64.
(1988)
6. Bregman, A.S, Abramson, J., Doering, P. and Darwin, C.J.
Percept. and Psychophys. 37, 483-493. (1985)
7. Sheft, S. and Yost, W.A. J. Acoust. Soc. Am. 92, 2361.
(1992)
8. Schreiner, C.E. and Langner, G. J. Neurophysiol. 51,
1823-1840. (1988)
9. Brown, G. and Cooke, M. Comp. Speech and Lang. 93,
2870-2878. (1994)
10. Patterson, R., Nimmo-Smith, I., Holdsworth, J. and Rice, P.
Cambridge: APU Report 2341. (1988)
11. Meddis, R. J. Acoust. Soc. Am. 79(3), 702-711. (1986)
12. Glasberg, B. and Moore, B. Hear. Res. 47, 103-138. (1990)
13. Todd, N.P. McAngus J. Acoust. Soc. Am. 96(5), Pt. 2, 3258.
(1994)
14. Todd, N.P. McAngus B. J. Audiol. 28, (1994)
15. Woodrow, H. J. Exp. Psychol. 17, 167-152. (1934).
16. Drake, C. and Botte, M. Percept. and Psychophys. 54 (3),
277-286. (1994)
17. Marr, D. and Hildreth, E. Proc. R. Soc. Lond B 200,
269-294. (1980)
18. Todd, N.P. McAngus J. Acoust. Soc. Am. 92(4), Pt. 2, 2380.
(1992)
19. Todd, N.P. McAngus J. Acoust. Soc. Am. 93(4), Pt. 2, 2363.
(1993)
20. Todd, N.P. McAngus J. Acoust. Soc. Am. 96(5), Pt. 2, 3301.
(1994)
21. Todd, N.P. McAngus J. New Mus. Res. 23(1), 25-70. (1994)
22. Todd, N.P. McAngus J. Acoust. Soc. Am. 97(3), 1940-1949.
(1995)