Subject: letter to ASA workshop people
From: Neil Todd <todd(at)HERA.PSY.MAN.AC.UK>
Date: Fri, 4 Aug 1995 10:49:21 +0100

Dear ASA workshop people,

Over the last few years I have been working on the problem of temporal and rhythmic processing in the central auditory system. Central to this work is the goal of discovering how time and temporal events are represented. (See for example Todd, N.P. "The auditory primal sketch" Journal of New Music Research 23(1), 25-70 (1994); Todd, N.P. and Brown, G. "Visualization of rhythm, time and metre" Artificial Intelligence Review (in press, 1995).) Recently I have applied this work to the problem of auditory streaming and would very much have liked to present it to you at the forthcoming ASA workshop in Montreal later this month. Unfortunately, for various reasons, this has not been possible. However, so that you might get some flavour of the work, I enclose below a "compact" description (without figures). If you would like a hard copy (including figures) let me know.

Have fun at the workshop!

Neil Todd
Lecturer in Psychology
Dept of Psychology
University of Manchester
Manchester M13 9PL UK
tel. 0161 275 2585
fax. 0161 275 2588

***********************************************************

A MODEL OF AUDITORY STREAMING

A well documented phenomenon in hearing research is that whereby a sequence of tones is heard to splinter into two or more "streams" depending on the proximity in time and frequency of the tones [1]. This and other related phenomena have been the basis of intense research activity over the last few years [2], since they provide a key to the workings of the auditory system. Here a computational model is proposed which is able to account for streaming. The model differs from other approaches [3,4] in that it is explicitly based on a theory of neurophysiological processes in the auditory cortex [5]. Specifically, the first stage of the model simulates cochlear filtering into a number of frequency channels, each of which is independently processed by a population of cells which effectively act as band-pass filters of amplitude modulation (AM) in the range 2 Hz to 100 Hz. A cross-correlational gating mechanism then groups those channels with coherent modulation spectra.

It has been recognised for some time that AM coherence over a range of modulation frequencies is a powerful cue for spectral grouping [6,7]. Physiological studies have provided evidence that the neural basis for grouping by AM coherence may lie in both cortical [5] and sub-cortical [8] populations of cells which have a band-pass AM characteristic. A study by Schreiner and Urbas [5] showed that 88% of units in all major auditory cortical fields, including AI, AII, AAF and PAF, have band-pass type responses with a range of best modulation frequencies (BMFs) from 2 Hz up to about 100 Hz (about 12% showed a low-pass response). Some fields appeared to show an approximately orthogonal tonotopic/AM arrangement whilst others had no apparent tonotopicity or orthogonality.

The model proposed here is an explicit attempt to simulate the behaviour of these band-pass AM populations in the auditory cortex. In order to do this it was assumed that for each tonotopically organised channel there exists an independent subpopulation of AM band-pass cells which may be described as an approximately constant-Q, logarithmically spaced AM filter-bank.
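As a concrete illustration of this layout, the following is a minimal sketch (in Python/NumPy; not the author's implementation, and the function name and endpoint convention are my assumptions) of such a constant-Q logarithmically spaced grid of BMFs, using the 2-64 Hz, 8-per-octave values quoted for Fig 1 below:

    import numpy as np

    def bmf_grid(f_lo=2.0, f_hi=64.0, per_octave=8):
        """Logarithmically spaced best modulation frequencies (BMFs)."""
        n_octaves = np.log2(f_hi / f_lo)            # 5 octaves for 2-64 Hz
        m = int(round(n_octaves * per_octave)) + 1
        return f_lo * 2.0 ** (np.arange(m) / per_octave)

    bmfs = bmf_grid()   # 2.0, 2.18, 2.38, ..., 64.0 Hz (41 values here;
                        # the letter quotes m = 40, so the endpoint
                        # handling is a guess)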
The impulse response $h_k$ of an individual cell is an exponentially damped cosinusoid $$ h_k(t) = {\rm exp} [-\zeta\omega_k t]\, {\rm cos}\, \omega_k \sqrt{1 - \zeta^2}\, t $$ where $\omega_k$ is the cell's BMF and $\zeta$ is a damping ratio, assumed to be constant. The output of each cell $s_k(t)$ may be computed by convolution of the impulse response $h_k$ with the cell input, which is assumed to be the cochlear nerve firing rate $r(t)$. Thus, for the $i$th cochlear channel the $k$th cell output is given by $$ s_{ik}(t) = h_{ik}(t) * r_i(t) = \int_{0}^{\infty} h_{ik}(t')\, r_i(t-t')\, dt' $$ The inputs $r_i(t)$ to the AM filter-banks were derived from a model of the auditory periphery implemented by the Sheffield Auditory Group [9]. This consists of an interpolated look-up table for the outer-middle ear transfer function, a gammatone filter-bank [10] to simulate the basilar membrane, and an inner hair-cell model [11] which gives an estimate of the auditory nerve firing probability. The cochlear filters are spaced on the ERB rate scale [12]. In total, therefore, there are $nm$ cells, where $n$ is the number of cochlear channels and $m$ is the number of AM channels. When the system is presented with a periodic stimulus such as an isochronous sequence of tone bursts, the cortical cells give rise to a response pattern as illustrated in Fig 1.

FIG 1 HERE

::::::::::::::::::::::::::::::::::::::(start of caption)

Figure 1. Temporal integration. The output of cortical cells in response to a sequence of 40 ms 1 kHz tone bursts with a repetition period of 200 ms: (a) after 1 event (0.04 s); (b) after 2 events (0.24 s); (c) 3 events (0.44 s); (d) 4 events (0.64 s); (e) 8 events (1.44 s); (f) 16 events (3.04 s). There are 1000 cells in total, where m = 40 (range 2-64 Hz at 8 per octave) and n = 25 (range 0.5-2 kHz). The response pattern is a set of harmonic series with fundamentals at 5 Hz for those cochlear channels within a critical band of 1 kHz. The AM spectra develop over time but reach an asymptote depending on the damping ratio (Q = 16 here) of the cell impulse response, and thus may be considered a form of temporal integration [13,14]. Note that a harmonic series forms even after just two events, thus representing a single time interval. The position of the maximum of the fundamental gives a representation of the time interval. Consistent with the psychophysics of temporal judgements [15,16], cortical time slightly over-estimates real time. Indeed, if presented with a single continuous tone of the same duration, this over-estimation is considerably larger and may account for the so-called filled-interval illusion. However, as more events arrive the spectra sharpen up and the over-estimation is reduced. If the input stimulus is discontinued, the AM spectra decay with a time-constant dependent on both the damping and the BMF of the cell. The cell population as a whole may thus be considered one form of sensory memory. Since the damping ratio is constant, the decay time will be longer for lower modulation frequencies.

:::::::::::::::::::::::::::::::::::::::(end of caption)

The cross-correlation gating mechanism was simulated by computing a product moment correlation $\rho_{ij}$ between all pairs of AM spectra $s_{ik}$ and $s_{jk}$, so that $$ \rho_{ij} = { m \sum_k s_{ik}(t) s_{jk}(t) - \sum_k s_{ik}(t) \sum_k s_{jk}(t) \over \sqrt{m\sum_k s_{ik}(t)^2 - \Bigl(\sum_k s_{ik}(t) \Bigr)^2} \sqrt{ m\sum_k s_{jk}(t)^2 - \Bigl(\sum_k s_{jk}(t) \Bigr)^2} } $$ which is updated whenever a new tone burst event is detected. This has the advantage that the correlation is not explicitly carried out in time, since time is implicit in the spectra.
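These two stages can be sketched numerically as follows (again an illustration only: the sampling rate, the truncation of the impulse response, and the use of zeta ~ 1/(2Q) to set the damping from the Q = 16 quoted in Fig 1 are my assumptions, not the letter's):

    import numpy as np

    def cell_impulse_response(bmf_hz, q=16.0, fs=1000.0, dur=4.0):
        """h_k(t) = exp(-zeta w_k t) cos(w_k sqrt(1 - zeta^2) t)."""
        zeta = 1.0 / (2.0 * q)              # assumed Q-to-damping relation
        t = np.arange(int(dur * fs)) / fs
        w = 2.0 * np.pi * bmf_hz
        return np.exp(-zeta * w * t) * np.cos(w * np.sqrt(1.0 - zeta**2) * t)

    def cell_output(r_i, bmf_hz, fs=1000.0):
        """s_ik(t): causal convolution of the channel firing rate r_i
        with h_ik, approximating the integral above."""
        h = cell_impulse_response(bmf_hz, fs=fs)
        return np.convolve(r_i, h)[:len(r_i)] / fs

    def am_spectrum_correlation(s_i, s_j):
        """Product-moment correlation rho_ij between the m-point AM
        spectra of cochlear channels i and j at the current time."""
        m = len(s_i)
        num = m * np.sum(s_i * s_j) - np.sum(s_i) * np.sum(s_j)
        den = (np.sqrt(m * np.sum(s_i**2) - np.sum(s_i)**2)
               * np.sqrt(m * np.sum(s_j**2) - np.sum(s_j)**2))
        return num / den

In a full simulation, s_i and s_j would each be the vector of m cell outputs for one cochlear channel, sampled at the moment a new event is detected.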
The event detection is carried out by a parallel cortical mechanism based on the smaller population of cells which have a low-pass, assumed approximately Gaussian, impulse response. Whilst it is beyond the scope of this letter to describe this in any detail, briefly, the mechanism is analogous to the Marr-Hildreth theory of edge detection [17,18,19,20,21,22], except that the multi-scale analysis is done in time rather than visual displacement. Since the low-pass responses are causal they have a finite delay, which is a function of their time-constant, and so the population as a whole acts as a sensory memory, since it has a distribution of time-constants. Thus both the low-pass and band-pass cell distributions have a sensory memory function. The low-pass mechanism appears, at least qualitatively, to account for the phenomenon of backward masking [21], which is central to rhythmic perception.

FIGS 2 AND 3 HERE

::::::::::::::::::::::::::(start of captions):::::::::

Figure 2. Grouping of cochlear channels by frequency proximity. The output of the cortical cells, (a), (c), (e) and (g), and their corresponding cross-correlation matrices, (b), (d), (f) and (h), in response to a 2 second presentation of an ABA-ABA- sequence of 40 ms bursts where the AB inter-onset interval is 100 ms. There are 1000 cells in total, where m = 40 (range 2-32 Hz at 10 per octave) and n = 25 (range 0.5-4 kHz). Four versions are presented with A fixed at 1 kHz but where B = 1 kHz (0 semitones) ((a) and (b)); 1.260 kHz (4 semitones) ((c) and (d)); 1.587 kHz (8 semitones) ((e) and (f)); 2 kHz (12 semitones) ((g) and (h)). When A and B are the same frequency a galloping rhythm is heard. However, as the B tone is moved away from the A tone in frequency the galloping rhythm disappears and two separate streams are heard, an A-A-A- and a B---B---B---. The point above which it is not possible to hear the gallop is the temporal coherence boundary (highly variable as a function of repetition interval), whilst the point below which it is not possible to hear two streams is the fission boundary (relatively fixed at about 3-4 semitones). In the region between these two boundaries it is possible to hear either streaming or the galloping rhythm, depending on attentional set. The cross-correlation matrix has clearly captured the effect of frequency proximity. Neighbourhoods of high cross-correlation indicate fusion. When fission takes place two neighbourhoods are clearly separated. However, in the ambiguous region the two neighbourhoods are still attached.

Figure 3. Grouping of cochlear channels by temporal proximity. The output of the cortical cells, (a), (c), (e) and (g), and their corresponding cross-correlation matrices, (b), (d), (f) and (h), in response to a 5 second presentation of an ABA-ABA- sequence of 40 ms bursts. There are 1000 cells in total, where m = 40 (range 2-32 Hz at 10 per octave) and n = 25 (range 0.5-4 kHz). Four versions are presented where both A and B are fixed, A = 1 kHz and B = 1.260 kHz (4 semitones), but where the repetition interval is varied: (a), (b) 240 ms; (c), (d) 180 ms; (e), (f) 120 ms; (g), (h) 60 ms. A frequency difference of 4 semitones is very close to the fission boundary, which varies slightly with temporal proximity. However, the effect of temporal proximity is clear in the cross-correlation matrices. The degree of fissure between the two groups of cochlear channels increases with increasing repetition rate. This change in the degree of fissure is a function of two main factors. First, if the repetition rate is less than the lowest BMF in the cell distribution then one or both of the fundamentals will not be represented and only their harmonics will be available for cross-channel comparison, some of which may coincide, thus reducing sensitivity. Secondly, at higher repetition rates the harmonics are more widely separated, which also increases sensitivity. The exact value of the cut-off in the distribution is not clear, though Schreiner's data suggest about 2 Hz.

:::::::::::::::::::(end of captions)::::::::::::::::::::
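Before turning to the streaming tests, the event-detection stage described above can be illustrated with a small sketch: a temporal analogue of Marr-Hildreth zero-crossing detection. The scales, the symmetric (rather than causal) kernels and the zero-crossing criterion are all my assumptions, not the author's implementation:

    import numpy as np

    def gaussian_2nd_deriv(sigma, fs):
        """Second derivative of a Gaussian with scale sigma (s)."""
        t = np.arange(-4.0 * sigma, 4.0 * sigma, 1.0 / fs)
        g = np.exp(-t**2 / (2.0 * sigma**2))
        return (t**2 / sigma**4 - 1.0 / sigma**2) * g

    def detect_events(envelope, fs, sigmas=(0.005, 0.010, 0.020)):
        """Candidate event times (s): zero-crossings of the smoothed
        second derivative, pooled over several time scales."""
        events = []
        for sigma in sigmas:
            d2 = np.convolve(envelope, gaussian_2nd_deriv(sigma, fs),
                             mode="same")
            sign = np.signbit(d2).astype(np.int8)
            events.append(np.where(np.diff(sign) != 0)[0] / fs)
        return events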
In order to test the theory, the system was presented with sequences of tone bursts arranged in a repetitive ABA-ABA- pattern, as used in the seminal streaming experiment of Van Noorden [1]. Van Noorden's experiment showed that streaming could be affected by three main factors: frequency proximity, temporal proximity and the number of repetitions of the ABA pattern. It is clear from Fig 2 that the cross-correlated AM mechanism provides a good account of the grouping of cochlear channels by frequency proximity. If two channels are separated by more than about a critical band and their rhythms are either incoherent or out of phase, then they will no longer fuse. The cross-correlated AM mechanism also clearly accounts for the grouping of cochlear channels by temporal proximity (see Fig 3), which is mainly determined by the distribution of the cell BMFs. Similarly, the temporal development of streaming (see Fig 4) may be accounted for by the degree of damping in the cell impulse response.

FIG 4 HERE

:::::::::::::::(start of caption):::::::::::::::::::::

Figure 4. Temporal development of streaming. The output of the cortical cells, (a), (c), (e) and (g), and their corresponding cross-correlation matrices, (b), (d), (f) and (h), in response to an ABA-ABA- sequence of 40 ms bursts. There are 1000 cells in total, where m = 40 (range 2-32 Hz at 10 per octave) and n = 25 (range 0.5-4 kHz). Four versions are presented where both A and B are fixed, A = 1 kHz and B = 1.260 kHz (4 semitones), and the repetition interval is fixed at 60 ms (cf. Fig 3g,h), but where the presentation length is varied: (a), (b) one ABA (0.18 seconds); (c), (d) 2 ABAs (0.42 seconds); (e), (f) 4 ABAs (0.9 seconds); (g), (h) 8 ABAs (1.66 seconds). The cross-correlation matrices show a clear increase in the degree of fissure between the two fused neighbourhoods as a function of the number of ABA repetitions. This behaviour is a function of the decay time of the cells and clearly relates to the phenomenon of temporal integration, as in Fig 1.

:::::::::::::(end of caption)::::::::::::::::::::::::::::
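For concreteness, here is a sketch of the ABA- test sequence used in Figs 2-4 (the sample rate and the 5 ms raised-cosine ramps are my assumptions; the letter specifies only the 40 ms bursts, the frequencies and the repetition intervals):

    import numpy as np

    def tone_burst(freq, dur=0.040, fs=16000, ramp=0.005):
        """A 40 ms tone burst with raised-cosine on/off ramps."""
        t = np.arange(int(dur * fs)) / fs
        x = np.sin(2.0 * np.pi * freq * t)
        n = int(ramp * fs)
        env = np.ones_like(x)
        env[:n] = 0.5 * (1.0 - np.cos(np.pi * np.arange(n) / n))
        env[-n:] = env[:n][::-1]
        return x * env

    def aba_sequence(fa=1000.0, fb=1260.0, ioi=0.100, n_reps=8, fs=16000):
        """ABA- pattern: A at 0, B at ioi, A at 2*ioi, then a rest,
        giving a period of 4*ioi (one ABA spans 3*ioi, cf. Fig 4)."""
        period = int(4 * ioi * fs)
        out = np.zeros(period * n_reps)
        for rep in range(n_reps):
            for k, f in enumerate((fa, fb, fa)):
                start = rep * period + int(k * ioi * fs)
                burst = tone_burst(f, fs=fs)
                out[start:start + len(burst)] += burst
        return out

    stimulus = aba_sequence(ioi=0.060, n_reps=8)   # cf. Fig 4 (g), (h)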
In conclusion, then, the model proposed here is able to account for all the main features of auditory streaming and is consistent with both the psychoacoustical and the neurophysiological data. In a current development of the theory, the AM coherence mechanism is used to distribute or gate the cochlear nerve inputs to disjoint subpopulations of the low-pass sensory memory mechanism. In this way it should account for the disappearance of the galloping ABA rhythm, since backward masking will only take place within a single subpopulation. There still remains some uncertainty about the exact values of the parameters, and clearly a constant-Q logarithmic AM cell distribution is an approximation. However, it is believed that these uncertainties may be resolved with further testing of the model. Finally, it is believed that a similar mechanism may also operate with sub-cortical AM cells (e.g. in the inferior colliculus) whose BMFs are in the periodicity pitch range.

References

1. Van Noorden, L.P.A.S. PhD Thesis. Eindhoven: IPO. (1975)
2. Bregman, A.S. Auditory Scene Analysis. Cambridge, MA: MIT Press. (1990)
3. Beauvois, M.W. and Meddis, R. Q. J. Exp. Psychol. 43, 517-541. (1991)
4. Brown, G. and Cooke, M. Proc. International Joint Conference on Computational Auditory Scene Analysis. Montreal. (1995)
5. Schreiner, C.E. and Urbas, J.V. Hear. Res. 32, 49-64. (1988)
6. Bregman, A.S., Abramson, J., Doering, P. and Darwin, C.J. Percept. and Psychophys. 37, 483-493. (1985)
7. Sheft, S. and Yost, W.A. J. Acoust. Soc. Am. 92, 2361. (1992)
8. Schreiner, C.E. and Langner, G. J. Neurophysiol. 51, 1823-1840. (1988)
9. Brown, G. and Cooke, M. Comp. Speech and Lang. 93, 2870-2878. (1994)
10. Patterson, R., Nimmo-Smith, I., Holdsworth, J. and Rice, P. Cambridge: APU Report 2341. (1988)
11. Meddis, R. J. Acoust. Soc. Am. 79(3), 702-711. (1986)
12. Glasberg, B. and Moore, B. Hear. Res. 47, 103-138. (1990)
13. Todd, N.P. McAngus J. Acoust. Soc. Am. 96(5), Pt. 2, 3258. (1994)
14. Todd, N.P. McAngus B. J. Audiol. 28. (1994)
15. Woodrow, H. J. Exp. Psychol. 17, 167-152. (1934)
16. Drake, C. and Botte, M. Percept. and Psychophys. 54(3), 277-286. (1994)
17. Marr, D. and Hildreth, E. Proc. R. Soc. Lond. B 200, 269-294. (1980)
18. Todd, N.P. McAngus J. Acoust. Soc. Am. 92(4), Pt. 2, 2380. (1992)
19. Todd, N.P. McAngus J. Acoust. Soc. Am. 93(4), Pt. 2, 2363. (1993)
20. Todd, N.P. McAngus J. Acoust. Soc. Am. 96(5), Pt. 2, 3301. (1994)
21. Todd, N.P. McAngus J. New Mus. Res. 23(1), 25-70. (1994)
22. Todd, N.P. McAngus J. Acoust. Soc. Am. 97(3), 1940-1949. (1995)