Subject: Computational Auditory Scene Analysis Workshop Summary
From:    Malcolm Slaney <malcolm(at)INTERVAL.COM>
Date:    Wed, 13 Sep 1995 10:44:48 -0700

Report on the Computational Auditory Scene Analysis Workshop
Malcolm Slaney, Dan Ellis, Dave Rosenthal
Montreal, Quebec, Canada, August 19 and 20th, 1995.

The first workshop on Computational Auditory Scene Analysis (CASA) was held
August 19 and 20th at the 1995 IJCAI (International Joint Conference on
Artificial Intelligence) in Montreal.  Organized by Hiroshi Okuno and David
Rosenthal, the workshop was attended by about thirty people working on
scientific and engineering models of human audition and signal processing.
Perhaps the workshop will best be remembered as the largest gathering to date
of people interested in computer models of auditory scene analysis (ASA).  The
attendees were nearly evenly split between those interested in understanding
human auditory perception and those who want to solve problems in auditory
perception, perhaps using some of the techniques of auditory scene analysis.

Al Bregman served as keynote speaker for the conference.  His book, "Auditory
Scene Analysis," motivates this new field, and most of the talks at the
workshop were attempts to implement the principles of scene analysis in
computer models.  His keynote address emphasized the old-plus-new principle of
auditory organization (old sounds are stable; new sounds are formed from sound
components that can't be explained by the old) and architectures for computer
models of auditory scene analysis.

The rest of the workshop was devoted to poster presentations by the workshop
participants and group discussions.  The three discussions will be described
first, then the posters will be listed.  For more information, look for the
book "Computational Auditory Scene Analysis," to be published by Erlbaum early
next year!

DISCUSSION 1 - "A critique of pure audition" or, "Bottom-up vs. top-down"
Led by Malcolm Slaney.

Malcolm motivated the discussion with a series of very telling counter-examples
to the notion that vision is purely bottom-up, followed by some analogous
examples in audition.  Percepts such as apparent motion are often considered to
be locally calculated, yet a quartet of ambiguously flashing dots all appear to
move in the *same* one of the two possible directions; clearly there is a
context-dependent, top-down aspect to this perception.

A discussion of the distinction between bottom-up and top-down processing
suggested that they could be functionally equivalent, but that top-down systems
- where information flows both ways between processing modules - should be a
more efficient implementation.  It is unclear how to interpret the
physiological reality of efferent fibres at every stage of the auditory chain
in terms of the algorithm they implement.

Much of the auditory scene analysis work is based on Marr's theory of vision.
Marr's theory is very beautiful and has stimulated much good research, but its
premise that humans reconstruct the entire visual scene in the brain may not be
appropriate.  Perhaps an AI-like or top-down approach such as Nakatani's is
more appropriate.  This led to a vigorous discussion about information flow
during perception tasks.  It was also noted that many phenomena, such as
auditory induction and restoration, might simply be limited to the very last
stage of processing - more a case of top-top processing than top-down.
Perhaps the final element of a bottom-up processing chain is the addition of
information from external sources such as memory and expectation to form the
resulting percept.

In summary, despite the fact that many existing models of auditory organization
are almost exclusively bottom-up - working from raw data to result, with no
adaptation or context-dependence - few modelers would deny the value of
top-down processing.  However, there is much less agreement about which
features of a model actually reflect this top-downness.

DISCUSSION 2 - "CASA: Physiological vs. Functional Models"
Led by Hamid Nawab.

A quick vote showed that the participants were split evenly between those
wanting to simulate the physiology (at different levels) and those wanting to
build models that solve an engineering problem.  The most important outcome of
the discussion was that there are two equally valid problems, one based on the
science of understanding our auditory system and the other based on finding
engineering solutions to auditory problems.  Other questions and conclusions
included:
Does the brain process/evaluate symbols?
Can we build models of psychophysical data without modeling the physiology?
Marr's distinction between implementation, algorithm, and theory is important.
Is physiological ASA analogous to learning to fly by studying birds?
Do the physiological experiments inform functional models?
Are we doing science or engineering?
Is localization important to the problem?
A question from the physiologists for the functionalists: How do we use
massive parallelism?
A question from the functionalists for the physiologists: What auditory cues
are there?

DISCUSSION 3 - "Hard problems in representation and auditory scene analysis"
Led by Dave Rosenthal and Dan Ellis.

To emphasize the fact that different participants use representations in very
different ways, the discussion started with a poll of the audience to see who
identified with a variety of 'ists': Representation-ists, Agent-ists,
Blackboard-ists, Neural-ists, Oscillator-ists, Pattern-ists, Application-ists,
Physiology-ists and Psychology-ists.  Average votes per participant were about
2.2, so perhaps this covered the field.

We then tried to come up with a list of goals we wanted to achieve with our
representations, by way of motivating the desirable features.  This produced a
range of responses, from 'isomorphism with human behavior' to 'detecting the
sound of harmonic oscillator systems in a mixture' - perhaps reflecting the
physiology/application split that emerged in discussion 2.

Participants then described some of the different representations they use,
with justifications.  These ranged from simple filtered versions of the raw
audio (to emphasize a feature like a particular musical pitch), to adaptive
filterbank processing (to achieve a context-dependent representation always
best suited to the signal at hand), to symbolic descriptions of the inferred
source characteristics of all the sound sources in a given environment (such as
a discussion), collapsed across the many sound events produced by each source.
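As a concrete illustration of the simplest of these representations - a fixed
set of bandpass filters applied to the raw audio - here is a minimal Python
sketch.  It is our own illustration rather than any participant's code, and the
band edges, filter order, and sample rate are arbitrary choices; the adaptive
and symbolic representations mentioned above go well beyond this kind of fixed
front end.

    # Minimal sketch of a fixed bandpass filterbank (illustrative only; the
    # band edges, filter order and sample rate are arbitrary choices).
    import numpy as np
    from scipy.signal import butter, lfilter

    def filterbank(x, sr, edges=(100, 200, 400, 800, 1600, 3200)):
        """Split signal x (sampled at sr Hz) into bandpass channels."""
        nyq = sr / 2.0
        channels = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            b, a = butter(2, [lo / nyq, hi / nyq], btype='band')
            channels.append(lfilter(b, a, x))
        return np.array(channels)   # shape: (n_bands, n_samples)

    # A 440 Hz tone ends up mostly in the 400-800 Hz channel.
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440.0 * t)
    bands = filterbank(tone, sr)
    print([round(float(np.sqrt(np.mean(ch ** 2))), 3) for ch in bands])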
The second part of the discussion looked more generally at the question of hard
problems in the field.  The idea was to identify some specific problem areas
that must be addressed before a 'complete model' of auditory organization can
be built, then to try to predict, or set targets for, progress on each area two
years from now.

One domain for improvement is to refine our existing organization models to
deal with less restricted classes of sound.  For instance, in two years' time,
will a system be able to take a monophonic recording of two speakers with
similar pitch ranges and separate their speech, including both voiced and
unvoiced portions?  The Sheffield-ATR 'ShATR' multi-speaker corpus CD-ROM was
identified as an excellent resource for this kind of project.  There was also
significant discussion of issues relating to non-speech sounds.  Do we need a
better taxonomy, or perhaps some kind of comprehensive corpus, to redress the
excessive attention given to auditory organization as applied exclusively to
speech sounds?

The final part of this discussion considered what the successor to this
workshop should be.  A comprehensive list of the conferences preferred by the
workshop participants revealed no outright favorites, although the Acoustical
Society meetings and the Neural Information Processing Systems conferences were
popular and had the advantage of being strongly multi-disciplinary.  The
possibility of staging a workshop independent of a parent conference was well
received, possibly affiliated with an organization such as the IEEE to add
credibility.  A brief mention of the AUDITORY list expressed the view that it
is under-utilized, possibly because participants are wary of wasting the time
of other members.  The list could function well as a place to post abstracts
and announce the availability of preprints, something rarely seen at present.

POSTERS AND PAPER SYNOPSIS

The poster sessions were distributed throughout the workshop.  This gave
everybody a chance to see everybody else's latest work and to have many
informal discussions.  Here is a brief synopsis of each presenter's work.

Jean Rouat presented his ideas on a new representation for speech recognition
based on short-time autocorrelation; a minimal sketch of this kind of analysis
appears after this list.  (With Miguel Garcia)

Frank Klassner explained how discrepancy diagnosis and signal reprocessing
contribute to his environmental-sound-separating blackboard system.  (With
Victor Lesser and Hamid Nawab)

Alon Fishbach presented a mid-level auditory representation consisting of
spectrogram segments and their features, the segmentation being based on
discontinuities in the sound.

Steven Boker's model was able to predict where listeners placed the downbeat in
rhythmic patterns using a measure of local information-theoretic entropy.

Kunio Kashino's poster detailed his comprehensive music-understanding system
that uses Bayesian networks to integrate levels of information.  (With Kazuhiro
Nakadai, Tomoyoshi Kinoshita, and Hidehiko Tanaka)

Joern Grabke presented a system that separated two voices using a binaural
processor implemented with simple delay lines.  (With Jens Blauert)

Brian Karlsen was launching the ShATR CD-ROM, a large database of real
multi-speaker discussion complete with extensive annotation and tools.  (With
Guy Brown, Martin Cooke, Malcolm Crawford, Phil Green, and Steve Renals)

Ray Meddis described a number of psychophysical principles of pitch and sound
stream suppression.  (With Lowel O'Mard)

Darryl Godsmark presented his blackboard system for accumulating contextual
evidence in order to model human-like competition of cues.  (With Guy Brown)

Dan Ellis presented an overview of auditory representations and a new
representation known as Wefts, based on correlograms.  (With David Rosenthal)
Hideki Kawahara presented his experimental work on the control of pitch by
humans and the control feedback between perception and synthesis.

Frederic Berthommier discussed the implications of physiological mechanisms for
amplitude modulation.  (With Christian Lorenzi)

Masataka Goto presented his real-time system for perceiving the beats in
musical audio.  (With Yoichi Muraoka)

Ludger Solbach compared the wavelet transform to conventional auditory models
as a preprocessor for ASA.  (With Rolf Wohrmann and Jorg Kliewer)

Lonce Wyse presented his work on analyzing audio for content (speech versus
music) and detecting changes in the speaker.  (With Stephen Smoliar)

DeLiang Wang presented his work on relaxation oscillator networks and showed
that they can be used for modeling auditory stream segregation.

Hamid Nawab presented his knowledge-based system for recognizing speech
contaminated with environmental sounds by directed reanalysis.  (With Carol
Espy-Wilson, Ramamurthy Mani, and Nabil Bitar)

Nicholas Saint-Arnaud presented his work on synthesizing realistic audio
textures given a small sample.  (With Kris Popat)

Eric Scheirer presented his system for correlating a musical score with an
audio signal and using the differences to infer expressive performance
information.

Guy Brown presented work on using cortical oscillators to model stream
segregation and showed results consistent with psychophysical data.  (With
Martin Cooke)

Tomohiro Nakatani showed an agent-based system for grouping binaural sounds
using harmonic analysis and binaural information.  (With Masataka Goto,
Takatoshi Ito, and Hiroshi Okuno)

Malini Bhandaru explained work to give an environmental-sound recognition
system the ability to extend its models in response to new examples.  (With
Victor Lesser)
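For readers unfamiliar with the frame-by-frame autocorrelation analysis that
underlies both Rouat's representation and the correlograms on which Wefts are
based, the following minimal sketch may help.  It is our own illustration, not
any presenter's code; the frame length, hop, and lag range are arbitrary
choices.

    # Minimal sketch of short-time autocorrelation (illustrative only).
    import numpy as np

    def short_time_autocorrelation(x, frame_len=400, hop=160, max_lag=320):
        """Return an array of shape (n_frames, max_lag) of autocorrelations."""
        frames = []
        for start in range(0, len(x) - frame_len, hop):
            frame = x[start:start + frame_len]
            frame = frame - np.mean(frame)
            # Full autocorrelation, keeping only non-negative lags 0..max_lag-1.
            ac = np.correlate(frame, frame, mode='full')[frame_len - 1:]
            ac = ac[:max_lag]
            if ac[0] > 0:
                ac = ac / ac[0]          # normalize so lag 0 == 1
            frames.append(ac)
        return np.array(frames)

    # Example: for a 200 Hz tone sampled at 8 kHz, the strongest non-zero-lag
    # peak should sit near lag 40 samples (8000 / 200).
    sr = 8000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 200.0 * t)
    A = short_time_autocorrelation(tone)
    print(int(np.argmax(A[0, 20:]) + 20))   # expect roughly 40

A correlogram-style representation applies this same analysis independently to
each channel of a cochlea-like filterbank rather than to the raw waveform.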