
"Hard problems in computational auditory scene analysis"

Dear AUDITORY list (and IJCAI CASA workshop participants) -

A couple of months ago I was independently contacted by several
students who were curious about computer models of auditory processing and
looking for advice on a neat project.  Responding to them made me
realize that I was a little bit confused about how far 'my field'
extended, and, as a result, I started working on an essay to clarify my
thinking.

Since one purpose of the essay was to search for a generally-
acceptable statement of the 'hard problems' in the field of
computational auditory scene analysis, I thought I would send it out to
this list.  I enclose an ASCII version of the paper; a slightly better-
formatted version can be found on the web at:


It is my hope that the essay, presumptuous as it is, may be the starting
point for debate!

--  DAn Ellis  <dpwe@media.mit.edu>  <http://sound.media.mit.edu/~dpwe/>
    MIT Media Lab  Perceptual Computing - Machine Listening Group.

 - - - - - - - - - - ~/tmp/hard-probs.txt - - - - - - - - - -


               Dan Ellis, dpwe@media.mit.edu, 1995aug03

// Introduction

One of the difficulties of working in the field of computational auditory
scene analysis (CASA) - building computer models of higher auditory functions
- is the nebulous nature of the goals. In the related field of speech
recognition, it is relatively easy to define a widely-acceptable target, such
as a machine that can sit in a meeting room and transcribe the
discussion. Another analogous field is machine vision and this too tends to be
very goal-driven (finding the faces in a scene, recognizing particular body
gestures); perhaps the fact that most vision researchers have abandoned the
idea of a general-purpose scene analyzer in favor of more limited and specific
goals should serve as stern advice to researchers in audition. Outside of
speech recognition, similar goal-driven applications don't pop up with the
same urgency in the acoustic domain. As a result, there are wide differences
of opinion over the essence of computational auditory scene analysis; the body
of researchers who identify with the CASA banner can sometimes feel
perplexingly out of sympathy with the colleagues they find beneath it.

This paper is an effort to ameliorate that confusion by offering a common
focus for the field in the form of a description of a set of hard
problems. These might constitute a starting point for a debate within our
community over what truly are the questions that we should be trying to
answer. It is unlikely (and of questionable desirability) that a neat
consensus will result, with everybody persuaded to work on the same goals. But
it would be valuable to have an overt description of the different
perspectives in the field, and a statement of the common problems which may be
being studied by several researchers using subtly different formulations.

// Aiming high : holy grails

If a talented student expresses an interest in auditory information processing
and its modeling, what guidance might he or she be given concerning an area to
study? This question has obvious practical relevance, since we who believe in
the importance and interest of this area presumably wish to encourage its
growth. While one sure way of discouraging potential recruits is to direct
them towards an intractable problem, I feel that identifying the ideal goals,
the `holy grails' of the field, would help both in motivating research and in
identifying relevant and valuable areas for work. Here are my proposals for
this category:

The sound-scene describer.  This is a program that processes a real
environmental acoustic signal and converts it into a symbolic description
corresponding to a listener's perception of the different sound events and
sources that are present. The description might be verbal, akin to that which
a person might produce if asked to describe the sound-scene, or an analogous
abstract representation. Applications for this kind of system include aids for
the deaf (to convert acoustic cues into text or another modality) and
automatic indexing of soundtrack data (e.g. to find explosions or helicopter
sounds in a database of movie audio).

The source-separator or `unmixer'.  Rather than converting a sound mixture
into an abstract description, one could imagine a machine that takes a single
input and produces several separate output channels, each composed of the
sound from a single source in the input. In most cases, human listeners would
be able to judge if such a system was `working', i.e. whether the separated
outputs matched the listener's internal perception of the different
contributions; this is the closest we have to a rigorous formulation of the
problem. Applications for a system of this kind include the restoration of
recordings corrupted by unwanted noises (e.g. coughs at a concert) or
hearing-aids for cocktail-party situations.

Predictive human model.  A major obstacle to certain research projects is that
listening tests must be included in the development loop - a perfect example
being high-quality masking-based audio compression. In theory, an automatic
system could process a sound to predict its subjectively-rated similarity to
an original, and the obtrusiveness of any distortion introduced. (Of course,
the understanding of human perception that permitted the construction of the
model would also have a profound influence on the design of such encoding
algorithms.) The insights afforded by such a system would also inform a range
of activities from treatment of hearing loss to entertainment sound-design.

// Hard Problems

The goals and applications described above are unlikely to be achieved in the
short-term, but they comprise a context within which to propose and compare
more feasible projects. The task of identifying the `hard problems' in the
field thus becomes a question of focusing on the major stumbling-blocks
separating us from these ultimate goals. Since the goals are distant, the
stumbling-blocks are also indistinct, but the following constitute my
perception of the critical breakthroughs that need to be made:

The nature of cues: While the importance of certain cues (such as those
discussed below) is generally accepted, it is likely that there are more
subtle cues being used that we have not yet uncovered. For example, the
phenomenon of comodulation masking release, where different frequency channels
are fused strongly on the basis of shared aperiodic modulation, would seem to
present tantalizing evidence for a broader mechanism of across-frequency
grouping.

Onset and common-period detectors: These strongest of cues to fusion and event
formation still elude really convincing signal-processing implementations,
despite numerous attempts. Simple first-order differencing on energy in each
frequency channel is confused by sweeping tones, and harmonic trackers have
difficulty deciding if certain frequency ratios are adequate for fusion. These
kinds of low-level cue detectors must be ripe for definitive modeling,
although the trick may lie in their codependence on higher-level analysis.
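
The first-order differencing scheme mentioned above is easy to sketch, and
the sketch makes its fragility plain. Here is a minimal Python/numpy version
(an illustration only: FFT-band pooling stands in for a proper auditory
filterbank, and the frame sizes and band count are invented):

```python
import numpy as np

def onset_strength(signal, sr, n_bands=32, frame=512, hop=256):
    """Per-band onset detector: half-wave rectified first-order
    difference of log energy in each frequency band, summed over
    bands.  Sweeping tones will fool it, as noted in the text."""
    n_frames = 1 + (len(signal) - frame) // hop
    window = np.hanning(frame)
    spec = np.empty((n_frames, frame // 2 + 1))
    for i in range(n_frames):
        seg = signal[i * hop : i * hop + frame] * window
        spec[i] = np.abs(np.fft.rfft(seg)) ** 2
    # pool FFT bins into coarse frequency bands
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    bands = np.stack([spec[:, a:b].sum(axis=1)
                      for a, b in zip(edges[:-1], edges[1:])], axis=1)
    log_e = np.log(bands + 1e-10)
    # positive-going jumps in log energy signal onsets
    return np.maximum(np.diff(log_e, axis=0), 0.0).sum(axis=1)
```

A tone that sweeps across band boundaries produces spurious positive
differences in each band it enters, which is exactly the confusion
described above.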

Binaural cue detection: The correct detection and integration of interaural
timing and level differences is probably closer to a satisfactory model,
although using these to partition the sound energy into separate objects
presumably still relies on integration with as-yet unknown higher-level
processes.

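For the interaural timing cue at least, the basic detection step has a
standard signal-processing form. The Python/numpy sketch below is a single
broadband estimate (real models operate per frequency channel with running
cross-correlation, and the 0.8 ms lag limit is an assumed head-size bound):

```python
import numpy as np

def estimate_itd(left, right, sr, max_itd_s=0.0008):
    """Return the interaural time difference in seconds: the lag
    (positive when the right-ear signal trails the left) that
    maximizes the cross-correlation of the two ear signals."""
    max_lag = int(round(max_itd_s * sr))
    lags = range(-max_lag, max_lag + 1)
    corr = [np.dot(left[max(0, -l):len(left) - max(0, l)],
                   right[max(0, l):len(right) - max(0, -l)])
            for l in lags]
    return (int(np.argmax(corr)) - max_lag) / sr
```
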
Factoring-out channel characteristics: Human listeners are highly successful
at ignoring all but the most extreme colorations and blurrings resulting
from fixed acoustic channel characteristics (e.g. room reverberation). This
must be achieved through a combination of low-level suppression of
reflections with more abstract steadiness constraints; however, their
precise nature remains mysterious.

Event formation: The core of most work on computational auditory scene
analysis has consisted of cue detectors driving an algorithm to simulate the
fusion of energy from different frequency bands into single `perceptual
events'. A proper model of human performance would deal with a broader range
of event classes.

Properties of events: Distinct from the local attributes of acoustic energy
that are the cues to event formation, each distinctly-perceived sound object
has its own global properties such as pitch and `timbre' that are somehow
derived from its components. Choosing the right representation for these
properties and discovering how they are calculated is a prerequisite for
successful models of higher stages of abstraction.

Sequential processing and stream formation: Beyond the level of auditory
events is the process of grouping events into `streams' - sets of distinct
sounds perceived as arising from a single source. The manner in which these
patterns of temporally-distinct energy are processed and organized constitutes
a completely different set of principles to those involved in the formation of
events. In all likelihood, modeling this process will require adapting the
low-level processing to the expectations derived at this and higher levels.

Short-term context-sensitivity: Many psychoacoustic phenomena (such as those
associated with the `old-plus-new' principle) underscore the importance of
short-term context/expectation/potentiation in auditory perception. This is
not easily incorporated into largely stateless signal-processing front ends,
whose adaptability is generally limited to automatic gain control.

Internal representation and storage: Human listeners are able to remember,
generalize and classify instances of sound events. The imprecise nature of
this process, where every unique bark of a dog sounds somewhat the same,
presents an interesting representational challenge of extracting and storing
only the `important' parts of each sound-object. The problem of recognition
may appear distinct from that of segregation/organization, but in practice the
partial detection of a known sound is bound to influence the organization
of the rest of the scene.

Constructive analysis of mixtures: While the preceding issues apply even to
the treatment of isolated sound events, our real interest lies in the ability
of the auditory system to handle complex mixtures of sound that overlap in
time and frequency. Illusion and restoration phenomena suggest that this is a
constructive process i.e. a question of coming up with a hypothetical
real-world situation that would be consistent with the received acoustic
signal, rather than directly deducing each scene element from parts of the
input data. The details of this process remain unclear.

Evidence integration: Sound organization systems often encounter the problem
of needing to combine information obtained from a wide range of cue detectors
and other sources such as vision and other modalities. A typical approach has
been to implement an algorithm that combines types of cue in a fixed
sequence. In contrast, the robustness of the human listener under a wide
range of confusions implies that a more adaptive or general process is at
work. Principled evidence integration (such as Bayesian belief networks) seems
closer to the right approach.
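
To make `principled evidence integration' concrete, here is the smallest
possible example: a naive-Bayes combination of independent grouping cues
into a posterior probability that two sound components share a source. (A
sketch only: it assumes the cues are conditionally independent with fixed
likelihoods, where a real belief network would model their dependencies.)

```python
import math

def combine_cues(prior_same, cue_likelihoods):
    """Combine independent cues in log-odds form.  Each cue supplies a
    pair (P(cue | same source), P(cue | different sources)); returns
    the posterior probability that the two components share a source."""
    log_same = math.log(prior_same)
    log_diff = math.log(1.0 - prior_same)
    for p_same, p_diff in cue_likelihoods:
        log_same += math.log(p_same)
        log_diff += math.log(p_diff)
    m = max(log_same, log_diff)           # normalize for stability
    w_same = math.exp(log_same - m)
    w_diff = math.exp(log_diff - m)
    return w_same / (w_same + w_diff)
```

A strong common-onset cue and a moderate harmonicity cue, each individually
equivocal, can combine into near-certain fusion; and adding a binaural cue
later simply appends another likelihood pair, rather than requiring a new
fixed sequence of cue types.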

Neural plausibility: While the fact that we are trying to model a system built
out of neurons is often ignored (perhaps wisely), the question of how a given
algorithm may be implemented in a biological brain comprises a boundary around
the kinds of models we can reasonably propose. Unfortunately, the structure of
the digital computer and its common programming languages is very far removed
from the brain's architecture; this gap (and its impact on models) might be
reduced with a more brain-like (parallel, distributed) computational paradigm.

// Projects for enthusiasts

The previous list presents a set of intellectual problems that need eventually
to be solved, but for which no solution seems likely in the short term. To
return to our original scenario of the enthusiastic student looking for a
topic, it might be useful to accumulate a set of `ideologically-approved'
projects that will both encourage the researcher to think about the problems
that we consider important and advance our efforts towards solving
them. Here are a few suggestions:

Breaking glass detector. This idea was actually suggested by Josh Wachman, a
student of Roz Picard here at the Media Lab. Their interest is in automatic
media annotation, specifically the idea of using soundtrack as well as the
moving image to derive information from a recording. Their domain, action
movies, contains many catastrophic events (explosions, crashes, things
shattering) with ecologically-characteristic transient sound
patterns. Compared to, say, detecting the sound of a car engine, it should be
relatively easy to pick out many of the gunshots and punches, and classify
them according to a few parameters derived from their spectra.
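
As an indication of how simple such a classifier could be, the following
Python/numpy fragment computes two crude per-event descriptors: spectral
centroid (`brightness' -- shattering glass is bright, a punch is dull) and
overall log energy. These particular features are my own illustrative
guesses, not those of the Media Lab annotation work:

```python
import numpy as np

def transient_features(event, sr):
    """Two crude descriptors of a transient sound event: spectral
    centroid in Hz (brightness) and overall log energy (loudness)."""
    win = np.hanning(len(event))
    mag = np.abs(np.fft.rfft(event * win))
    freqs = np.fft.rfftfreq(len(event), 1.0 / sr)
    centroid = float((freqs * mag).sum() / (mag.sum() + 1e-12))
    log_energy = float(np.log((event ** 2).sum() + 1e-12))
    return centroid, log_energy
```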

Voice counter/streamer. In a similar domain, soundtracks that are known
principally to contain speech might be processed by today's harmonic-sound
extractor algorithms to detect all the voiced-syllable entities, which could
then be streamed into separate monologues and possibly identified with known
participants based on larger-scale statistics such as pitch range and syllable
rate. (At the Media Lab, Michael Hawley and the students of Chris Schmandt
have considered systems of this ilk).
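
The per-syllable pitch estimate at the heart of such a system can be as
plain as an autocorrelation peak-picker, sketched below in Python/numpy
(the harmonic-sound extractors referred to above are far richer; the
60-400 Hz search range is an assumed voice-pitch band):

```python
import numpy as np

def pitch_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate the pitch of a voiced frame as the autocorrelation
    peak within the plausible voice-pitch lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

Tracking these estimates over time, and clustering them by pitch range, is
then the basis for assigning syllables to talkers.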

Sound similarity model. High-performance sound compression algorithms are
looking for the loosest approximation to the original signal that still sounds
good to a human listener. This is a highly complex and poorly-understood
criterion, but a similarity metric that ignored static phase and magnitude
distortion, while emphasizing gating of high-frequency energy, seems
technically feasible and might be a useful approximation to the `human model'
holy grail. The wideband audio coding community is the natural home for this
kind of work.

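One step in that direction is a distance computed on log-magnitude
spectrograms only, so that any static phase distortion costs nothing. The
Python/numpy sketch below captures just that one property (it omits
masking, loudness scaling and the high-frequency gating that a usable
metric would need; the frame sizes are invented):

```python
import numpy as np

def spectrogram(x, frame=512, hop=256):
    """Magnitude short-time spectrum, one frame per row."""
    w = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    return np.stack([np.abs(np.fft.rfft(x[i * hop:i * hop + frame] * w))
                     for i in range(n)])

def phase_blind_distance(ref, test):
    """Mean squared difference of log-magnitude spectrograms:
    insensitive to static phase, sensitive to spectral change."""
    a = np.log(spectrogram(ref) + 1e-8)
    b = np.log(spectrogram(test) + 1e-8)
    return float(np.mean((a - b) ** 2))
```
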
Constructive explanation in restricted domains.  The idea of explaining a
signal by guessing the components that have caused it and then checking what
they would predict (forward modeling) rather than deriving their
characteristics directly from the resulting signal (backward modeling) entails
a new kind of algorithm that is currently little understood. To define a
somewhat tractable problem of this kind might entail dramatically restricting
the domain to, e.g. only noise bursts or steady tonal events. Such a `toy
problem' could yield valuable insights into the general properties of such
analysis-by-synthesis systems.
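
In the most restricted domain imaginable -- isolated pure tones in a
magnitude spectrum -- the forward-modeling loop reduces to a few lines:
hypothesize the tone that best explains the residual, synthesize its
predicted contribution, subtract, repeat. The Python/numpy toy below is
only meant to show the shape of such a loop (modeling a tone's
contribution as a five-bin peak is a deliberate oversimplification):

```python
import numpy as np

def explain_spectrum(mag, freqs, n_tones=3):
    """Greedy analysis-by-synthesis over pure-tone hypotheses:
    returns a list of (frequency, amplitude) guesses whose predicted
    contributions account for the spectrum's main peaks."""
    residual = np.array(mag, dtype=float)
    hypotheses = []
    for _ in range(n_tones):
        k = int(np.argmax(residual))
        if residual[k] <= 0:
            break
        hypotheses.append((float(freqs[k]), float(residual[k])))
        # forward model: a tone predicts a narrow peak; remove it
        residual[max(0, k - 2):k + 3] = 0.0
    return hypotheses
```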

Streaming systems.  One popular topic for sound-organization models has been
simulation of musical streaming and reproduction of phenomena such as the
`trill-threshold' (e.g. Beauvois & Meddis, and the neural-network system of
Brown & Cooke). While the ecological significance of these stimuli is a little
obscure, it is a neat place to start, with plenty of experimental results to
match. Streaming systems that better explain the influence of `timbre' would
be a worthy achievement.
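
The simplest streaming rules are correspondingly simple to state. As a
caricature (in Python; the three-semitone threshold is invented, and real
models such as Beauvois & Meddis are dynamical rather than greedy),
consider assigning each tone to the stream whose last tone is nearest in
log-frequency:

```python
import math

def assign_streams(tone_freqs, max_jump_semitones=3.0):
    """Greedy frequency-proximity streaming: a tone joins the stream
    whose most recent tone is closest in log-frequency, or founds a
    new stream when every candidate jump exceeds the threshold."""
    streams, last = [], []   # tone indices per stream; last freq per stream
    for i, f in enumerate(tone_freqs):
        dists = [abs(12.0 * math.log2(f / lf)) for lf in last]
        if dists and min(dists) <= max_jump_semitones:
            s = dists.index(min(dists))
            streams[s].append(i)
            last[s] = f
        else:
            streams.append([i])
            last.append(f)
    return streams
```

An alternating sequence splits into two streams when the alternation is
wide and fuses into one when it is narrow -- the trill-threshold pattern in
its crudest form; what such a rule cannot yet do is weigh `timbre' against
frequency proximity.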

Short-term learning of percussive music.  I have thought about using some of
the techniques of machine learning to build a system that derives the minimum
set of individual spectra which can be combined to form an observed series of
composite events. The exemplary domain for this is dense percussion music,
where
instruments rarely occur in isolation, but are usually cosynchronous with
their peers. Listening to such music, one rapidly builds up an idea of the
identity and number of instruments that are present, which can suddenly alter
if two previously unison instruments separate. This indicates the important
role of short-term memory in event fusion, i.e. that common-onset is a
flexible cue strongly influenced by recent experience.
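
A modern stand-in for deriving the minimum set of individual spectra is
nonnegative matrix factorization, which did not exist under that name in
1995 but matches the intent: factor the observed composite-event spectra
into a small dictionary of source spectra and their per-event activations.
A Python/numpy sketch using the standard multiplicative updates:

```python
import numpy as np

def factor_spectra(V, n_sources, n_iter=200, seed=0):
    """Factor a nonnegative (events x bins) matrix V into activations
    W (events x sources) and source spectra H (sources x bins) by
    multiplicative updates minimizing KL divergence, so that each
    composite event is explained as a sum of shared spectra."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_sources)) + 0.1
    H = rng.random((n_sources, V.shape[1])) + 0.1
    for _ in range(n_iter):
        W *= (V / (W @ H + 1e-9)) @ H.T / (H.sum(axis=1) + 1e-9)
        H *= W.T @ (V / (W @ H + 1e-9)) / (W.sum(axis=0)[:, None] + 1e-9)
    return W, H
```

The sudden reinterpretation described above, when two previously unison
instruments separate, corresponds to the moment when a rank-k factorization
stops fitting and a (k+1)-th spectrum must be introduced.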

// Conclusion

In an effort to define a unifying focus for researchers modeling higher
auditory functions, I have listed my vision of what the ultimate goals of this
work might be, and what specific discoveries or techniques must be developed
to get us there. I have added a collection of more tractable projects based
on the same ideas, perhaps to provide some inspiration for newcomers to the
field. I do not claim to have done a definitive job with any of these lists,
but I hope that other members of the community will share my interest in
producing this kind of manifesto, and will either suggest some of their ideas
for inclusion in future versions of this document, or produce alternative
versions of their own.

// Acknowledgments

This paper has had the benefit of direct input from the following people,
whose contributions are gratefully acknowledged: Bill Gardner, Kunio Kashino,
Keith Martin, David Rosenthal and Lonce Wyse.

                    Copyright (c) 1995 Dan Ellis.
You may redistribute this article to anyone for any non-commercial purpose.
                 The current version is available at:

 - - - - - - - - - - - - - - - - - - - - - - - - - - -