
Dissertation: Prediction-driven Computational Auditory Scene Analysis



After busting through umpteen deadlines of varying degrees of severity,
I have actually finished my Ph.D.  The dissertation might be of interest
to people on this list, so I am including the abstract at the bottom of
this message.  You can also read the abstract and contents, download the
PostScript of the whole document, and listen to the sound examples at
its web site:

        http://sound.media.mit.edu/~dpwe/pdcasa/

Personal note: I have now left MIT to be a post-doc with Nelson
Morgan's speech recognition group at the International Computer
Science Institute, attached to the University of California, Berkeley.
I hope to find ways to use the ideas of computational auditory scene
analysis for the practical benefit of speech recognition systems.  I
have a new email address, dpwe@icsi.berkeley.edu, but will probably
retain my MIT address for the foreseeable future.

  DAn.

 - - - - - - - - - - ~/public_html/pdcasa/front.txt - - - - - - - - - -

       Prediction-driven computational auditory scene analysis

                                  by

                          Daniel P. W. Ellis

  Submitted to the Department of Electrical Engineering and Computer
 Science in partial fulfillment of the requirements for the degree of
            Doctor of Philosophy in Electrical Engineering

             at the Massachusetts Institute of Technology

                              June 1996

ABSTRACT

The sound of a busy environment, such as a city street, gives rise to
a perception of numerous distinct events in a human listener - the
`auditory scene analysis' of the acoustic information.  Recent
advances in the understanding of this process from experimental
psychoacoustics have led to several efforts to build a computer model
capable of the same function.  This work is known as `computational
auditory scene analysis'.

The dominant approach to this problem has been to treat it as a
sequence of modules, the output of one forming the input to the next.
Sound is
converted to its spectrum, cues are picked out, and representations of
the cues are grouped into an abstract description of the initial
input.  This `data-driven' approach has some specific weaknesses in
comparison to the auditory system: it will interpret a given sound in
the same way regardless of its context, and it cannot `infer' the
presence of a sound for which direct evidence is hidden by other
components.
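
To make that one-way flow concrete, here is a deliberately minimal
pipeline sketch in Python (the STFT parameters, the cue threshold, and
the continuity grouping rule are illustrative assumptions, not taken
from any of the systems above).  Because each stage sees only the
previous stage's output, a cue masked by a louder sound never reaches
the grouping stage, and nothing downstream can ask for it back:

    import numpy as np

    def data_driven_casa(signal, frame=512, hop=256):
        # Stage 1: sound -> short-time magnitude spectrum
        windows = np.array([signal[i:i + frame] * np.hanning(frame)
                            for i in range(0, len(signal) - frame, hop)])
        spectra = np.abs(np.fft.rfft(windows, axis=1))
        # Stage 2: spectrum -> cues (bins well above the frame's mean)
        cues = [set(np.flatnonzero(s > 4.0 * s.mean())) for s in spectra]
        # Stage 3: cues -> events, grouped by continuity across frames;
        # no later stage can revise an earlier one's decisions
        events, active = [], {}      # active: frequency bin -> start frame
        for t, bins in enumerate(cues):
            for b in [b for b in active if b not in bins]:
                events.append((b, active.pop(b), t))  # cue gone: close
            for b in bins:
                active.setdefault(b, t)               # new cue: open
        events += [(b, t0, len(cues)) for b, t0 in active.items()]
        return events                # (bin, start frame, end frame) triples

    # A 440 Hz tone in light noise: events cluster around bin 28,
    # since 440 * 512 / 8000 = 28.16
    sr = 8000
    t = np.arange(2 * sr) / sr
    tone = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(2 * sr)
    print(data_driven_casa(tone))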

The `prediction-driven' approach is presented as an alternative, in
which analysis is a process of reconciliation between the observed
acoustic features and the predictions of an internal model of the
sound-producing entities in the environment.  In this way, predicted
sound events will form part of the scene interpretation as long as
they are consistent with the input sound, regardless of whether direct
evidence is found.  A blackboard-based implementation of this approach
is described which analyzes dense, ambient sound examples into a
vocabulary of noise clouds, transient clicks, and a correlogram-based
representation of wide-band periodic energy called the weft.
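
By contrast, the following toy loop sketches the predict-and-reconcile
control flow (again illustrative Python: the one-profile-per-source
model and the simple consistency test are assumptions, far cruder than
the noise/click/weft vocabulary and blackboard engine just described).
The point is only that a hypothesis persists while it remains
compatible with the input, even when direct evidence for it is masked:

    import numpy as np

    def prediction_driven_casa(spectra, birth=1.0):
        # Internal model: hypothesized sources, each reduced here to a
        # fixed nonnegative spectral profile
        model = []
        for obs in spectra:          # obs: one magnitude-spectrum frame
            pred = np.sum(model, axis=0) if model else np.zeros_like(obs)
            residual = obs - pred
            # Energy the predictions cannot explain spawns a new
            # hypothesis from the positive part of the residual
            if residual.clip(min=0).sum() > birth:
                model.append(residual.clip(min=0))
            # Reconcile: keep each hypothesis while the observation
            # could still contain it, i.e. its prediction nowhere
            # greatly exceeds what was heard; a source masked by louder
            # energy therefore survives without direct evidence
            model = [p for p in model
                     if np.all(p <= obs + 0.5 * (obs.max() + 1e-9))]
        return model                 # surviving source hypotheses

    # A steady hum is briefly masked by a louder burst: the hum
    # hypothesis survives the masking; the burst is dropped when the
    # input no longer supports it.
    hum = np.array([0., 2, 2, 0, 0, 0, 0, 0])
    burst = np.array([0., 5, 5, 5, 5, 0, 0, 0])
    frames = [hum, hum, hum + burst, hum + burst, hum]
    print(len(prediction_driven_casa(frames)))   # -> 1 (just the hum)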

The system is assessed through experiments that first investigate
subjects' perception of distinct events in ambient sound examples, and
then collect quality judgments for sound events resynthesized by the
system.  Although the resyntheses were rated as far from perfect,
there was good agreement between the events detected by the model and
those reported by the listeners.  In addition, the experimental
procedure does not depend on
special aspects of the algorithm (other than the generation of
resyntheses), and is applicable to the assessment and comparison of
other models of human auditory organization.

 - - - - - - - - - - - - - - - - - - - - - - - - - - -