In response to David Huron's interesting description of his project
in computational stream formation:

I have been working on a similar project for a while.  To begin with,
I focused on a categorical representation of the acoustic energy,
breaking it up into clearly delimited sinusoid-like components.  While
that proved hard and is not yet solved to my satisfaction, I have
recently been considering the higher-level problem of forming these
into sources.   The way I am tending to think about it is in terms
of _fusion_ rather than _segregation_, at least in the first instance,
although the distinction is subtle.  I am considering fusion because
it seems such a powerful and early process: my first level of analysis
would analyse a periodic stimulus into harmonics and formant bursts,
but it is, of course, very difficult to hear such a stimulus as anything
but a single sound.  Put another way, the impression of 'one sound' seems
more solid, more distinct than the impression of 'more than one sound',
so maybe the best way to make a computer recognize the latter condition
is to make it do a good job of recognizing the former, and then see
when it breaks or gets confused.  (I see David's use of the Pitch Model as
equivalently placing fusion in the first stage, in that pitch is an
attribute of a fused stimulus.)
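
To make that concrete, here is a toy sketch -- not my actual code, just an
illustration in Python -- of the kind of 'one sound' test I have in mind,
assuming a fundamental estimate is already in hand (say, from a pitch
model).  The function name, tolerance and example figures are arbitrary.

    import numpy as np

    def harmonic_fit(partial_hz, f0_hz, tolerance=0.03):
        """How well does each measured partial fit a single harmonic
        series on f0_hz?  Returns per-partial relative deviation and
        the indices of partials that refuse to fuse under that
        single-source hypothesis."""
        partial_hz = np.asarray(partial_hz, dtype=float)
        # Nearest harmonic number for each partial (never below 1).
        harmonic_number = np.maximum(np.rint(partial_hz / f0_hz), 1.0)
        deviation = np.abs(partial_hz - harmonic_number * f0_hz) / partial_hz
        outliers = np.flatnonzero(deviation > tolerance)
        return deviation, outliers

    # Harmonics of 220 Hz plus one intruder at 370 Hz: the intruder is
    # exactly the component on which 'one sound' breaks down.
    deviation, outliers = harmonic_fit([220, 440, 370, 660, 880], f0_hz=220)
    print(outliers)   # -> [2], the 370 Hz component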

Again, as David points out, the problem at the higher level is `filling
in gaps' and making the right kinds of heuristic associations between
data distinct in time and frequency.  I am working with simultaneous
use of rules such as harmonicity, common onset and common modulation
to form networks of basic elements with a high likelihood of fusing.
But I am getting a strong sense that to get results of comparable
robustness to human source formation, it is necessary to have very
high level models (hypotheses) of what the sound `is' or what has
generated it.  Such a model ("clarinet playing middle C") can be
strongly supported in one region of time-frequency, and can then use the
plausibility so established to mop up more weakly-associated data in
a different region.  But the problem of acquiring and triggering such
high level models seems very hard.
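
Again only as a toy sketch (the thresholds, the two-cue criterion and the
crude integer-ratio harmonicity test below are illustrative assumptions,
not the actual rule set), the cue-combination stage might look like this:
each element carries an onset time and a frequency track, pairs of elements
are scored on harmonicity, common onset and common modulation, and elements
sharing enough cues are linked into a network.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Element:
        onset_s: float         # estimated onset time of the component
        freq_hz: np.ndarray    # frequency track, one value per frame

    def shared_cues(a, b, onset_window_s=0.03, harmonic_tol=0.03):
        """Count how many grouping cues elements a and b share (0..3)."""
        # Common onset: beginnings within a few tens of milliseconds.
        onset = abs(a.onset_s - b.onset_s) < onset_window_s
        # Harmonicity, crudely: mean frequencies near a small integer
        # ratio (a real rule would look for a common fundamental).
        ratio = float(np.mean(b.freq_hz) / np.mean(a.freq_hz))
        ratio = max(ratio, 1.0 / ratio)
        harmonic = abs(ratio - round(ratio)) < harmonic_tol * ratio
        # Common modulation: frequency excursions that rise and fall
        # together, measured as correlation of frame-to-frame changes.
        n = min(len(a.freq_hz), len(b.freq_hz))
        da = np.diff(a.freq_hz[:n]) / np.mean(a.freq_hz)
        db = np.diff(b.freq_hz[:n]) / np.mean(b.freq_hz)
        modulated = (n > 2 and da.std() > 0 and db.std() > 0
                     and np.corrcoef(da, db)[0, 1] > 0.7)
        return int(onset) + int(harmonic) + int(modulated)

    def form_networks(elements, min_cues=2):
        """Link every pair sharing at least min_cues cues; return the
        connected groups (candidate fused sources) as index lists."""
        n = len(elements)
        parent = list(range(n))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i in range(n):
            for j in range(i + 1, n):
                if shared_cues(elements[i], elements[j]) >= min_cues:
                    parent[find(i)] = find(j)
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

The partials of a vibrato'd harmonic tone share all three cues and collapse
into a single network; a partial with a different onset and independent
modulation is left on its own, which is where segregation falls out of a
fusion-first account.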

On the acoustic/auditory stream debate:  I think that auditory streams
are the only ones worth worrying about.  I don't really see acoustic
streams as being well defined, since the physical origin of different
components of a sound can be arbitrarily close.  The sound of a guitar
string being plucked consists of an initial transient of the pluck-click,
perhaps mainly radiated from the pick, followed by the periodic oscillation
of the string: are these separate acoustic sources?  For me, the only
interesting definition of a source is the psychological one, i.e. an
ensemble of acoustic energy perceived as a single event or entity.

I'd be very interested in any comments from readers of AUDITORY.

  DAn Ellis,  MIT Media Lab Perceptual Computing Group.
  <dpwe@media.mit.edu>