In response to David Huron's interesting description of his project
in computational stream formation: I have been working on a similar
project for a while. To begin with,
I focused on a categorical representation of the acoustic energy,
breaking it up into clearly delimited sinusoid-like components. While
that proved hard and is not yet solved to my satisfaction, I have
recently been considering the higher-level problem of forming these
into sources. The way I am tending to think about it is in terms
of _fusion_ rather than _segregation_, at least in the first instance,
although the distinction is subtle. I am considering fusion because
it seems such a powerful and early process: my first level of analysis
would analyse a periodic stimulus into harmonics and formant bursts,
but it is, of course, very difficult to hear such a stimulus as anything
but a single sound. Put another way, the impression of 'one sound' seems
more solid, more distinct than the impression of 'more than one sound',
so maybe the best way to make a computer recognize the latter condition
is to make it do a good job of recognizing the former, and then see
when it breaks or gets confused. (I see David's use of the Pitch Model
as equivalently placing fusion in the first stage, in that pitch is an
attribute of a fused stimulus.)
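To make that first stage a bit more concrete, here is a rough sketch
(in Python with numpy, purely as a convenient notation) of the sort of
peak-picking-and-tracking front end I have in mind: pick local maxima
from short-time spectra and link them frame-to-frame into sinusoid-like
tracks. The FFT size, hop, magnitude threshold and the 50 Hz continuation
tolerance are arbitrary placeholders, not values I am claiming work well.

    import numpy as np

    def sinusoid_tracks(signal, sr, n_fft=1024, hop=256, thresh_db=-40.0):
        """Pick spectral peaks frame by frame and link them into tracks."""
        window = np.hanning(n_fft)
        tracks = []   # every track started: a list of (frame, freq_hz, mag_db)
        active = []   # tracks that were continued in the previous frame
        for f, start in enumerate(range(0, len(signal) - n_fft, hop)):
            spec = np.fft.rfft(window * signal[start:start + n_fft])
            mag = 20.0 * np.log10(np.abs(spec) + 1e-12)
            # local maxima above threshold are the candidate sinusoid components
            peaks = [k for k in range(1, len(mag) - 1)
                     if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]
                     and mag[k] > thresh_db]
            new_active = []
            for k in peaks:
                freq = k * sr / n_fft
                # greedy continuation: attach the peak to the nearest open track
                best = min(range(len(active)),
                           key=lambda i: abs(active[i][-1][1] - freq),
                           default=None)
                if best is not None and abs(active[best][-1][1] - freq) < 50.0:
                    track = active.pop(best)
                else:
                    track = []          # no plausible continuation: start a new track
                    tracks.append(track)
                track.append((f, freq, mag[k]))
                new_active.append(track)
            active = new_active
        return tracks

Each resulting track is just a list of (frame, frequency, magnitude)
triples; those are the `basic elements' I refer to below.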
Again, as David points out, the problem at the higher level is `filling
in gaps' and making the right kinds of heuristic associations between
data distinct in time and frequency. I am working with simultaneous
use of rules such as harmonicity, common onset and common modulation
to form networks of basic elements with a high likelihood of fusing.
But I am getting a strong sense that to get results of comparable
robustness to human source formation, it is necessary to have very
high-level models (hypotheses) of what the sound `is' or what has
generated it. Such a model ("clarinet playing middle C") can be
strongly supported in one region of time-frequency, and the plausibility
so established can then be used to mop up more weakly-associated data
in a different region. But the problem of acquiring and triggering such
high-level models seems very hard.
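As a toy illustration of the simultaneous-rules idea (again in Python,
and again with made-up weights and tolerances rather than anything I
would defend), every pair of the sinusoid tracks from the sketch above
gets a score combining harmonicity and common onset; pairs that score
highly become edges in a fusion network. Common modulation is left out
here just to keep it short.

    def fusion_score(track_a, track_b, onset_tol=3, ratio_tol=0.03):
        """Combine two grouping cues into a single pairwise fusion score."""
        fa, fb = track_a[0][1], track_b[0][1]   # frequency at each track's onset
        lo, hi = sorted((fa, fb))
        ratio = hi / lo
        nearest = round(ratio)
        # harmonicity: frequency ratio close to a small whole number
        harmonicity = 1.0 if abs(ratio - nearest) < ratio_tol * nearest else 0.0
        # common onset: tracks that begin within a few frames of each other
        common_onset = 1.0 if abs(track_a[0][0] - track_b[0][0]) <= onset_tol else 0.0
        return 0.5 * harmonicity + 0.5 * common_onset

    def fusion_network(tracks, threshold=0.5):
        """Edges between pairs of tracks whose combined cue score suggests fusion."""
        edges = []
        for i in range(len(tracks)):
            for j in range(i + 1, len(tracks)):
                s = fusion_score(tracks[i], tracks[j])
                if s >= threshold:
                    edges.append((i, j, s))
        return edges

The connected components of such a network would be the candidate fused
sources; the mopping-up I describe above would then go back over the
weakly-connected tracks with whatever high-level hypothesis the
strongly-connected core suggests, and that is the part I do not yet
know how to do.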
On the acoustic/auditory stream debate: I think that auditory streams
are the only ones worth worrying about. I don't really see acoustic
streams as being well defined, since the physical origin of different
components of a sound can be arbitrarily close. The sound of a guitar
string being plucked consists of an initial transient of the pluck-click,
perhaps mainly radiated from the pick, followed by the periodic oscillation
of the string: are these separate acoustic sources? For me, the only
interesting definition of a source is the psychological one, i.e. an
ensemble of acoustic energy perceived as a single event or entity.
I'd be very interested in any comments from readers of AUDITORY.
DAn Ellis, MIT Media Lab Perceptual Computing Group.
<dpwe@media.mit.edu>