Re: granular synthesis and auditory segmentation
John K. Bates wrote:
> What I discovered was that a halfwave of a given duty ratio can produce
> an auditory sensation (Let's call it timbre.) that can be identified in a
> phonetic classification, even if it is from a non-speech source.
> Therefore, by creating halfwaves having known timbre classifications and
> concatenating them according to phonological rules I found it possible to
> generate speech and other sounds. [...] As a backup for
> evaluating the ear's perception I made spectrograms of both the original
> and reconstructed utterances. They had broadly similar patterns, but here
> the ear seems more tolerant than the eye.
If I understand your processing correctly, you retain some constant
proportion of successive half-waves, so that much of the correlation
structure between the half-waves is preserved, and with it most of what
is important to the auditory system.
When a receptor system phase-locks to the stimulus, the timings of
discharges more or less faithfully replicate the time structure of your
half-waves, and the all-order intervals that are produced will look
roughly like the autocorrelation of those half-waves.
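As a quick numerical sketch of that claim (mine, not part of the original
exchange; the Poisson-like firing model, the toy stimulus, and all of the
parameters are assumptions chosen only for illustration), one can simulate
spikes whose firing probability follows the half-wave rectified stimulus and
check that the all-order interval histogram peaks at roughly the same lags
as the stimulus autocorrelation:

  import numpy as np

  fs = 20000.0                                # sample rate (Hz); illustrative
  t = np.arange(0, 0.5, 1.0 / fs)
  x = np.sin(2*np.pi*200*t) + 0.5*np.sin(2*np.pi*600*t)    # toy stimulus

  # Crude phase-locked firing: spike probability per sample proportional
  # to the half-wave rectified stimulus (assumption, for illustration only)
  p = 0.2 * np.maximum(x, 0.0) / np.max(x)
  spikes = t[np.random.rand(len(t)) < p]      # spike times (s)

  # All-order interspike intervals (every spike paired with every later spike)
  d = spikes[None, :] - spikes[:, None]
  intervals = d[(d > 0) & (d < 0.015)]

  # Compare: interval histogram vs. stimulus autocorrelation over 2-15 ms lags
  nlag = int(0.015 * fs)
  hist, edges = np.histogram(intervals, bins=nlag, range=(0, 0.015))
  acf = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(nlag)])
  skip = int(0.002 * fs)                      # ignore the very shortest lags
  print(edges[skip + np.argmax(hist[skip:])]) # ~0.005 s (the 200 Hz period)
  print((skip + np.argmax(acf[skip:])) / fs)  # ~0.005 s as well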
As you say, the power spectra of the unprocessed and the processed sound
are broadly similar, but subjectively the sounds don't sound as different
as they look from their power spectra. This is all very subjective,
obviously, but the thing that strikes me when I compare autocorrelation
functions is the similarity of the half-wave rectified and the unprocessed
waveforms, particularly at the peaks. So if the auditory system used
all-order interval information to do frequency- and form-analysis
(as I think the work of Goldstein & Srulovicz, Moore, and many others
suggests), then one might expect that these kinds of processing would
not drastically alter the percepts.
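To make that comparison concrete, here is a small sketch (my construction,
not from the post; the vowel-like test signal and its 120 Hz fundamental are
arbitrary choices) comparing the normalized autocorrelation of a harmonic
waveform with that of its half-wave rectified version; the major peaks fall
at essentially the same lags:

  import numpy as np

  fs = 16000.0
  t = np.arange(0, 0.3, 1.0 / fs)
  # Vowel-like test signal: a few harmonics of a 120 Hz fundamental (assumed)
  x = sum(a * np.sin(2*np.pi*120*k*t)
          for k, a in [(1, 1.0), (2, 0.7), (3, 0.5), (4, 0.3)])
  hw = np.maximum(x, 0.0)                     # half-wave rectified version

  def acf(sig, nlag):
      sig = sig - sig.mean()
      r = np.array([np.dot(sig[:len(sig) - k], sig[k:]) for k in range(nlag)])
      return r / r[0]

  nlag = int(0.02 * fs)                       # 20 ms of lags
  r_x, r_hw = acf(x, nlag), acf(hw, nlag)

  # Both peak near the 1/120 s pitch period (about 133 samples at 16 kHz)
  skip = int(0.002 * fs)                      # ignore the peak at lag zero
  print(skip + np.argmax(r_x[skip:]), skip + np.argmax(r_hw[skip:]))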
Broadly speaking, one general alternative to place-based representation
of stimulus spectra is the population-interval distribution or
summary autocorrelation function, i.e. the global all-order interval
distribution taken over an entire auditory population (over all CF's).
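For anyone unfamiliar with the construct, a rough operational sketch (mine;
a bank of Butterworth band-pass filters standing in for cochlear filters,
and half-wave rectification standing in for phase-locked firing, both of
which are simplifying assumptions) is: autocorrelate the output of each
frequency channel and sum the results across all channels.

  import numpy as np
  from scipy.signal import butter, sosfiltfilt

  fs = 16000.0
  t = np.arange(0, 0.3, 1.0 / fs)
  x = sum(np.sin(2*np.pi*150*k*t) for k in range(1, 6))   # 150 Hz complex tone

  # Stand-in for cochlear filtering: log-spaced band-pass channels (assumption)
  centers = np.geomspace(100, 4000, 16)

  def channel(sig, cf):
      sos = butter(2, [cf/1.3, cf*1.3], btype="bandpass", output="sos", fs=fs)
      return sosfiltfilt(sos, sig)

  def acf(sig, nlag):
      return np.array([np.dot(sig[:len(sig) - k], sig[k:]) for k in range(nlag)])

  # Summary autocorrelation: per-channel ACFs of the half-wave rectified
  # channel outputs (crude stand-in for phase-locking), summed over channels
  nlag = int(0.02 * fs)
  sacf = sum(acf(np.maximum(channel(x, cf), 0.0), nlag) for cf in centers)

  skip = int(0.002 * fs)
  print((skip + np.argmax(sacf[skip:])) / fs)   # ~1/150 s, the pitch period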
> In discussing grain size and spectral features, it would seem that
> neither a Gabor-limited nor a halfwave grain could have useful spectral
> shape other than a measure of bandwidth. And if this is true, we next have
> to account for how a spectrum-analyzing cochlea can provide information
> that allows the ear to find instantaneous meaning in bandwidth
> simultaneously with narrowband harmonic analysis. (timbre and pitch?...and
> what about whispers?) Obviously, while this might be too difficult a task
> for the cochlea, the deus ex machina in the brain can solve the problem.
The most basic issue is whether the cochlea itself functions as a
frequency analyzer whose output is channel-coded -- i.e. the auditory
system in one way or another reads off which particular frequency
channels are activated how much -- or whether the cochlea instead
provides a rich set of filters that produce time-structured outputs
that carry the information for auditory analysis in their time
structure, via the phase-locked time structure of auditory nerve fiber
discharges.
(The functional role of the cochlea in auditory perception
depends upon how the receiver -- the auditory CNS, a.k.a.
deus ex machina -- interprets its output --
what aspects of the output are used in what ways to
subserve particular perceptual distinctions.)
In the correlational view, having all of those tuned filters makes the
system much more robust in the face of background noise and multiple
competing auditory objects. Here the tuning of the filters is not
what confers upon the system its fabulous frequency selectivity
(at least not in the absence of noise or competing sounds) -- that
comes from the precision of phase-locking, which is jitter-limited.
The tuning of the filters confers upon the system the ability to
detect and isolate faint auditory objects in the face of competing
sounds, by ensuring that no one sound completely dominates the time
structure of the auditory nerve array output.
Arguably, this is why Shannon's reduced spectral resolution speech is
both as intelligible as it is in quiet and as unintelligible as it is
when competing sounds are introduced. It may also be related to why
cochlear implant users do as well as they do with only a few channels,
and to what the advantages of extra channels (beyond 4 or 5) might be --
while added channels might not improve reception of speech in quiet
much, they might play a more critical role for receiving speech in noise.
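For readers who don't know the manipulation: Shannon and colleagues'
reduced-spectral-resolution speech divides the signal into a few broad
bands, extracts each band's amplitude envelope, and uses the envelopes to
modulate band-limited noise. A very rough sketch of that idea (mine, not
their exact procedure; the band edges, filter orders, and envelope cutoff
are arbitrary assumptions):

  import numpy as np
  from scipy.signal import butter, sosfiltfilt, hilbert

  def noise_vocode(x, fs, n_bands=4, lo=100.0, hi=4000.0):
      """Keep only n_bands of envelope information from x (crude sketch)."""
      edges = np.geomspace(lo, hi, n_bands + 1)
      smooth = butter(2, 160.0, btype="lowpass", output="sos", fs=fs)
      out = np.zeros(len(x))
      for b_lo, b_hi in zip(edges[:-1], edges[1:]):
          sos = butter(3, [b_lo, b_hi], btype="bandpass", output="sos", fs=fs)
          band = sosfiltfilt(sos, x)
          env = sosfiltfilt(smooth, np.abs(hilbert(band)))     # band envelope
          carrier = sosfiltfilt(sos, np.random.randn(len(x)))  # band-limited noise
          out += env * carrier
      return out / (np.max(np.abs(out)) + 1e-12)

  # Usage (hypothetical file name):
  #   fs, x = scipy.io.wavfile.read("speech.wav")
  #   y = noise_vocode(x.astype(float), fs)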
If one thinks about bat echolocation in this way, then one looks
to the low-frequency time structure of the modulations within
high-frequency cochlear channels (e.g. the low-frequency beating of a
cry with its Doppler-shifted echo encodes relative velocity)
to confer upon the system its precision.
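To put rough numbers on that (mine, purely illustrative): a cry at f0
returning from a target closing at velocity v comes back shifted by about
2*v*f0/c, and mixing the cry with its echo produces a beat at that
difference frequency.

  # Two-way Doppler shift and the resulting beat frequency (illustrative values)
  c = 343.0        # speed of sound in air, m/s
  f0 = 80_000.0    # emitted cry frequency, Hz
  v = 2.0          # closing velocity between bat and target, m/s

  f_echo = f0 * (1 + 2 * v / c)    # approximation valid for v << c
  print(f_echo - f0)               # ~933 Hz beat: low-frequency structure
                                   # within a high-frequency cochlear channel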
> As for general use in granular synthesis, my experience suggests that
> with their near-minimum time intervals halfwaves have advantages over the
> conventional grain method with their simplicity and greater flexibility.
> Using nothing but halfwaves controlled by their duty ratios and time
> epochs, I have done a few experiments synthesizing speech including
> fricatives, stops, vowels, and variable pitch. Although the timbre was
> rough, it was intelligible. Since the method can give good stop
> consonants, I think that a granular approach to speech synthesis could
> improve on what is currently available.
> By the way, isn't it interesting that, along with the segments, my
> granular analysis algorithm gets the pitch, envelope, and V/UV? It also
> gets direction of arrival. All done without a spectrum analyzer.
A full-blown autocorrelation-based theory of speech coding is possible
that describes periodic and aperiodic speech sounds in terms of
running autocorrelation structure. I think that this kind of theory
would provide a way of thinking about some of the neural correlates
of your experiments in granular synthesis.
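A minimal sketch of what that could look like operationally (my
construction; the frame length, hop, search range, and voicing threshold
are all arbitrary assumptions): a frame-by-frame autocorrelation whose main
non-zero-lag peak tracks pitch in periodic frames, with the peak's height
relative to lag zero serving as a crude voiced/unvoiced decision -- which
also connects to your observation that a granular analysis delivers pitch,
envelope, and V/UV along the way.

  import numpy as np

  def running_acf_features(x, fs, frame=0.03, hop=0.01, voiced_thresh=0.3):
      """Return (time, pitch estimate or None, voiced flag) per frame."""
      n, h = int(frame * fs), int(hop * fs)
      min_lag, max_lag = int(fs / 400), int(fs / 60)    # search 60-400 Hz
      feats = []
      for start in range(0, len(x) - n, h):
          seg = x[start:start + n]
          seg = seg - np.mean(seg)
          r = np.array([np.dot(seg[:n - k], seg[k:]) for k in range(max_lag + 1)])
          if r[0] <= 0:                                 # silent frame
              feats.append((start / fs, None, False))
              continue
          r = r / r[0]
          lag = min_lag + int(np.argmax(r[min_lag:max_lag + 1]))
          voiced = r[lag] > voiced_thresh               # strong periodicity
          feats.append((start / fs, fs / lag if voiced else None, voiced))
      return feats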
Correlation-based models were around in the
1950's before correlational analysis was supplanted by digital signal
processing and the FFT. We tend to reflexively think that Fourier-
and correlation-based signal processing approaches are the
same because they are mathematically translatable in the limit,
but the neural representations and analyses that are most
naturally and directly associated with the two are
very different (channel-based excitation profiles
vs. temporal correlation patterns).
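The translatability referred to here is essentially the Wiener-Khinchin
relation: the power spectrum of a signal is the Fourier transform of its
autocorrelation. A quick numerical check (a sketch; the circular form of
the autocorrelation is used so that the identity is exact):

  import numpy as np

  x = np.random.randn(1024)
  power = np.abs(np.fft.fft(x)) ** 2                    # power spectrum

  # Circular autocorrelation computed directly in the time domain
  acf = np.array([np.dot(x, np.roll(x, -k)) for k in range(len(x))])

  # Wiener-Khinchin: the FFT of the autocorrelation equals the power spectrum
  print(np.allclose(np.fft.fft(acf).real, power))       # True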
Peter Cariani