Re: granular synthesis and auditory segmentation (Peter Cariani)


Subject: Re: granular synthesis and auditory segmentation
From:    Peter Cariani  <peter(at)epl.meei.harvard.edu>
Date:    Fri, 16 Oct 1998 14:12:53 +0100

John K. Bates wrote:
> What I discovered was that a halfwave of a given duty ratio can produce
> an auditory sensation (Let's call it timbre.) that can be identified in a
> phonetic classification, even if it is from a non-speech source.
> Therefore, by creating halfwaves having known timbre classifications and
> concatenating them according to phonological rules I found it possible to
> generate speech and other sounds. [...] As a backup for
> evaluating the ear's perception I made spectrograms of both the original
> and reconstructed utterances. They had broadly similar patterns, but here
> the ear seems more tolerant than the eye.

If I understand your processing correctly, you retain some constant
proportion of successive half-waves, so that a great deal of correlation
structure between the half-waves is preserved -- and that structure retains
most of what is important to the auditory system. When a receptor system
phase-locks to the stimulus, the timings of discharges more or less
faithfully replicate the time structure of your half-waves, and the
all-order interspike intervals that are produced look, more or less, like
the autocorrelation of your half-waves. As you say, the power spectra of the
unprocessed and processed sounds are broadly similar, yet subjectively the
sounds differ even less than their power spectra do. This is all very
subjective, obviously, but the thing that strikes me when I compare
autocorrelation functions is the similarity between those of the half-wave
rectified and the unprocessed waveforms, particularly at the peaks. So if
the auditory system uses all-order interval information to do frequency- and
form-analysis (as I think the work of Goldstein & Srulovicz, Moore, and many
others suggests), then one might expect that these kinds of processing would
not drastically alter the percepts.

Broadly speaking, one general alternative to place-based representation of
stimulus spectra is the population-interval distribution, or summary
autocorrelation function: the global all-order interval distribution taken
over an entire auditory population (over all CFs).

> In discussing grain size and spectral features, it would seem that
> neither a Gabor-limited nor a halfwave grain could have useful spectral
> shape other than a measure of bandwidth. And if this is true, we next have
> to account for how a spectrum-analyzing cochlea can provide information
> that allows the ear to find instantaneous meaning in bandwidth
> simultaneously with narrowband harmonic analysis. (timbre and pitch?...and
> what about whispers?) Obviously, while this might be too difficult a task
> for the cochlea, the deus ex machina in the brain can solve the problem.

The most basic issue is whether the cochlea itself functions as a frequency
analyzer whose output is channel-coded -- i.e. the auditory system in one
way or another reads off which particular frequency channels are activated,
and how much -- or whether the cochlea instead provides a rich set of
filters whose time-structured outputs carry the information for auditory
analysis, via the phase-locked time structure of auditory nerve fiber
discharges. (The functional role of the cochlea in auditory perception
depends upon how the receiver -- the auditory CNS, a.k.a. your deus ex
machina -- interprets its output: which aspects of the output are used, and
in what ways, to subserve particular perceptual distinctions.)
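To make the population-interval idea concrete, here is a toy sketch of the
computation (only a sketch: the Butterworth bands stand in arbitrarily for
cochlear filters, and every parameter here is merely illustrative):

    import numpy as np
    from scipy.signal import butter, sosfilt

    def summary_autocorrelation(x, fs, n_channels=16, max_lag_ms=15.0):
        # Toy population-interval distribution: bandpass channels stand
        # in for cochlear filters, half-wave rectification for receptor
        # transduction, and each channel's autocorrelation for its
        # all-order interspike-interval distribution.
        cfs = np.geomspace(100.0, 4000.0, n_channels)  # channel CFs, Hz
        max_lag = int(fs * max_lag_ms / 1000.0)
        summary = np.zeros(max_lag)
        for cf in cfs:
            sos = butter(4, [cf / 1.2, cf * 1.2], btype='band', fs=fs,
                         output='sos')
            y = np.maximum(sosfilt(sos, x), 0.0)       # half-wave rectify
            ac = np.correlate(y, y, mode='full')[len(y)-1 : len(y)-1+max_lag]
            summary += ac / (ac[0] + 1e-12)            # lag-0 normalized
        return summary

    # The largest non-zero-lag peak of the summary sits at the period of
    # a 200-Hz square wave: fs/200 = 80 samples (5 ms).
    fs = 16000
    t = np.arange(fs // 10) / fs
    x = np.sign(np.sin(2 * np.pi * 200.0 * t))
    s = summary_autocorrelation(x, fs)
    print(np.argmax(s[20:]) + 20)   # expect ~80

The point of the sketch is only that the pitch-period peak emerges from the
pooled intervals without any channel ever being read off as a place label.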
In the correlational view, having all of those tuned filters makes the
system much more robust in the face of background noise and multiple
competing auditory objects. Here the tuning of the filters is not what
confers on the system its fabulous frequency selectivity (at least not in
the absence of noise or competing sounds) -- that comes from the precision
of phase-locking, which is jitter-limited. What the tuning of the filters
confers is the ability to detect and isolate faint auditory objects amid
competing sounds, by ensuring that no one sound completely dominates the
time structure of the auditory nerve array's output. Arguably, this is why
Shannon's reduced-spectral-resolution speech is both as intelligible as it
is in quiet and as unintelligible as it is when competing sounds are
introduced. It may also be related to why cochlear implant users do as well
as they do with only a few channels, and to what the advantages of extra
channels (beyond 4 or 5) might be: while added channels might not much
improve reception of speech in quiet, they might play a more critical role
for receiving speech in noise. If one thinks about bat echolocation in this
way, then one looks to the low-frequency time structure of the modulations
within high-frequency cochlear channels to confer upon the system its
precision: the low-frequency beating of a cry against its Doppler-shifted
echo encodes relative velocity (for a 60-kHz cry and a target closing at
1 m/s, the beat lies at roughly 2v/c * f, about 350 Hz -- comfortably
within the range of neural phase-locking).

> As for general use in granular synthesis, my experience suggests that
> with their near-minimum time intervals halfwaves have advantages over the
> conventional grain method with their simplicity and greater flexibility.
> Using nothing but halfwaves controlled by their duty ratios and time
> epochs, I have done a few experiments synthesizing speech including
> fricatives, stops, vowels, and variable pitch. Although the timbre was
> rough, it was intelligible. Since the method can give good stop
> consonants, I think that a granular approach to speech synthesis could
> improve on what is currently available.
>
> By the way, isn't it interesting that, along with the segments, my
> granular analysis algorithm gets the pitch, envelope, and V/UV? It also
> gets direction of arrival. All done without a spectrum analyzer.

A full-blown autocorrelation-based theory of speech coding is possible, one
that describes periodic and aperiodic speech sounds alike in terms of
running autocorrelation structure. I think this kind of theory would provide
a way of thinking about some of the neural correlates of your experiments in
granular synthesis. Correlation-based models were around in the 1950s,
before correlational analysis was supplanted by digital signal processing
and the FFT. We tend reflexively to think that Fourier- and
correlation-based signal processing approaches are the same because they are
mathematically interchangeable in the limit, but the neural representations
and analyses most naturally and directly associated with the two are very
different: channel-based excitation profiles versus temporal correlation
patterns.
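By running autocorrelation I mean something like the following toy
computation (again only a sketch; window, hop, and lag range are arbitrary
illustrative choices): a short-term autocorrelation taken over a sliding
window, in which voiced frames show a stable ridge at the pitch period and
fricative-like frames show none.

    import numpy as np

    def running_autocorrelation(x, fs, win_ms=25.0, hop_ms=10.0,
                                max_lag_ms=15.0):
        # One lag-0-normalized short-term autocorrelation per frame;
        # the columns over time form a rough "correlogram" of the signal.
        win = int(fs * win_ms / 1000.0)
        hop = int(fs * hop_ms / 1000.0)
        max_lag = int(fs * max_lag_ms / 1000.0)
        frames = []
        for start in range(0, len(x) - win, hop):
            seg = x[start:start + win] * np.hanning(win)
            ac = np.correlate(seg, seg, mode='full')[win-1 : win-1+max_lag]
            frames.append(ac / (ac[0] + 1e-12))
        return np.array(frames).T   # shape: (max_lag, n_frames)

    # Voiced frames show a stable ridge at the 150-Hz pitch period
    # (~107 samples at fs = 16 kHz); the noise frames show none.
    fs = 16000
    t = np.arange(fs // 2) / fs
    voiced = np.sin(2 * np.pi * 150.0 * t)    # crude "vowel"
    noise = 0.3 * np.random.randn(fs // 2)    # crude "fricative"
    R = running_autocorrelation(np.concatenate([voiced, noise]), fs)
    print(R.shape)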
Peter Cariani