
Re: Granular synthesis and auditory segmentation



Peter,

Thank you for pasting M. Didier's note along with yours; it gives me
the opportunity to discuss both of your responses in this single
email.

>I don't know about your neurons, but mine completely fail
>to replenish their synapses above about 1 or 2 kHz even
>after plenty of coffee.
        ... Actually, many physiology books discuss the refractory
        period (the ability to replenish chemical balance) of
        cochlear neurons as operating up to about 5 kHz.
        ... A detail readily confirmed by the literature.
        ... Given that most telephony systems run 300 Hz to 3 kHz,
        the 5 kHz refractory-period limit does well in practical situations!
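
        ... A back-of-the-envelope check of these figures (a purely
        illustrative Python sketch; the 5 kHz ceiling and the telephony
        band are the numbers quoted above, nothing more):

        max_rate_hz = 5_000.0                   # ~5 kHz firing ceiling
        refractory_s = 1.0 / max_rate_hz        # -> 200 microseconds
        print(f"refractory period ~ {refractory_s * 1e6:.0f} us")

        telephony_band_hz = (300.0, 3_000.0)    # standard telephony passband
        print(all(f <= max_rate_hz for f in telephony_band_hz))   # True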


>Of course there is a role for non-Fourier type processing too, but
>no simple scheme covers the entire audible [20 Hz, 20 kHz] range.
        ... True, this simple scheme merely covers all of speech
        communications.
        ... Singing (resonant redundancies) and mechanical
        vibrations in string and wind instruments have other cues.


>Didier Depireux clarified the issue very nicely:
>
>> The half-wave rectification occurs _after_ the frequency
>> decomposition performed on the basilar membrane, i.e.
>> after you have decomposed the signal into frequency channels.
        ... Ah, the Place Theory!
        ... If you truly believe in Fourier analysis then you also believe
        in the inverse transform.  However, the local, resonant response
        of a stretched membrane (the Place Theory) is only useful for a
        sinusoidal drive.  Speech presents complex structures in time,
        and a resonant response at any GIVEN time (at a PLACE) on the
        basilar membrane is NOT the same thing as a FULL spectral
        analysis, which produces amplitude and PHASE information at
        MANY "frequencies" such that an inverse transform is possible.
        ... It is also a well-known fact that (binaural) localization has a
        10-microsecond resolution (1 to 2 spatial degrees) and that this
        resolution is crucial to explaining the Cocktail Party Effect.
        ... A 10-microsecond resolution implies a Fourier sample window
        of approximately the same size, which in turn implies a "spectral"
        resolution of 100 kHz, i.e., quite useless if acoustic analysis is
        the goal.
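
        ... To make the time-frequency trade-off concrete, a minimal
        numpy sketch (illustrative only; it simply applies the usual rule
        that spectral resolution is roughly the reciprocal of the window
        length):

        import numpy as np

        # DFT bin spacing is roughly the reciprocal of the window length.
        def bin_spacing_hz(window_s):
            return 1.0 / window_s

        print(bin_spacing_hz(10e-6))  # 10 us window -> 100000.0 Hz (100 kHz)
        print(bin_spacing_hz(10e-3))  # 10 ms window -> 100.0 Hz

        # Concretely: a 10 us window at a (hypothetical) 1 MHz sampling
        # rate holds only 10 samples, so its DFT bins are 100 kHz apart.
        fs = 1_000_000
        n = int(round(10e-6 * fs))
        print(np.fft.rfftfreq(n, d=1 / fs))   # 0, 100 kHz, ..., 500 kHz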


>the "textural aspect" (i.e., the patterning)
>in sound textures perceptually relatively invariant to their
>position in the time-frequency plane with a typical [0s, 1s]
>by [500 Hz, 5 kHz] area. That is also what I would want, since
>it keeps the perceptual qualities of overall time-frequency
>"position" and sound "pattern" largely independent, just as
>in vision the texture of an object doesn't appear to change
>and interfere with position of that object in the visual field.
>
>Of course the "art" is to optimize this preservation of
>invariants in the cross-modal mapping, while maximizing
>resolution and ease of perception (including "proper"
>grouping and segregation, possibly by manipulating the
>sound textures).
        ... I also work with graphical speech patterns.
        ... But my structures are time-locked to SOURCES in the
        acoustic environment.
        ... And Self-Organizing Neural Maps detect and classify
        these structures (see the sketch below).
        ... You must solve the SOURCE localization problem before
        ANY (source) analysis is EVER attempted.
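
        ... For readers who have not met the term, the sketch below is a
        generic self-organizing map in Python/numpy, with made-up toy data
        and map size; it illustrates the detect-and-classify idea only and
        is not the system described here:

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy "feature vectors" standing in for time-locked acoustic
        # patterns; purely synthetic.
        data = rng.normal(size=(500, 8))

        # A small 6x6 map of weight vectors over 8-dimensional inputs.
        rows, cols, dim = 6, 6, data.shape[1]
        weights = rng.normal(size=(rows, cols, dim))
        grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                    indexing="ij"), axis=-1)

        for epoch in range(20):
            lr = 0.5 * (1 - epoch / 20)            # decaying learning rate
            sigma = 0.5 + 3.0 * (1 - epoch / 20)   # decaying neighborhood
            for x in data:
                # Best-matching unit: node whose weights are closest to x.
                bmu = np.unravel_index(
                    np.argmin(np.linalg.norm(weights - x, axis=-1)),
                    (rows, cols))
                # Pull the BMU and its grid neighbors toward the input.
                g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1)
                           / (2 * sigma ** 2))
                weights += lr * g[..., None] * (x - weights)

        # After training, a pattern is classified by its best-matching unit.
        x = data[0]
        bmu = np.unravel_index(
            np.argmin(np.linalg.norm(weights - x, axis=-1)), (rows, cols))
        print("sample 0 maps to node", bmu)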

        ... I wish to make this point emphatically - one can ONLY
        analyze a SOURCE and, before one CAN do SOURCE analysis,
        one MUST isolate the information from THAT source.

        ... This is precisely what occurs during the Cocktail Party Effect,
        i.e., a SOURCE is isolated and analysis is focused on the
        information produced by THAT source.
        ... Spectral analysis of an acoustic point in space is essentially
        useless, since the resultant frequencies to be ASSOCIATED with
        a PARTICULAR SOURCE remain unknown!
        ... But this result is to be expected, as the typical FFT sample
        window of 10 milliseconds is 1,000 times larger than the raw
        human localization resolution of 10 microseconds, i.e., the 1 to 2
        spatial degrees of SOURCE resolution are completely buried -
        actually, it is more appropriate to say that spatial resolution has
        been lost in the 10-millisecond AVERAGING process used to
        calculate the "spectral" components.
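
        ... To illustrate the two time scales being compared, a small numpy
        sketch (synthetic signals, an arbitrary 100 kHz sampling rate chosen
        so that one sample is 10 microseconds; it is not a model of the
        auditory system):

        import numpy as np

        rng = np.random.default_rng(1)

        fs = 100_000                    # 100 kHz sampling -> 10 us per sample
        t = np.arange(int(0.05 * fs)) / fs

        # One broadband source, arriving at the "right ear" 3 samples
        # (30 us) after the "left ear".
        src = rng.normal(size=t.size)
        delay_samples = 3
        left = src
        right = np.concatenate([np.zeros(delay_samples), src[:-delay_samples]])

        # Cross-correlation recovers the interaural delay to one sample,
        # i.e. to 10 us at this sampling rate.
        lags = np.arange(-20, 21)
        xcorr = [np.dot(left, np.roll(right, -k)) for k in lags]
        itd_us = lags[int(np.argmax(xcorr))] / fs * 1e6
        print(f"estimated ITD: {itd_us:.0f} us")      # ~30 us

        # A 10 ms FFT window (1,000 samples here) spans 1,000 such 10 us
        # intervals; its magnitude spectrum averages over them and says
        # nothing about which source a given frequency component came from.
        window = left[:int(0.01 * fs)]
        spectrum = np.abs(np.fft.rfft(window))
        print(spectrum.shape)                         # 501 bins, 100 Hz apart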


Rich Fabbri
