Re: CASA problems and solutions ("John K. Bates")


Subject: Re: CASA problems and solutions
From:    "John K. Bates"  <jkbates(at)COMPUTER.NET>
Date:    Tue, 30 Jan 2001 18:15:19 -0500

Dear Al and List,

In return, I appreciate your feedback on my essay. It gives me some feeling for the points that need further explanation. I'll address a few that you mentioned, but I can see that I really need to finish the second part of the essay, which will supply more detail on how I handle the higher levels of perception using the "waveform information vector" (WIV) concept.

Briefly, this process breaks the waveform into particles of information that can be manipulated according to common features, of which, at the input level, instantaneous direction of arrival (DOA) is the predominant one. After being sorted into coarsely defined streams by DOA, the particles may then be further separated according to their temporal features (instead of spectral features). From this point, the streams are further refined in subsequent stages according to their increasing value, in terms of their relative need for attention given the auditory requirements for aiding survival.

Obviously, DOA is not a prerequisite for sorting sounds and removing reverberations, but it helps a lot. (Consider the telephone.) The key point is that with the high time resolution of the WIVs, the sorting of overlapping source streams is greatly improved over what can be done with spectral processing. This point can be illustrated by an experiment in which I separate the voice from the pulse train _without DOA selection_ by selecting only the voice's periodicity spectrum. I will see if I have the space on my site to upload it.

>It is an excellent beginning. The reason I use the word "beginning" is that
>for humans, and presumably other animals, the use of spatial position is

Yes, I realize this is just a beginning. It will be a long haul, but I think it will be on the right path.

>I suspect that to replicate the full range of
>human auditory scene analysis (ASA), the attempt to solve the problem
>computationally (CASA) will have to use the same range of environmental
>cues.
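[The coarse DOA sorting described above can be sketched in a few lines. This is only an illustrative reading of the idea, not the essay's actual processor: the particle layout (time, DOA, amplitude) and the 30-degree sector width are assumptions added for the example.]

```python
import math

def sort_by_doa(particles, sector_deg=30.0):
    """Bin waveform 'particles' into coarse streams by instantaneous
    direction of arrival (DOA).  A particle here is (time_s, doa_deg, amp);
    both the tuple layout and the sector width are illustrative
    assumptions, not details taken from the essay."""
    streams = {}
    for time_s, doa_deg, amp in particles:
        sector = int(math.floor(doa_deg / sector_deg))  # coarse DOA bin
        streams.setdefault(sector, []).append((time_s, doa_deg, amp))
    return streams

# Two interleaved sources, one near 10 degrees and one near 80 degrees:
particles = [(0.001, 9.0, 0.5), (0.002, 81.0, -0.3),
             (0.003, 11.0, 0.4), (0.004, 79.0, -0.2)]
streams = sort_by_doa(particles)
# streams[0] collects the ~10-degree source, streams[2] the ~80-degree one.
```

[Later stages would then refine each coarse stream by temporal features, as the text describes.]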
I have a plan for this. As I mentioned at the end of my essay, the processing will be done in stages that extract intermediate levels of meaning.

>Apart from spatial origin, the following sorts of information are used by
>humans:
>
>(A) For integrating components that arrive overlapped in time:
>
>   1. Harmonic relations
>   2. Asynchrony of onset and offset
>   3. Spectral separation
>   4. Independence of amplitude changes in different
>      parts of the spectrum
>
>(B) For integrating components over time:
>
>   5. Spectral separation
>   6. Separation in time (interacts with other factors)
>   7. Differences in spectral shape
>   8. Differences in intensity (a weak effect)
>   9. Abruptness/smoothness of transition from one sound
>      to the next
>
>(I have attached a 2-page summary of what is known about ASA in humans. As
>well as mentioning factors 1 to 9, it describes the effects of ASA on the
>experience of the listener. I have used it as a handout in talks I have
>given. It is in RTF format, which should be readable by most versions of
>Word.)

Yes, I have read it. It looks very familiar!

>I'm not sure whether your rejection of the Fourier method extends to all
>methods of decomposing the input into spectral components. However, if it
>does, we should bear in mind that factors 3, 4, 5, 7, and probably 1,
>listed above, are most naturally stated on a frequency x time
>representation -- that is, on a spectrogram or something like it.
>
>Furthermore, when you look at a spectrographic representation of an auditory
>signal, the visual grouping that occurs is often directly analogous to the
>auditory organization (provided that the time and frequency axes are
>properly scaled). Why would this be so if some sort of frequency axis were
>not central to auditory perception, playing a role analogous to a spatial
>dimension in vision? Perhaps the Fourier transform is not the best
>approach to forming this frequency dimension, but something that does a
>similar job is required.
>Finally, there is overwhelming physiological
>evidence that the human nervous system does a frequency analysis of the
>sound and retains separate frequency representations all the way to the
>brain.

Although I do reject the Fourier method, I believe that I have addressed your requirements in both practice and concept. You might notice, in the second set of experiments on my site, that the display format includes a periodicity spectrum that is a time-domain version of the frequency spectrum. The difference is that the "periodicity sorting matrix" processor instantaneously recognizes mixed periodic events in the stream of WIVs even though they might be from different sources. This avoids the window problem of spectral methods. Interestingly, it has an inherent octave-related logarithmic scale that matches the tonotopic arrangement of the ear. More on that in the second section of the essay.

Again, this method addresses the requirement for periodicity/frequency perception, but does it in a way that allows separating mixed sources into separate streams. It's a trick I once devised for separating and identifying radar pulse streams.

>Perhaps I have missed some of the consequences of your method. If so, I
>would be happy to be corrected.

I realize that the concept is radical, and that is what makes it hard to translate. I hope this explanation helps.

Best wishes,

John Bates
Time/Space Systems
Pleasantville, New York
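[A postscript for readers curious about the radar analogy above: interleaved pulse trains are classically separated with an inter-event interval histogram, a time-domain "periodicity spectrum" whose peaks mark the repetition periods of the mixed sources. The sketch below is a generic version of that technique, not the essay's periodicity sorting matrix; all names and parameters are assumptions for illustration.]

```python
from collections import Counter

def periodicity_histogram(event_times, resolution=1e-4, max_lag=0.05):
    """Histogram of all pairwise inter-event intervals up to max_lag,
    for a sorted list of event times in seconds.  Peaks in the histogram
    fall at the repetition periods of the mixed pulse trains; this is a
    generic interval-histogram sketch, not the essay's processor."""
    hist = Counter()
    for i, t0 in enumerate(event_times):
        for t1 in event_times[i + 1:]:
            lag = t1 - t0
            if lag > max_lag:
                break  # event_times is sorted, so later lags only grow
            hist[round(lag / resolution)] += 1
    return hist

# Two interleaved trains: a 10 ms period and a 7 ms period (1.5 ms offset).
train_a = [0.010 * k for k in range(6)]
train_b = [0.0015 + 0.007 * k for k in range(7)]
hist = periodicity_histogram(sorted(train_a + train_b))
# hist[100] == 5 (10 ms period) and hist[70] == 6 (7 ms period):
# the two underlying periods show up as distinct histogram peaks.
```

[Each pulse can then be assigned to the train whose period it fits, which is the sense in which a periodicity sort separates mixed sources into streams.]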


This message came from the mail archive
http://www.auditory.org/postings/2001/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University