Subject: Re: CASA problems and solutions
From: "John K. Bates" <jkbates(at)COMPUTER.NET>
Date: Wed, 31 Jan 2001 10:46:31 -0500

Dear Dr. Wang,

I'm afraid that you misunderstood my objectives in my essay on CASA. Although I made many references to human auditory perception (ASA), what I tried to make clear was that I wanted to derive the design requirements and constraints leading to a practical CASA system that could (ideally) do what the human ear does. In other words, in order to design a system that could solve auditory problems, it would be necessary to find the fundamental requirements as Seebeck might have seen them.

So let's start from scratch to design a hearing system. The main requirements are: (1) separate the sources, and (2) at any instant, select the primary source for attention. Whether we separate sources by spatial localization or by other source features, an objective analysis of the first problem says that we need the best possible time resolution. We haven't been able to get that by Fourier methods, so what else is there to do but seek the best possible time resolution by other means? If that is the primary constraint, then it is necessary to find a way to combine high time resolution with source information. My solution is to use pixel-like particles that I call waveform information vectors (WIVs), which are sampled at waveform zeros. From this starting point, we select, combine, and recognize patterns in the data, ultimately arriving at the perception of speech as well as all other kinds of sounds. Obviously, this method has nothing to do with the biophysics of the ear. To demonstrate that the method is feasible, I have included a few illustrative experiments on my Web site: <http://home.computer.net/~jkbates>.

To satisfy the second objective, we must return to abstractions concerning how, when, and why we want our machine to listen to a particular sound. That is the really tough problem, because the ultimate purpose of a CASA machine is to act like a listener that has a reason to survive in whatever mission it is designated to carry out. Hence my discussion of the philosophy of survival objectives. To my mind, implementing this concept of survival in practical form is necessary: it is the only way to arrive at a robust speech recognizer.

As for the subject of spatial separation, I'm aware that verification takes a lot more than a few experiments. I have done a variety of tests under varying conditions for a number of years, and the system seems reliable. I feel that with improved testing facilities and software, the results will be even better.

With reference to your mention of pixel correspondence in getting direction of arrival, I believe that my method solves a similar problem in my interaural time difference scheme. It turned out to be easy because it was a variation on a system I had designed and patented in 1975. Actually, the system looks a lot like the Jeffress model, except that I use zero crossings instead of phase. The system is described briefly in my paper presented at the Mohonk '97 workshop [1].

While other attempts at location-based source separation have had some success, the important point is, as you mention, whether the methods relate to the entire CASA system. This is what I have always tried to keep in mind. What I have shown so far is only what is necessary to establish feasibility: that the ultimate objectives are possible.

Many thanks for your comments; they are very helpful.
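For concreteness, here is a toy Python sketch of the two ideas above: sampling a waveform at its zeros to get high-time-resolution, pixel-like features, and a Jeffress-style coincidence scheme that pairs zero-crossing times across two channels instead of comparing phase. This is not the actual WIV system; the feature contents, the matching rule, and all names and parameters here (FS, max_itd, and so on) are simplifications for illustration only.

# Illustrative sketch only -- not the actual WIV/ITD system. It shows
# (1) sampling a waveform at its zero crossings, and (2) a Jeffress-like
# coincidence scheme on zero-crossing times rather than phase.
import numpy as np

FS = 16000  # sample rate in Hz (assumed)

def zero_crossing_times(x, fs=FS):
    """Times of upward zero crossings, refined by linear interpolation."""
    i = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]
    frac = -x[i] / (x[i + 1] - x[i])  # fractional position of the zero
    return (i + frac) / fs

def wiv_like_features(x, fs=FS):
    """A guess at a 'pixel-like' feature: for each interval between
    successive zeros, record (start time, interval length, peak level)."""
    t = zero_crossing_times(x, fs)
    feats = []
    for t0, t1 in zip(t[:-1], t[1:]):
        seg = x[int(t0 * fs):int(t1 * fs) + 1]
        feats.append((t0, t1 - t0, float(np.max(np.abs(seg)))))
    return feats

def itd_from_zeros(left, right, fs=FS, max_itd=0.001):
    """Pair each left-channel zero with the nearest right-channel zero
    within +/- max_itd seconds; the median pairwise difference is the
    ITD estimate (positive when the right channel lags)."""
    tl = zero_crossing_times(left, fs)
    tr = zero_crossing_times(right, fs)
    diffs = [tr[np.argmin(np.abs(tr - t))] - t for t in tl
             if np.min(np.abs(tr - t)) <= max_itd]
    return float(np.median(diffs)) if diffs else 0.0

# Demonstration: a 500 Hz tone arriving 0.4 ms later at the right ear.
t = np.arange(0, 0.1, 1.0 / FS)
left = np.sin(2 * np.pi * 500 * t)
right = np.sin(2 * np.pi * 500 * (t - 0.0004))
print("estimated ITD: %.6f s" % itd_from_zeros(left, right))  # ~0.000400

Note that for a pure tone the zeros recur every period, so the matching window (max_itd) must be shorter than half a period to avoid ambiguous pairings; this is the same slip problem the Jeffress model faces with phase.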
Best wishes,

John Bates
Time/Space Systems
Pleasantville, New York

[1] J.K. Bates, "Modeling the Haas effect: a first step for solving the CASA problem," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 1997.

>There is a distinction between auditory scene analysis (ASA) as defined by Bregman,
>and computational auditory scene analysis (CASA). From my reading of your writeup,
>your criticisms are targeted at ASA. To a CASA researcher like me, the problem is
>well defined: how to extract auditory events (objects) from one sensor (monaural),
>two sensors (binaural), or more (a sensor array). This definition is, in the Marrian
>framework, at the level of computational theory, independent of Fourier analysis,
>cochlear models, or even the auditory system. One may insist on biological
>plausibility, or one may pay close attention to how the auditory system
>solves the problem (which we don't know) in order to borrow ideas and solutions.
>
>But, regardless of approach, the CASA problem remains unsolved. I think that a lot
>of people are not really biased in terms of approaches, and some frankly don't care
>about human audition. This speaks to the challenge of the problem itself. Moreover,
>one need not be too pessimistic: think about computer vision, where A LOT more
>people have been working, and artificial intelligence.
>
>What I am getting at is that, if you can manage to separate multiple sources spatially
>and reconstruct them, it will be a great technological breakthrough. I don't know if
>you can do it and I have doubts (see below), but we will study your approach.
>
>Location-based source separation has been attempted before with some success. But
>it is far from solving the CASA problem. Successes in a few demos are far from a
>general solution. There are well-documented test databases for measuring success in
>CASA in a systematic way (see the references below). Results on these databases
>would be a lot more revealing.
>
>Since you have made a connection with vision, spatial analysis in audition would
>roughly correspond to depth perception from two images, a problem that has
>been studied since the early days of computer vision. The challenge there, which
>remains to this day, is the correspondence problem: which pixels of one image
>correspond to which pixels of the other. It's hard to find a solution to the
>correspondence problem without image analysis (grouping and segmentation).
>Similarly, it's hard for me to imagine a solution to CASA purely on the basis of
>spatial analysis, without the other cues of ASA. The replies by Al Bregman and
>John Culling suggest that location may not even be a major cue for ASA. I'd like
>to be proven wrong, so that at least we have one solution to rely on.
>
>Some recent references on CASA:
>
>D.F. Rosenthal and H.G. Okuno, Eds., Computational Auditory Scene Analysis.
>Mahwah, NJ: Lawrence Erlbaum, 1998.
>
>D.L. Wang and G.J. Brown, "Separation of speech from interfering sounds based on
>oscillatory correlation," IEEE Trans. Neural Networks, vol. 10, pp. 684-697, 1999.
>(PDF file available on my Web site.)
>
>Cheers,
>
>DeLiang Wang