Subject: Brian Karlsen: Re: speech/music
From: Dan Ellis <dpwe(at)ICSI.BERKELEY.EDU>
Date: Thu, 2 Apr 1998 10:51:07 PST

Dear List -

I'm forwarding this message on behalf of Brian Karlsen <blyk(at)cpk.auc.dk>
who is having problems with the listserver.

DAn.

-- forwarded message
Date: Thu, 02 Apr 1998 09:30:45 +0200
From: Brian Lykkegaard Karlsen <blyk(at)cpk.auc.dk>
Organization: Aalborg University
Subject: Re: speech/music

To get back to Sue's original question, I have this late response:

Sue Johnson wrote:
>
> Dan/List
>
> > Houtgast & Steeneken). Rather, I think a lot of our uncanny ability to
> > pick out speech from the most bizarrely-distorted signals comes from the
> > very highly sophisticated speech-pattern decoders we have, which employ all
> > of the many levels of constraints applicable to speech percepts (from
> > low-level pitch continuity through to high level semantic prior
> > likelihoods) to create and develop plausible speech hypotheses that can
> > account for portions of the perceived auditory scene.
> >
>
> I have problems with this. (sorry)
> I'm sure you must be able to detect the presence of speech independent of
> being able to recognise it. If someone spoke to me in Finnish say, I would
> be able to tell they were speaking (even in the presence of background
> music/noise), even though I couldn't even segment the words, never mind
> syntactically or semantically parse them.
> I think there must be some way the brain splits up (deconvolves) the
> signal before applying a speech recogniser.
> (I have no proof of this of course, it's just a gut feeling)
>
> I agree having a recogniser which would cope with speech would be the
> ideal solution, but there is problems of training appropriate models to
> recognise music you haven't seen before (the current HMM methods assume
> the training data represents in some way the same distribution as the test
> data), and from a time constraint, any removal of audio without relevant
> information content before recognition is helpful.
>
> I dont have the slightest idea of how the brain detects speech, but it
> would seem logical to me that it can do that on a very low-level acoustic
> basis. If this were true then in theory a front-end speech detector should
> be possible.
>
> I admit I know very little on this subject, so am looking forward to
> people correcting me.
>
> thanks for all your comments.
> Sue

Sue/list,

I think you're partially right about this. Of course it doesn't require
recognition to tell speech from other sounds, but I think the point that
Dan was trying to make is that primitive processes (as referred to in
Auditory Scene Analysis) are not sufficient for making speech into a
"stream" (also an ASA term). Too many incoherent acoustic elements are
involved - how do we, for example, group an /s/ and a /u/ together? They
have practically nothing in common acoustically. So you need some kind of
higher-order knowledge about the speech signal to be able to segregate it
from a mixture with an undesired sound. Of course I'm not talking about
conscious knowledge here.

One way of thinking of it is to picture a process hierarchy which is
neither exclusively bottom-up nor exclusively top-down. At the bottom
you'll find the inner ear, where mechanical motion is transformed into
neural spikes, and at the top you'll find some kind of sound recognition
engine. In between there will be all kinds of intermediate levels which
can interact with each other.
At one of these levels individual streams are separated from each other.
This level is the one which is pertinent to this discussion.

I have to say that many of these ideas are my interpretations of the ideas
of other people, taken partially from Al Bregman's book on ASA and
partially from discussions with Phil Green, Martin Cooke and Guy Brown at
Sheffield.

Brian Karlsen
Center for Personkommunikation
Aalborg University
Denmark

-- end of forwarded message