Subject: summary of replies
From: Luis Felipe de Oliveira <luisfol(at)BOL.COM.BR>
Date: Mon, 9 Dec 2002 14:31:10 -0300

Dear list,

Some time ago I posted a message asking about PCA as a pre-processing
technique for neural-net inputs. Some of you requested a post with the
replies I got. So here it is, with a little delay (sorry about that).
Thanks a lot for all your help and suggestions.

Luis Felipe Oliveira


Malcolm Slaney suggests:

See also
http://www.almaden.ibm.com/cs/people/malcolm/pubs/MPESAR-ICME2002.pdf
for a good solution to this problem. The basic input format, anchor
models, should solve your problem.


Matt Flax:

You could try PDP++, a neural network computation environment. It allows
you to use CSS, its C++-like scripting language, to pre-process the
input, which might mean you need to write C++-style code to take wav
files and transform them.
http://www.cnbc.cmu.edu/resources/PDP%2B%2B/PDP%2B%2B.html


John Hershey:

Matlab makes all of this fairly easy. You can write your own programs to
read the wav files, perform windowed FFTs, perform PCA on the power
spectra (or log power spectra), and implement your neural net.
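A minimal sketch of the pipeline Hershey describes, here in Python with
NumPy/SciPy as a stand-in for Matlab (the file name, window size, and
number of components are illustrative assumptions, not from the thread):

    # Windowed FFTs over a wav file, then PCA on the log power spectra.
    import numpy as np
    from scipy.io import wavfile

    rate, signal = wavfile.read("sound.wav")   # mono 16-bit wav assumed
    signal = signal.astype(float)

    win, hop = 512, 256                        # arbitrary frame settings
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectra
    log_power = np.log(power + 1e-10)                  # avoid log(0)

    # PCA via SVD on the mean-centred spectra; keep the top components
    # as a compact input representation for the net.
    centred = log_power - log_power.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    features = centred @ vt[:10].T             # 10 components, arbitrary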
Christian Spevak:

A standard technique for audio preprocessing is the extraction of
mel-frequency cepstral coefficients (MFCCs). Logan (2000) argued that
this can be regarded as an approximation of the KL transform.

@INPROCEEDINGS{Logan00,
  author   = {Beth Logan},
  title    = {Mel frequency cepstral coefficients for music modeling},
  crossref = {ISMIR00},
}

@PROCEEDINGS{ISMIR00,
  title     = {Proceedings of the First International Symposium on Music
               Information Retrieval (ISMIR)},
  booktitle = {Proceedings of the First International Symposium on Music
               Information Retrieval (ISMIR)},
  year      = {2000},
  address   = {Plymouth, Massachusetts},
  month     = oct,
  key       = {ISMIR},
  url       = {http://ciir.cs.umass.edu/music2000},
}

A Matlab implementation is included, e.g., in Slaney's Auditory Toolbox.

@techreport{Slaney98a,
  author      = {Malcolm Slaney},
  title       = {Auditory {T}oolbox {V}ersion 2},
  institution = {Interval Research Corporation},
  address     = {Palo Alto, California},
  year        = {1998},
  type        = {Interval Technical Report},
  number      = {1998-010},
  url         = {http://www.slaney.org/malcolm/pubs.html},
}

In my own work, I have used MFCCs in combination with self-organizing
maps to classify sound. If you are interested:

Christian Spevak, Richard Polfreman and Martin Loomes. Towards detection
of perceptually similar sounds: investigating self-organizing maps. In:
Proceedings of the AISB'01 Symposium on Creativity in Arts and Sciences,
pp. 45-50. York, 2001. http://www.spevak.de/pub/

Other approaches in timbre recognition build on the extraction of
multiple features, such as spectral centroid, spectral flux, RMS, etc.,
or higher-level features characterizing, e.g., rhythmic texture. A good
overview of such feature extraction techniques has recently been given
by Tzanetakis:

@phdthesis{Tzanetakis02,
  author = {George Tzanetakis},
  title  = {Manipulation, Analysis and Retrieval Systems for Audio
            Signals},
  school = {Princeton University},
  year   = {2002},
  url    = {http://www.cs.princeton.edu/~gtzan/},
}


Huseyin Hacihabiboglu:

Audio data contained in WAV or AIFF files are bulky for any algorithm
which is not psychoacoustically motivated (using ANNs is such a case).
Therefore, feature vectors which define the salient properties of sounds
are needed. Different approaches have been taken, employing different
feature vector types. In my opinion, the most fruitful approaches used
TF representations (in one way or another) of sound as feature vectors.
What I understand from your case is that you're trying PCA to find a
couple of principal components of the sound file as feature vectors. I
would recommend:

"P. Herrera, X. Amatriain, E. Batlle, and X. Serra, Towards instrument
segmentation for music content description: A critical review of
instrument classification techniques, Proceedings of the ICMC99, 1999."

as a good overview (i.e., not only of ANNs but also of other fancy
methods such as rough sets) of the different approaches taken.

I worked on a similar topic during my MSc, and you can reach my thesis
and related publications from http://husshho.port5.com

As for computer programs, I would recommend MATLAB, which has a Neural
Network Toolbox (http://www.mathworks.com) and also has utilities to
read wav and aiff files. Other than that, you can also try Octave
(http://www.octave.org), which is a free GPL tool, but I don't know if
it includes any neural network toolbox. In any case, Octave may help you
with reading the aiff files. (In Turkish we say: "Fair enough for the
roast of the cheap meat!") :)


Ramin Pichevar:

I think that a lot of processing has to be done before you can apply
your signal to your neural network. PCA and Karhunen-Loeve transforms
suppose that your signal is stationary, which is not the case for sound
sources. In addition, the Levinson-Durbin algorithm (an implementation
of the Karhunen-Loeve transform (KLT)) is very CPU-consuming. Yes, in
fact, the KLT is a compression technique (so that it will take fewer
neurons in the input layer), but timbre is a "detail". Are you sure that
by doing a KLT you don't lose important timbral information? If you use
the KLT anyway, then an HMM (Hidden Markov Model) or a classical neural
network (backprop, etc.) could be used for the classification. But don't
use the KLT on raw (.wav) data. Use one of the feature extraction
algorithms used for sound processing [1].

You didn't mention what type of network you want to use, but most of the
classical neural networks are static and unsuitable for this task in the
general sense. There is a new tendency in this field toward more
biological approaches ([2][3][4] and so many others). If you opt for
this approach, you should begin with a cochlear filterbank, with
envelope detection afterwards [2]. Hope this can help you a little bit.

Falou!,
Ramin

[1] @ARTICLE{Picone93,
  author  = {Joseph W. Picone},
  month   = sep,
  year    = 1993,
  title   = {Signal Modeling Techniques in Speech Recognition},
  journal = {Proceedings of the IEEE},
  volume  = 81,
  number  = 9,
  pages   = {1215-1247},
}

[2] @ARTICLE{Wang99,
  author  = {D. Wang and G. J. Brown},
  month   = may,
  year    = 1999,
  title   = {Separation of Speech from Interfering Sounds Based on
             Oscillatory Correlation},
  journal = {IEEE Transactions on Neural Networks},
  volume  = 10,
  number  = 3,
  pages   = {684-697},
}

[3] @INPROCEEDINGS{Rouat2002,
  author    = {J. Rouat and R. Pichevar},
  year      = 2002,
  title     = {Nonlinear Speech Processing Techniques for Source
               Segregation},
  booktitle = {{EUSIPCO2002}},
}

[4] @CONFERENCE{Pichevar2002,
  author    = {R. Pichevar and J. Rouat},
  year      = 2002,
  title     = {Double-Vowel Segregation Based on a {Cochleotopic/AMtopic}
               Map Using a Biological Neural Network},
  booktitle = {{APCAM2002}},
}
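A crude sketch of the front end Pichevar recommends: a bank of bandpass
channels followed by envelope detection. Log-spaced Butterworth filters
stand in here for a true cochlear (e.g. gammatone) filterbank, and the
channel count and band edges are illustrative assumptions:

    import numpy as np
    from scipy.signal import butter, lfilter, hilbert

    def filterbank_envelopes(signal, rate, n_channels=16,
                             f_lo=100.0, f_hi=4000.0):
        # Log-spaced centre frequencies, loosely mimicking cochlear
        # spacing; f_hi must stay below the Nyquist frequency rate/2.
        centres = np.geomspace(f_lo, f_hi, n_channels)
        envelopes = []
        for fc in centres:
            lo, hi = fc / 2 ** 0.25, fc * 2 ** 0.25   # half-octave band
            b, a = butter(2, [lo / (rate / 2), hi / (rate / 2)],
                          btype="band")
            band = lfilter(b, a, signal)
            envelopes.append(np.abs(hilbert(band)))   # envelope detection
        return np.array(envelopes)                    # (channels, samples)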
Ali Taylan Cemgil:

I did such an experiment many years ago. I have an online paper that you
can find at http://www.mbfys.kun.nl/~cemgil/papers/netenga.ps.gz


Jay Reynolds:

Check out: http://www.cs.unr.edu/~bebis/CS791S/Code/pca/


Rua Haszard Morris:

We are in a similar situation; we have time-based pitch contours and
face motion data (17*3 dimensions) that we'd like to input to a neural
net or some kind of classification stats. I am in the middle of looking
at reducing the motion data using a PCA-based technique.

I believe that a PCA would result in time-based data (but with fewer,
say 7, dimensions), and would therefore not be suitable for direct input
to a neural net. However, such data could be modelled and the parameters
for the model input to the net/stats.

For example, we are using a polynomial model to reduce time-based pitch
contours (from speech data) to coefficients, which are then fed into a
discriminant analysis to classify with. We are having moderate success
with this approach, but are still looking at what to do with the motion
data.

So I would appreciate it if you could forward to me any suggestions
people make to you on this question. If we have any breakthroughs or
success I'll let you know...
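A small sketch of the contour-to-coefficients idea Morris describes: fit
a polynomial to each time-based contour and feed the coefficients to a
discriminant analysis. The data, labels, and polynomial degree below are
placeholders, and scikit-learn's LDA stands in for the "classification
stats":

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def contour_coeffs(contour, degree=3):
        t = np.linspace(0.0, 1.0, len(contour))   # normalised time axis
        return np.polyfit(t, contour, degree)     # degree + 1 coefficients

    contours = [np.random.rand(100) for _ in range(40)]  # placeholder data
    labels = np.repeat([0, 1], 20)                       # placeholder classes

    X = np.array([contour_coeffs(c) for c in contours])
    lda = LinearDiscriminantAnalysis().fit(X, labels)
    print(lda.predict(X[:5]))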