Subject: Re: About importance of "phase" in sound recognition
From: Ken Schutte <kenschutte@xxxxxxxx>
Date: Wed, 6 Oct 2010 11:16:32 -0500
List-Archive: <http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Hi,

I think one of the original post's main questions was about magnitude vs. phase in STFT analysis, and whether (and why) we can throw out the phase and use only the magnitude, as is done in MFCCs and other common methods. This seems to be a question more about information than about ears/brains - i.e. you can talk about the information available after STFT analysis/synthesis, regardless of how well the STFT approximates what our ears really do.

To me, the answer is simply that phonetic information is encoded in the short-time magnitude spectrum, i.e. in the spectrogram. To ask *why* this is the case, you have to think about how speech is produced and why/how we evolved to use speech to communicate. Our brains may use other information to help extract it (e.g. pitch might help us separate mixtures), but ultimately that is where the phonetic content lives - the short-time magnitude spectrum.

I just ran a quick experiment by modifying the STFT and reconstructing - see the following URL, containing files of the form {condition}.{T}.wav

http://s3.amazonaws.com/kenschutte/phase_experiment.zip

conditions:
  orig       : unmodified (should be perfect reconstruction)
  phase_rand : random phase and original magnitude
  phase_zero : zero phase and original magnitude
  mag_const  : set magnitude to a constant and use orig phase

The number, T, indicates the width of the analysis window (ms). All use 75% overlap and a simple overlap-add resynthesis.

The point is that mag_const.00030.wav is complete noise while phase_rand.00030.wav is completely intelligible. This isn't to say we're "insensitive" to the phase change - it obviously sounds different - but the phonetic information is still there.

Interestingly, mag_const.00001.wav *is* somewhat intelligible. This is because the frequencies of interest (formant ranges, < 1 kHz) have periods longer than the 1 ms analysis window, so that information is carried *across* frames rather than within single frames. Looking at its spectrogram with a reasonable window size, you can see the formants - so the short-time magnitude spectrum is preserved. What matters is the scale of 'short-time' you are talking about.

Ken

p.s.

I thought of one other thing to add: just take the FFT of the whole utterance - no frame-based analysis, like ifft(fft(x)). These are in there as "full.*.wav". Not surprisingly, none are intelligible.
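For anyone who wants to try something similar, here is a minimal sketch of the kind of processing described above. It is not the script that produced the files in the zip: the use of NumPy, the periodic Hann window, the squared-window normalisation in the overlap-add, the unit value for the constant magnitude, and the names stft_modify, full_modify and _alter are all assumptions made for illustration. Only the four conditions, the 75% overlap and the overlap-add resynthesis come from the message.

```python
import numpy as np


def _alter(spec, condition, rng):
    """Return a spectrum with magnitude or phase changed per the named condition."""
    mag, phase = np.abs(spec), np.angle(spec)
    if condition == "phase_rand":      # random phase, original magnitude
        phase = rng.uniform(-np.pi, np.pi, size=phase.shape)
    elif condition == "phase_zero":    # zero phase, original magnitude
        phase = np.zeros_like(phase)
    elif condition == "mag_const":     # constant magnitude (1.0 assumed), original phase
        mag = np.ones_like(mag)
    elif condition != "orig":
        raise ValueError("unknown condition: %r" % (condition,))
    return mag * np.exp(1j * phase)


def stft_modify(x, fs, win_ms, condition, seed=0):
    """Frame-based analysis, modification and overlap-add resynthesis.

    x: 1-D signal, fs: sample rate in Hz, win_ms: window width T in ms.
    """
    x = np.asarray(x, dtype=float)
    n = max(4, int(round(fs * win_ms / 1000.0)))  # window length in samples
    n -= n % 4                                    # multiple of 4 so the hop is exact
    hop = n // 4                                  # 75% overlap
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)  # periodic Hann (assumed)
    rng = np.random.default_rng(seed)

    # Zero-pad both ends so every original sample is covered by four frames.
    xp = np.concatenate([np.zeros(n), x, np.zeros(n + hop)])
    y = np.zeros(len(xp))
    norm = np.zeros(len(xp))
    for start in range(0, len(xp) - n + 1, hop):
        spec = np.fft.rfft(w * xp[start:start + n])
        frame = np.fft.irfft(_alter(spec, condition, rng), n)
        y[start:start + n] += w * frame           # windowed overlap-add
        norm[start:start + n] += w ** 2           # accumulated window energy
    y /= np.maximum(norm, 1e-12)
    return y[n:n + len(x)]


def full_modify(x, condition, seed=0):
    """Same modifications applied to one FFT of the whole signal (no framing)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    return np.fft.irfft(_alter(np.fft.rfft(x), condition, rng), len(x))
```

With a signal x sampled at fs, stft_modify(x, fs, 30, "phase_rand") and stft_modify(x, fs, 30, "mag_const") correspond to the two 30 ms conditions contrasted above, and stft_modify(x, fs, 1, "mag_const") to the 1 ms case. Applying the same conditions to the whole-utterance FFT in full_modify is a guess at what the "full" files contain.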