
Re: About importance of "phase" in sound recognition



Hi,

I think one of the original post's main questions was about magnitude vs. phase in STFT analysis, and whether (and why) we can throw out the phase and just use the magnitude, as is done in MFCCs and other methods people use.  It seems this is a question more about information than about ears/brains - i.e. you can talk about the information available after STFT analysis/synthesis, regardless of how well the STFT approximates what our ears really do.

To me, it seems like the answer is simply that phonetic information is encoded in the short-time magnitude spectrum, i.e. in the spectrogram.  To ask *why* this is the case, you have to think about how speech is produced and why/how we evolved to use speech to communicate.  Our brains may use other information to help us extract this information (e.g. pitch might help us separate mixtures, among other things), but ultimately, that is where the phonetic content lives - the short-time magnitude spectrum.

I just ran a quick experiment by modifying the STFT and reconstructing - see the following URL, containing files of the form {condition}.{T}.wav
http://s3.amazonaws.com/kenschutte/phase_experiment.zip

conditions:
  orig       : unmodified (should be perfect reconstruction)
  phase_rand : random phase and original magnitude
  phase_zero : zero phase and original magnitude
  mag_const  : set magnitude to a constant and use orig phase

The number, T, indicates the width of the analysis window (ms).  All use 75% overlap and a simple overlap-add resynthesis.
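
For anyone who wants to try something similar, here is a rough Python/NumPy sketch of the four conditions. It is not the script used to make the files above. The Hann window, the normalization by the summed squared window, the value 1.0 for the constant magnitude, and the names (stft_modify, win_ms, seed) are all assumptions made for illustration; only the conditions, the 75% overlap and the overlap-add resynthesis come from the description above.

```python
import numpy as np

def stft_modify(x, fs, win_ms, condition, seed=0):
    """Frame x, modify each frame's spectrum, and resynthesize by overlap-add.

    condition is one of "orig", "phase_rand", "phase_zero", "mag_const".
    """
    n = max(4, int(round(fs * win_ms / 1000.0)))
    n -= n % 4                       # window length divisible by 4
    hop = n // 4                     # 75% overlap
    win = np.hanning(n + 1)[:-1]     # periodic Hann window (an assumption)
    rng = np.random.default_rng(seed)

    # Zero-pad one window on each side so every input sample is fully covered.
    xp = np.concatenate([np.zeros(n), x, np.zeros(n)])
    n_frames = 1 + (len(xp) - n) // hop
    y = np.zeros(len(xp))
    norm = np.zeros(len(xp))

    for i in range(n_frames):
        s = i * hop
        spec = np.fft.rfft(xp[s:s + n] * win)
        mag, phase = np.abs(spec), np.angle(spec)

        if condition == "phase_rand":
            phase = rng.uniform(-np.pi, np.pi, size=phase.shape)
        elif condition == "phase_zero":
            phase = np.zeros_like(phase)
        elif condition == "mag_const":
            mag = np.ones_like(mag)  # constant value 1.0 is an assumption
        elif condition != "orig":
            raise ValueError("unknown condition: %s" % condition)

        frame = np.fft.irfft(mag * np.exp(1j * phase), n=n)
        y[s:s + n] += frame * win    # synthesis window, then overlap-add
        norm[s:s + n] += win ** 2

    y /= np.maximum(norm, 1e-12)     # undo the analysis/synthesis window gain
    return y[n:n + len(x)]
```

Calling stft_modify(x, fs, 30, "phase_rand") on a mono float signal x sampled at fs Hz would correspond roughly to phase_rand.00030.wav. With "orig" the output should match the input up to floating-point error, which is a handy sanity check on the framing.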

The point is that mag_const.00030.wav is complete noise while phase_rand.00030.wav is completely intelligible.  This isn't to say we're "insensitive" to this change; it obviously sounds different, but the phonetic information is still there.

Interestingly, mag_const.00001.wav *is* somewhat intelligible.  This is because frequencies of interest (formant ranges, < 1 kHz) have periods longer than the 1 ms analysis window, so that information is contained *across* frames rather than in single frames. Looking at the spectrogram of that file with a reasonable window size, you see the formants - so the short-time magnitude spectrum is preserved. What matters is the scale of 'short-time' you are talking about.

Ken

p.s.

I thought of one other thing to add: just take the FFT of the whole utterance - no frame-based analysis, like ifft(fft(x)).  These are in the zip file as "full.*.wav".  Not surprisingly, none of the modified versions are intelligible.
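
A minimal sketch of that whole-utterance variant, again in Python/NumPy and again not the original script: the function name full_fft_modify, the seed argument and the constant magnitude of 1.0 are assumptions for illustration.

```python
import numpy as np

def full_fft_modify(x, condition, seed=0):
    """Apply a magnitude/phase modification to one FFT of the whole signal."""
    spec = np.fft.rfft(x)
    mag, phase = np.abs(spec), np.angle(spec)

    if condition == "phase_rand":
        rng = np.random.default_rng(seed)
        phase = rng.uniform(-np.pi, np.pi, size=phase.shape)
    elif condition == "phase_zero":
        phase = np.zeros_like(phase)
    elif condition == "mag_const":
        mag = np.ones_like(mag)      # constant value 1.0 is an assumption
    elif condition != "orig":
        raise ValueError("unknown condition: %s" % condition)

    # n=len(x) keeps the output length right for odd-length inputs too.
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(x))
```

Here the magnitude spectrum carries no timing at all, so where each sound occurs in the utterance is encoded entirely in the phase.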