Subject: Re: Robust method of fundamental frequency estimation.
From: Arturo Camacho <acamacho@xxxxxxxx>
Date: Mon, 26 Feb 2007 23:56:12 -0500
List-Archive: <http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Dick,

Below are my answers to your questions.

> Arturo, thanks for the pointer to your poster. I had to go back to this
> message to decode the different algorithms you compared. Now, I have some
> questions.
>
> You say "Although some of these algorithms were initially proposed using a
> time-domain approach, all of them can also be formulated using the spectrum
> of the signal, and that is the approach we took." That may be true, but
> there are other good time-domain correlation-based pitch models that can
> NOT be expressed in terms of the spectrum. For example, the Meddis & Hewitt
> or Meddis & O'Mard models, or the Slaney & Lyon models, derived from
> Licklider's duplex theory, which do the ACF after what the cochlea model
> does, which is a separation into filter channels and a half-wave
> rectification.

I do not agree. If you know the frequency response of the cochlea, you can
predict the spectrum of its output from the spectrum of its input. The
effects of half-wave rectification and compression are more difficult to
analyze, but not impossible. I remember reading a little bit about this in
Anssi Klapuri's PhD thesis.

> Did you consider any such models?

I have used these models in the past, but I stopped using them. If I am not
wrong, what Slaney & Lyon's model does is apply a summary autocorrelation to
the output of a gammatone filterbank (it does some extra steps, but that is
the main idea). Since this can be shown to be equivalent to applying
autocorrelation to the original signal (use the Wiener–Khinchin theorem and
the linearity of the Fourier transform), I no longer use it. A small
numerical sketch of that equivalence follows.
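To make the equivalence concrete, here is a minimal numerical sketch. It is
my own toy construction, not Slaney & Lyon's implementation: an ideal
bandpass filterbank stands in for the gammatone filterbank, and the summary
ACF (the sum of per-channel autocorrelations) is compared with the ACF of
the input under the filterbank's composite spectral weighting. The argument
needs only the Wiener–Khinchin theorem and linearity, so any linear
filterbank will do.

import numpy as np

fs, n = 8000, 4096
t = np.arange(n) / fs
# A 200 Hz complex tone with five harmonics as a stand-in input.
x = sum(np.sin(2 * np.pi * k * 200 * t) for k in range(1, 6))

X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(n, 1 / fs)

# Toy linear filterbank: contiguous ideal bandpass channels (chosen for
# clarity; a gammatone filterbank differs only in the weighting below).
edges = np.linspace(0, fs / 2, 17)
summary_acf = np.zeros(n)
composite = np.zeros(len(X))
for lo, hi in zip(edges[:-1], edges[1:]):
    H = ((freqs >= lo) & (freqs < hi)).astype(float)
    composite += H ** 2
    # Wiener-Khinchin: a channel's ACF is the inverse FFT of its power
    # spectrum |H(f) X(f)|^2.
    summary_acf += np.fft.irfft(np.abs(H * X) ** 2, n)

# ACF of the original signal under the composite spectral weighting.
weighted_acf = np.fft.irfft(composite * np.abs(X) ** 2, n)

print(np.allclose(summary_acf, weighted_acf))  # True

So the summary ACF carries no information beyond the ACF of the (spectrally
weighted) original signal, which is why I work directly from the spectrum.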
About the Meddis, Hewitt, and O'Mard models: applying half-wave rectification
to the output of the gammatone filterbank is a good idea because it adds
useful harmonics to the signals (see Klapuri's thesis), and applying
compression is also a good idea because it reduces the squaring effect of
autocorrelation. However, I did not include these models in my study because
they are level dependent. My experience with them is that the firing-rate
patterns vary a lot when the level of the signal changes, and this produced
changes in pitch with level. Since in applications (at least the ones I work
with) we do not know the level at which the signal was recorded or the level
at which it will be reproduced, I prefer not to use level-dependent models.
However, I recognize the utility of half-wave rectification and compression,
and I am actually working on a model/algorithm that makes use of them to
estimate pitch.

> Have others reported their results on that speech database? I think that's
> really the competition if you have a new pitch model, especially if you
> want more generality beyond speech and music.
>
> Your poster says that the spectra were estimated using FFT, and the next
> sentence says using a gammatone filterbank. Which is it? Or both? Oh, I
> see, one says the algorithm and the other the model. Why would you choose
> an algorithm that doesn't match the model? Why treat these as conceptually
> different things? An algorithm is a computational model, is it not?

The reason I distinguished between the algorithm and the model is that
computational efficiency was a goal of the algorithm, but not of the model.

When we created the algorithm, our goal was not to match the pitch perception
of weird complex tones or noises, but to estimate the pitch of more natural
sounds like speech. It may seem that I am contradicting myself, because what
we used in the poster to show the other algorithms' pitfalls were complex
tones and noises, but those tones and noises were inspired by speech signals.
For example, I was working with simulated telephone speech when I discovered
that the Harmonic Product Spectrum (HPS) produced more errors for male voices
than for female voices. Analyzing the errors, I found that HPS does not work
well when the fundamental is missing, which is obvious from the definition of
the algorithm: HPS multiplies the spectrum sampled at integer multiples of
each fundamental candidate, so a missing fundamental drives the product at
the true candidate toward zero. Since the fundamental of male speech is most
of the time below the 300 Hz lower limit of telephone speech, it is clear
that HPS is more prone to fail for male than for female telephone speech (a
small sketch of this failure is in the P.S. below). However, in the poster we
showed examples with complex tones instead of speech because they are easier
to describe and are equally good for showing the problem. Anyway, the poster
also shows that our algorithm performed well on Paul Bagshaw's speech
database for pitch estimation, which was our goal.

Arturo

--
__________________________________________________
 Arturo Camacho
 PhD Candidate
 Computer and Information Science and Engineering
 University of Florida

 E-mail: acamacho@xxxxxxxx
 Web page: www.cise.ufl.edu/~acamacho
__________________________________________________
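P.S. Here is the small sketch of the HPS failure mentioned above. The stimuli
are my own construction (a 150 Hz ten-harmonic tone with 1/k rolloff, and a
crude "telephone" version with everything below 300 Hz removed), not the ones
from the poster; the parameters are chosen only so that the harmonics fall on
exact FFT bins.

import numpy as np

fs = 8000
n = 8000                     # 1 s of signal, so FFT bins fall on integer Hz
f0 = 150                     # male-like f0, below the 300 Hz telephone cutoff
t = np.arange(n) / fs
amps = [1.0 / k for k in range(1, 11)]  # mild rolloff, as in voiced speech
voiced = sum(a * np.sin(2 * np.pi * k * f0 * t)
             for k, a in enumerate(amps, start=1))
telephone = sum(a * np.sin(2 * np.pi * k * f0 * t)
                for k, a in enumerate(amps, start=1) if k * f0 >= 300)

def hps(x, fs, num_terms=5, fmin=50, fmax=500):
    # Harmonic Product Spectrum: pick the f that maximizes
    # |X(f)| * |X(2f)| * ... * |X(Kf)|.
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    product = np.ones_like(mag)
    for k in range(1, num_terms + 1):
        ds = mag[::k]        # the spectrum sampled at k times each frequency
        product[:len(ds)] *= ds
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(product[band])]

print(hps(voiced, fs))       # 150.0: fine while the fundamental is present
print(hps(telephone, fs))    # 300.0: the product at 150 Hz contains the
                             # near-zero factor |X(150)|, so 300 Hz wins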