Re: frequency to mel formula (Arturo Camacho)


Subject: Re: frequency to mel formula
From:    Arturo Camacho  <acamacho@xxxxxxxx>
Date:    Mon, 7 Mar 2011 17:59:56 -0600
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Dear members of the list:

Following the interesting observations made by Dick, I want to say that
some years ago I proposed a pitch estimator named SWIPE [A. Camacho and
J. Harris, J. Acoust. Soc. Am. 124, 1638-1652 (2008)] that computes the
spectrum using scales of the form x(f) = C log(1 + f/s). I tested SWIPE
with values of s equal to 0, 229, 700, and infinity, corresponding to the
log, ERB, mel, and Hertz scales, respectively (the first and last as
limiting cases). The one that produced the best results overall was s=229
(ERB scale). Interestingly, the results varied monotonically with s for
the nonzero values: s=229 produced better results than s=700, and s=700
produced better results than s=infinity. On the other hand, s=0 was even
worse than s=infinity.

Unfortunately, I did not know about Greenwood's scale at that time. I just
finished another study in which I added to SWIPE a preprocessing stage
based on an auditory model. This time I incorporated s=150, which
corresponds to Greenwood's scale. (In his 1990 paper, Greenwood does not
specify an exact value for k in eq. [1], but suggests values between 0.8
and 1, which gives values of s = 165.4 k between 132.3 and 165.4 Hz.) The
results obtained were not as good as with s=229 and not as bad as with
s=0, which suggests that for SWIPE the optimal value of s must be between
150 and 700, and maybe very close to 230.

Arturo

On Thu, Jul 23, 2009 at 12:43 AM, Richard F. Lyon <DickLyon@xxxxxxxx> wrote:
> I'd still like to understand more of the history of the Mel scale, formulas
> for it, and its relationship to other scales; did O'Shaughnessy come up with
> the 700?  Or did he get it from somewhere else?  Someone figured the 1000
> was just too high to be realistic?
>
> I've been reviewing some of Don Greenwood's papers, and the Wikipedia
> article on his "Greenwood function" at
> http://en.wikipedia.org/wiki/Greenwood_function .  And Don's comments from
> last Jan: http://www.auditory.org/postings/2009/53.html
>
> Don says a good map of cochlear position x (from 0 at apex to 1 at base) to
> frequency f in hertz is f = 165.4*(10^(2.1x) - 1).  Solving for x and
> scaling to get 1000 at f = 1000, we get a formula in the form of the
> mel-scale formula:
>
>  m = 512.18 * ln(f/165.4 + 1).
>
> The key here is not the scale factor, but the "break frequency", 165.4 Hz,
> that separates the log-like high-frequency region from the linear-like
> low-frequency region.  Don finds that the data imply a much lower break
> frequency than has traditionally been used; his papers show that the higher
> values (700 or 1000) are too high to fit the published data that they're
> supposed to be based on.  That means the map is logarithmic over a wider
> range than usually recognized, and that the critical bands at the low end
> are much narrower than some scales would imply.
>
> The ERB-rate scale based on Glasberg and Moore 1990 has a corresponding
> break point at 228.8 Hz, much closer to Greenwood's interpretation than to
> the mel-scale interpretations (this is from ERB = 24.7 (4.37f/1000 + 1),
> where 228.8 is 1000/4.37).  In terms of a mel-like formula:
>
>  m = 594.9 * ln(f/228.8 + 1)
>
> This is also very close to what I've been using in recent cochlear models
> for machine hearing (used by Malcolm Slaney in the 1993 auditory toolbox;
> actually I'm using 245 Hz now for some reason I don't recall).  So I guess
> it's time to take Don seriously at his suggestion to see if such a change
> away from mel scale and closer to reality would improve a speech system
> (vocoder or recognizer).  But I'm not in that business, so I'll have to bend
> some ears...toward a more logarithmic scale.
>
> Of course, with this relatively small deviation from logarithmic, there's
> also not a lot of deviation from bandwidth being a "constant Q" function of
> center frequency, so other simple parameterizations are likely to fit as
> well.  The Bark scale is an example of such a thing, and there are others;
> the Bark scale is closer to mel than to the Greenwood or ERB-rate scales.
>
> If you want to look at the mappings, they are compared at
> http://www.speech.kth.se/~giampi/auditoryscales/ ; but the normalization
> isn't at 1000 Hz, so it's hard to compare shapes, and they're not on a log
> frequency scale, so it's hard to see the predominantly log-like nature of
> the mappings.  So I took and modified the code from there, added Greenwood,
> and you can run it if you have MATLAB or Octave handy.  It's clear that the
> Greenwood and ERB-rate scales have a long "straight" log segment, and that
> the mel and Bark scales break at too high a frequency.
>
> % reference values at 1 kHz, used to normalize ERB and Bark to 1000
> f = 1000;
> erb_1k = 214 * log10(1 + f/228.8);
> bark_1k = 13*atan(0.00076*f)+3.5*atan((f/7000).^2);
>
> % evaluate all four scales from 10 Hz to 20 kHz (column vector)
> f = (10:10:20000)';
> erb = 214 * log10(1 + f/228.8);  % very close to lyon w 245 Hz break
> mel = 1127 * log(1 + f/700);
> bark = 13*atan(0.00076*f) + 3.5*atan((f/7000).^2);
> greenwood = 512.18 * log(1 + f/165.4);
>
> % mel and greenwood already equal about 1000 at 1 kHz
> semilogx(f, [1000*erb/erb_1k, mel, 1000*bark/bark_1k, greenwood])
> legend('ERB', 'Mel', 'Bark', 'Greenwood', 'Location', 'SouthEast')
> xlabel('frequency (Hz)')
> ylabel('normalized scales')
>
> Other things I found online include a study that evaluated different pitch
> scales on a speech intonation application:
> http://www.ling.cam.ac.uk/francis/Nolan%20Semitones.pdf  Here the log
> mapping (semitone scale) came out best, with ERB-rate not far behind (and
> presumably Greenwood's would have been better than ERB-rate, being a little
> closer to log).  Mel and Bark were not much better than linear; on this
> task, the frequency range of interest included just voice pitch range, up to
> 500 Hz, where these latter scales are essentially just linear.  It's not
> clear if this "repetition pitch" task is very closely related to the
> "frequency" scaling that the scales are designed to cover, but it's a step.
>
> Here's one:
> http://recherche.ircam.fr/equipes/analyse-synthese/burred/pdf/burred_AES121.pdf
> that concludes that Mel, ERB, and Bark are all significantly better than
> either constant-Q (log) or linear scales, for source separation of stereo
> mixtures.  But the results are about the same for the three "auditory"
> scales.
>
> Here's an ASR study that found no consistent best among ERB, Mel, and Bark:
> ftp://cs.joensuu.fi/pub/PhLic/2004_PhLic_Kinnunen_Tomi.pdf
>
> Any other good comparisons?
>
> Dick
>

--
___________________________________________________
Arturo Camacho Lozano
Visiting Professor
Escuela de Ciencias de la Computación e Informática
Universidad de Costa Rica
___________________________________________________
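
The scale family discussed in this thread, m = A ln(1 + f/s), can be checked numerically. What follows is a minimal Python sketch, not code from either post. The function names (scale_factor, warp, greenwood_break) and the choice to normalize every scale to 1000 at 1000 Hz are assumptions made for illustration. It recomputes the scale factors from the break frequencies quoted in the thread and evaluates the Greenwood range s = 165.4 k.

```python
import math

# Break frequencies s (Hz) quoted in the thread.
BREAKS = {"Greenwood": 165.4, "ERB-rate": 228.8, "mel": 700.0}


def scale_factor(s, f_ref=1000.0, m_ref=1000.0):
    """Return A such that A * ln(1 + f_ref/s) equals m_ref (requires s > 0)."""
    return m_ref / math.log1p(f_ref / s)


def warp(f, s, f_ref=1000.0, m_ref=1000.0):
    """Map frequency f (Hz) to m = A * ln(1 + f/s), normalized at f_ref."""
    return scale_factor(s, f_ref, m_ref) * math.log1p(f / s)


def greenwood_break(k):
    """Break frequency s = 165.4 * k (Hz) for Greenwood's constant k."""
    return 165.4 * k


# Scale factor and the position of a 100 Hz tone for each break frequency.
for name, s in BREAKS.items():
    print(f"{name:10s} s = {s:6.1f} Hz   A = {scale_factor(s):8.2f}   "
          f"m(100 Hz) = {warp(100.0, s):6.1f}")

# Greenwood suggests k between 0.8 and 1.
print("Greenwood s range (Hz):", greenwood_break(0.8), "to", greenwood_break(1.0))

# A very large s approaches the linear (Hertz) scale: m(500 Hz) is close to 500.
print("m(500 Hz) for s = 1e9:", warp(500.0, 1e9))
```

The printed scale factors should agree with the 512.18, 594.9, and 1127 used in the thread to the quoted precision. A lower break frequency places a 100 Hz tone higher on the normalized scale, which is consistent with the map staying logarithmic over a wider range.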

