Re: frequency to mel formula
I'd still like to understand more of the history of the Mel scale,
formulas for it, and its relationship to other scales; did
O'Shaughnessy come up with the 700? Or did he get it from somewhere
else? Someone figured the 1000 was just too high to be realistic?
I've been reviewing some of Don Greenwood's papers, the Wikipedia
article on his "Greenwood function" at
http://en.wikipedia.org/wiki/Greenwood_function , and Don's comments
from last January: http://www.auditory.org/postings/2009/53.html
Don says a good map of cochlear position x (from 0 at apex to 1 at
base) to frequency f in hertz is f = 165.4*(10^(2.1x) - 1). Solving
for x and scaling to get 1000 at f = 1000, we get a formula in the
form of the mel-scale formula:
m = 512.18 * ln(f/165.4 + 1).
The key here is not the scale factor, but the "break frequency",
165.4 Hz, that separates the log-like high-frequency region from the
linear-like low-frequency region. Don finds that the data imply a
much lower break frequency than has traditionally been used; his
papers show that the higher values (700 or 1000) are too high to fit
the published data that they're supposed to be based on. That means
the map is logarithmic over a wider range than usually recognized,
and that the critical bands at the low end are much narrower than
some scales would imply.
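As a quick check of the 512.18 constant, here is a minimal Python sketch of the inversion and rescaling described above. The function and variable names are illustrative only (they are not from Don's papers); the only inputs are the 165.4 and 2.1 from his formula.

```python
import math

A_HZ = 165.4   # break frequency in Greenwood's formula
SLOPE = 2.1    # coefficient of x in f = A_HZ * (10**(SLOPE * x) - 1)

def greenwood_frequency(x):
    """Frequency in Hz at cochlear position x (0 = apex, 1 = base)."""
    return A_HZ * (10.0 ** (SLOPE * x) - 1.0)

def greenwood_position(f_hz):
    """Inverse map: cochlear position x for a frequency in Hz."""
    return math.log10(f_hz / A_HZ + 1.0) / SLOPE

# Rescale x so that 1000 Hz maps to 1000, then fold the log10 -> ln
# conversion and the 1/SLOPE factor into one constant K.
scale_at_1k = 1000.0 / greenwood_position(1000.0)
K = scale_at_1k / (SLOPE * math.log(10.0))

def greenwood_mel_like(f_hz):
    """Mel-like form of the Greenwood map: K * ln(f/A_HZ + 1)."""
    return K * math.log(f_hz / A_HZ + 1.0)

print(f"K = {K:.2f}")
print(f"m(1000 Hz) = {greenwood_mel_like(1000.0):.1f}")
```

Note that the 2.1 drops out of K entirely: after normalizing at 1000 Hz, the shape of the curve depends only on the break frequency.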
The ERB-rate scale based on Glasberg and Moore 1990 has a
corresponding break point at 228.8 Hz, much closer to Greenwood's
interpretation than to the mel-scale interpretations (this is from
ERB = 24.7 (4.37F/1000 + 1), where 228.8 is 1000/4.37). In terms of a
mel-like formula, again scaled to give 1000 at f = 1000:
m = 594.9 * ln(f/228.8 + 1)
This is also very close to what I've been using in recent cochlear
models for machine hearing (used by Malcolm Slaney in the 1993
auditory toolbox; actually I'm using 245 Hz now for some reason I
don't recall). So I guess it's time to take Don seriously at his
suggestion to see if such a change away from mel scale and closer to
reality would improve a speech system (vocoder or recognizer). But
I'm not in that business, so I'll have to bend some ears...toward a
more logarithmic scale.
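Since every formula of the form m = k * ln(f/f_break + 1) is pinned to 1000 at 1000 Hz, the scale factor k follows from the break frequency alone. A minimal Python sketch, with hypothetical helper names, that reproduces the constants quoted in this thread:

```python
import math

def mel_like_scale_factor(break_hz, ref_hz=1000.0):
    """Scale factor k such that k * ln(ref_hz/break_hz + 1) == ref_hz."""
    return ref_hz / math.log(ref_hz / break_hz + 1.0)

def mel_like(f_hz, break_hz):
    """Mel-like mapping normalized to give 1000 at 1000 Hz."""
    return mel_like_scale_factor(break_hz) * math.log(f_hz / break_hz + 1.0)

# Break frequency implied by ERB = 24.7 * (4.37*F/1000 + 1)
erb_break_hz = 1000.0 / 4.37

for name, break_hz in [("mel", 700.0),
                       ("ERB-rate", erb_break_hz),
                       ("Greenwood", 165.4)]:
    k = mel_like_scale_factor(break_hz)
    print(f"{name:10s} break = {break_hz:6.1f} Hz   k = {k:7.2f}")
```

The three factors come out near 1127, 594.9, and 512.2, matching the mel, ERB-rate, and Greenwood formulas respectively. Trying 245 Hz is a one-line change.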
Of course, with this relatively small deviation from logarithmic,
there's also not a lot of deviation from bandwidth being a "constant
Q" function of center frequency, so other simple parameterizations
are likely to fit as well. The Bark scale is an example of such a
thing, and there are others; the Bark scale is closer to mel than to
the Greenwood or ERB-rate scales.
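To put a rough number on "not a lot of deviation" from constant Q, here is a small Python sketch. It rests on an assumption that should be stated plainly: that a critical band corresponds to a fixed step along the scale. For m = k * ln(f/f_break + 1) that makes bandwidth proportional to (f + f_break), so Q relative to its high-frequency limit is f/(f + f_break). It applies only to the log-form scales, not to the Bark formula, and the function name is illustrative.

```python
def relative_q(f_hz, break_hz):
    """Q at f_hz as a fraction of the pure-log (high-frequency) limit.

    Assumes one band is a fixed step in m = k*ln(f/break + 1), so that
    bandwidth is proportional to (f + break) and Q to f/(f + break).
    """
    return f_hz / (f_hz + break_hz)

breaks = [("Greenwood", 165.4), ("ERB-rate", 228.8), ("mel", 700.0)]
freqs_hz = [100.0, 300.0, 1000.0, 3000.0, 10000.0]

print("f (Hz)   " + "  ".join(f"{name:>9s}" for name, _ in breaks))
for f_hz in freqs_hz:
    row = "  ".join(f"{relative_q(f_hz, b):9.2f}" for _, b in breaks)
    print(f"{f_hz:7.0f}  {row}")
```

Under that assumption, Q is at half its limiting value exactly at the break frequency, so a lower break means constant-Q behavior extends further down in frequency.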
If you want to look at the mappings, they are compared at
http://www.speech.kth.se/~giampi/auditoryscales/ ; but the
normalization isn't at 1000 Hz, so it's hard to compare shapes, and
they're not on a log frequency scale, so it's hard to see the
predominantly log-like nature of the mappings. So I modified the code
from there to normalize at 1000 Hz and plot against log frequency,
added Greenwood, and you can run it if you have MATLAB or Octave
handy. It's clear that the Greenwood and
ERB-rate scales have a long "straight" log segment, and that the mel
and Bark scales break at too high a frequency.
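% Reference values at 1 kHz, used to normalize ERB-rate and Bark
f = 1000;
erb_1k = 21.4 * log10(1 + f/228.8); % Glasberg & Moore use 21.4
bark_1k = 13*atan(0.00076*f) + 3.5*atan((f/7500).^2); % Zwicker & Terhardt use 7.5 kHz
% Frequency grid, 10 Hz to 20 kHz, as a column vector
f = (10:10:20000)';
erb = 21.4 * log10(1 + f/228.8); % very close to lyon w 245 Hz break
mel = 1127 * log(1 + f/700); % already 1000 at 1 kHz
bark = 13*atan(0.00076*f) + 3.5*atan((f/7500).^2);
greenwood = 512.18 * log(1 + f/165.4); % already 1000 at 1 kHz
% Plot all four, normalized to 1000 at 1 kHz, on a log frequency axis
semilogx(f, [1000*erb/erb_1k, mel, 1000*bark/bark_1k, greenwood])
legend('ERB', 'Mel', 'Bark', 'Greenwood', 'Location', 'SouthEast')
xlabel('frequency (Hz)')
ylabel('normalized scales')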
Other things I found online include a study that evaluated different
pitch scales on a speech intonation application:
http://www.ling.cam.ac.uk/francis/Nolan%20Semitones.pdf Here the log
mapping (semitone scale) came out best, with ERB-rate not far behind
(and presumably Greenwood's would have been better than ERB-rate,
being a little closer to log). Mel and Bark were not much better
than linear; on this task, the frequency range of interest included
just voice pitch range, up to 500 Hz, where these latter scales are
essentially just linear. It's not clear if this "repetition pitch"
task is very closely related to the "frequency" scaling that the
scales are designed to cover, but it's a step.
Here's another study:
http://recherche.ircam.fr/equipes/analyse-synthese/burred/pdf/burred_AES121.pdf
that concludes that Mel, ERB, and Bark are all significantly better
than either constant-Q (log) or linear scales, for source separation
of stereo mixtures. But the results are about the same for the three
"auditory" scales.
Here's an ASR study that found no consistent best among ERB, Mel, and Bark:
ftp://cs.joensuu.fi/pub/PhLic/2004_PhLic_Kinnunen_Tomi.pdf
Any other good comparisons?
Dick