Re: STFT vs Power Spectral in Musical recognition system ? ("Richard F. Lyon" )


Subject: Re: STFT vs Power Spectral in Musical recognition system ?
From:    "Richard F. Lyon"  <DickLyon@xxxxxxxx>
Date:    Thu, 31 Aug 2006 16:00:31 -0700

Arturo,

I totally agree with the idea of using log(1 + KM), or log(M + epsilon)
as I usually do it. This kind of nonlinearity is especially important
in systems with an imprecisely known zero level or a variable noise
floor.

On the other hand, a power law, though it has an infinite slope at 0,
is not half as bad as a plain log, and lots of people use that anyway.
A stabilized power law, like (M + epsilon)^(1/3), is another good
choice, probably more in line with perception than letting it go
log-like at high magnitudes.

With any of these, adjusting your parameters to accommodate a realistic
range of input signal levels becomes important; you can no longer
ignore scale factors and hope for algorithms to work fine on inputs
varying over many orders of magnitude in scale.

Dick

At 6:32 PM -0400 8/31/06, Arturo Camacho wrote:
>One problem with the square-root compression is that its slope
>approaches infinity as the magnitude M approaches zero. A more
>appropriate approach may be to use log(1+KM), where K is a constant to
>be determined. The response of this function is almost logarithmic for
>high magnitudes and almost linear for low magnitudes. Of course, the
>determination of the optimal value for K given an input is not
>trivial.
>
>Arturo
>--
>__________________________________________________
>
> Arturo Camacho
> PhD Candidate
> Computer and Information Science and Engineering
> University of Florida
>
> E-mail: acamacho@xxxxxxxx
> Web page: www.cise.ufl.edu/~acamacho
>__________________________________________________
>
>On Fri, 25 Aug 2006, Richard F. Lyon wrote:
>
>> Edwin,
>>
>> A power spectral density is only defined for stationary signals, not
>> music. The STFT generalizes it to short segments, if you use the
>> squared magnitude.
>>
>> The differences between the absolute value, square, log, etc. are just
>> point nonlinearities that do not change the information content, but
>> do change the metric structure of the space a bit. Log is too
>> compressed, leading to too much emphasis on near-silent segments,
>> while the square (the power you ask about) is too expanded, leading
>> to too much emphasis on the louder parts. A good compromise is
>> around a square root or cube root of magnitude (roughly matching
>> perceptual magnitude via Stevens's law), but the magnitude itself is
>> also sometimes acceptable, depending on what you're doing.
>>
>> Dick
>>
>> At 7:12 AM -0700 8/25/06, Edwin Sianturi wrote:
>> >Hello,
>> >
>> >I am just a master's student, doing my internship. Right now, I am
>> >building a musical instrument recognition system. I have read
>> >several papers on it, and I am just curious:
>> >
>> >All the papers/journals that I have read use the STFT, a.k.a. the
>> >|X(t,f)| of a signal x(t), in order to extract several (spectral)
>> >features to be used as the input to the recognition system.
>> >
>> >What are the reasons behind using the |X(t,f)| instead of using the
>> >"power spectral" |X(t,f)|^2 ?
>> >(technically speaking, a power spectral density is the expectation
>> >of |X(f)|^2, i.e. E(|X(f)|^2) )
>> >
>> >Thanks in advance,
>> >
>> >Edwin SIANTURI
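
[Editorial illustration, not part of the original thread.] The compressive nonlinearities discussed above can be compared numerically. The following is only a minimal sketch: the function names, the default K = 1 and epsilon = 1e-3, and the forward-difference slope estimate are all assumptions chosen for illustration, not values recommended by anyone in the thread.

```python
import math

def log1p_compress(m, k=1.0):
    """log(1 + K*M): near-linear for small M, near-logarithmic for large M.
    K = 1.0 is an arbitrary illustrative default."""
    return math.log1p(k * m)

def offset_log(m, eps=1e-3):
    """log(M + epsilon): a log whose floor is set by epsilon (assumed value)."""
    return math.log(m + eps)

def stabilized_power(m, eps=1e-3, p=1.0 / 3.0):
    """(M + epsilon)**p: a power law whose slope at M = 0 stays finite."""
    return (m + eps) ** p

def slope_at(f, m, h=1e-9):
    """Forward-difference estimate of df/dM at M (illustrative helper)."""
    return (f(m + h) - f(m)) / h

if __name__ == "__main__":
    # Finite slopes at zero magnitude for the stabilized forms.
    print("log(1 + KM) slope at 0:      ", slope_at(log1p_compress, 0.0))
    print("log(M + eps) slope at 0:     ", slope_at(offset_log, 0.0))
    print("(M + eps)^(1/3) slope at 0:  ", slope_at(stabilized_power, 0.0))
    # A plain square root has no finite slope at 0: the estimate keeps
    # growing as the step size shrinks.
    for h in (1e-4, 1e-6, 1e-8):
        print("sqrt slope estimate, h =", h, ":", slope_at(math.sqrt, 0.0, h))
```

Because K and epsilon are fixed constants, rescaling the input M changes where each curve bends, which is the scale-factor sensitivity the reply warns about; in practice these constants would have to be tuned to the expected range of input levels.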


This message came from the mail archive
http://www.auditory.org/postings/2006/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University