Subject: Human Audio Perception FAQ (v.2 / June 4, 1994)
From: Argiris Kranidiotis <akra(at)URANUS.DI.UOA.ARIADNE-T.GR>
Date: Tue, 7 Jun 1994 12:53:28 +0300

Note: This is my second attempt to post this message. I apologize if you
get this message for a second time.

Argiris A. Kranidiotis
----------------------------------------------------------------------------

 ______________________________________________________
|                                                      |
|  HUMAN AUDIO PERCEPTION FREQUENTLY ASKED QUESTIONS   |
|           version 2.0      June 4, 1994              |
|______________________________________________________|


I n t r o d u c t i o n
---------------------------

It all started from a recent UseNet posting of mine. From the volume of
mail I received, it seems to be a very interesting subject. I decided to
release an edited version of all the answers I have received so far in the
form of a F.A.Q. (Frequently Asked Questions).

This version is preliminary. It is still *VERY* incomplete. With your help
I will try to make it as complete as possible. Please read on to see what
additional information is needed...

The main topic remains the same:

Given two spectra (STFFTs, Short Time Fast Fourier Transforms, for example)
we try to estimate a psychoacoustic distance between them (i.e. a timbral
metric).

This involves some additional data:

1) Equal loudness curves (Fletcher-Munson). Originally published in
   J.A.S.A. (Journal of the Acoustical Society of America) in 1933.
   Please send me your data/approximations/formulae.
   More information is still needed on this subject.

2) Bark frequency scale (Critical Bands). I have found some approximations
   in the range 0..5 kHz. Again, more precise information is needed.

3) "Masking" effects. Useful introductory information can be found in the
   MPEG Audio compression FAQ (available via anonymous FTP at
   sunsite.unc.edu, at the IUMA archive).

4) Other psychoacoustic data?
______________________________________________________________________________

-MANY THANKS to all the kind people who contributed to this text (they are
 too many to list).

-My comments are put in square brackets [ ... ].

-A recent version of this text is available via anonymous FTP at:
 svr-ftp.eng.cam.ac.uk (maintained by Tony Robinson <ajr(at)eng.cam.ac.uk>)
 Directory: /pub/comp.speech/info , Filename: HumanAudioPerception.
 Please note that this FAQ is *NOT* restricted to speech topics.

                       Argiris A. Kranidiotis
                       University Of Athens
                       Informatics Department
                       akra(at)zeus.di.uoa.ariadne-t.gr
______________________________________________________________________________

                          Equal loudness curves
______________________________________________________________________________

From: Various people
------------------------------------------------------------------------

-Fletcher-Munson curves (the most popular answer).
 Peak sensitivity at 3,300 Hz, falling off below 40 Hz and above 10 kHz.

-"An Introduction to the Psychology of Hearing", by Moore, 3rd edition
 (the most popular reference).

From: Vincent Pagel <Vincent.Pagel(at)loria.fr>
------------------------------------------------------------------------

[...]
It's a family of curves [Fletcher-Munson curves --AK] a bit like this:

   (ASCII plot of a single equal loudness curve. Vertical axis: dB.
    Horizontal axis: Frequency (Hz), marked at 400, 2500, 6000, 10000
    and 20000.)

PERCEPTUALLY all the sounds corresponding to the points on the curve have
the same intensity (loudness): this means that the ear has a large range
where it is nearly linear (1000 to 8000 Hz), achieving its best results
over a small domain (around 3000 Hz if my memory serves).
[ the curve has a minimum at 3,300 Hz -- AK ]
The sensitivity drops dramatically above 10000 Hz and below 500 Hz.

You can draw different equal loudness curves depending on the first
intensity you begin with (e.g.
if the intensity at 2500 Hz is 50 dB you get one curve, but if you start at
2500 Hz with 70 dB you get another equal loudness curve ... generally,
equal loudness curves have nearly the same shape, and that shape does not
depend too much on the starting point).

To my knowledge there is no mathematical formula given to approximate equal
loudness curves, but with the data in the book by Moore it should not be
very difficult to find an approximation.

From: Angelo Campanella <acampane(at)magnus.acs.ohio-state.edu>
------------------------------------------------------------------------

Obtain the ISO "Zero Phons" standard threshold of human hearing.

-The standard was ISO 389-1975 "Audiometer Standard Reference Zero".
-The US equivalent is ANSI S3.6 - 1969.

The following numbers apply. These are dB re 20 micropascals for a pure
tone or very narrow band noise:

--------------------------------------------------------------------------
Audio Frequency (Hz)    125   250   500  1000  2000  3000  4000  6000  8000
==========================================================================
Human (Monaural)
Threshold of Hearing   45.5  24.5    11   6.5   8.5   7.5     9     8   9.5
--------------------------------------------------------------------------

Normal young adult with undisturbed hearing. dB re 20 micropascals.

Binaural hearing is 10 to 15 dB better, since the brain has a magnificent
capability to correlate the simultaneous listening of both ears.

From: walkow(at)compsci.bristol.ac.uk (Tomasz Walkowiak)
------------------------------------------------------------------------

The equal loudness curve can be approximated by:

   E(w) = 1.151*SQRT( (w^2+144*10^4)*w^2 / ((w^2+16*10^4)*(w^2+961*10^4)) )

Source: Robinson et al., Br. J. Appl. Phys. 7, 166-181, 1956.

This approximation is for a Nyquist frequency equal to 5 kHz, so
w = 2*Pi*f/5kHz, for 0<f<5kHz. Therefore E(w) is defined for 0<w<Pi.
E(w) is linear, and it is usually applied to the power spectrum.
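
As a rough illustration only (this is not Tomasz Walkowiak's code), the
E(w) approximation above could be written in Python as follows. The
function names are made up for this sketch. The posting does not pin down
the units of w unambiguously, so the sketch takes w as given and leaves the
conversion from f to w to the caller. Applying E(w) to a power spectrum by
multiplying each bin is likewise an assumption.

```python
import math


def equal_loudness_weight(w):
    """Return E(w) exactly as the formula is written in the posting.

    The caller is responsible for supplying w in whatever units the
    approximation expects; this sketch does not convert from Hz.
    """
    w2 = w * w
    numerator = (w2 + 144e4) * w2
    denominator = (w2 + 16e4) * (w2 + 961e4)
    return 1.151 * math.sqrt(numerator / denominator)


def weight_power_spectrum(ws, power):
    """Scale each power-spectrum bin by E(w) (assumed: plain multiplication)."""
    if len(ws) != len(power):
        raise ValueError("ws and power must have the same length")
    return [equal_loudness_weight(w) * p for w, p in zip(ws, power)]
```

Note that E(0) = 0 and that E(w) tends to the constant 1.151 as w grows,
which follows directly from the formula.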
______________________________________________________________________________

                       Bark scale / Critical Bands
______________________________________________________________________________

From: basbug(at)netcom.com (Filiz Basbug)
------------------------------------------------------------------------

From a paper given by David Lubman at Inter-Noise '92 (Toronto), the
critical band rate (z) in Bark can be determined by

   z = 13*arctan(0.76*f) + 3.5*arctan(f^2/56.25)

where f is in kHz and the angles returned from the arctangent expressions
are in radians. When z is an integer, f is the dividing line frequency
between two critical bands.

If the frequency corresponding to a particular Bark value (z) is desired,
use the following:

   f = [(exp(0.219*z)/352) + 0.1]*z - 0.032*exp[-0.15*(z-5)^2]

where f is in kHz.

Finally, the critical bandwidth (df) can be calculated for a given center
frequency (f) by

   df = 25 + 75*[1 + 1.4*(f^2)]^0.69

where f is in kHz and df is in Hz.

There are no explicitly stated limits on the variables, but according to
the table that Mr. Lubman generated from the formulas, 1<=z<=24 for Bark,
and 20<=f<=15500 (in Hz) for frequency, except 50<=f<=13500 (in Hz) for the
center frequencies. (df) ranges from 100 Hz to 3500 Hz.

Also note that these formulas are generally accepted approximations but,
as far as I know, are not yet standardized. I believe they have all been
empirically derived.

Calculation of the psychoacoustic loudness of steady-state sounds is
defined in ISO 532, ISO Rec. 675, and DIN 45631. Extension to non-steady
sounds was defined by Zwicker but is not yet standardized (as of 1992).

______________________________________________________________________________

                             Masking effects
______________________________________________________________________________

From: Vincent Pagel <Vincent.Pagel(at)loria.fr>
--------------------------------------------------------------------------

[...]
About curves corresponding to the masking effect:

Those curves show the minimal intensity a sound with a given frequency must
have to be perceived, when played simultaneously with a sound whose
frequency is held constant during the experiment (e.g. let's say that you
want to find out the masking effect of a 500 Hz frequency ... you'll play
it at, for example, 50 dB ... and at the same time you'll play another
frequency, and you adjust the level of the second frequency to find the
limen where it is perceived. For example, a sound played at 1000 Hz has to
be louder than a sound at 700 Hz, because it's a harmonic of the masking
frequency of 500 Hz).

______________________________________________________________________________

                   Psychoacoustic norm / Timbral Metric
______________________________________________________________________________

From: Fahey(at)psyvax.psy.utexas.edu (Richard Fahey)
--------------------------------------------------------------------------

These curves [Fletcher-Munson again... --AK] may be used to normalize
spectra for loudness at different frequencies (changing dB into phons), and
with a further change into sones one obtains a loudness density plot. The
plot can be made more psychologically real by changing the frequency scale
to the Bark scale, and using an auditory filter to smear the spectrum. The
distance between two spectra represented in ways similar to this can be
calculated as a Euclidean distance, and compared with psychoacoustic data.

From: James Beauchamp <beaucham(at)uxh.cso.uiuc.edu>
--------------------------------------------------------------------------

Here, we are comparing two time-varying spectra which are very similar to
one another. This would be used to measure the efficiency of a particular
synthesis technique. Our first guess was to use:

                 SUM(k=1 to n) (A2(t,k) - A1(t,k))^2
   e(t) = sqrt( ------------------------------------- )
                      SUM(k=1 to n) A1(t,k)^2

which gives a normalized difference (error) vs. time.
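
As a rough illustration only (this is not James Beauchamp's code), e(t)
above could be computed in Python as sketched here. The nested-list layout
a1[t][k] and a2[t][k] (frame t, partial k), the function name, and the NaN
returned for a silent reference frame are all assumptions of this sketch;
the posting does not specify any of them.

```python
import math


def normalized_spectral_error(a1, a2):
    """Return the list of e(t) values, one per analysis frame t.

    a1[t][k] is the amplitude of partial k in frame t of the reference
    signal, a2[t][k] the same for the synthesis being evaluated.
    """
    if len(a1) != len(a2):
        raise ValueError("a1 and a2 must have the same number of frames")
    errors = []
    for frame1, frame2 in zip(a1, a2):
        if len(frame1) != len(frame2):
            raise ValueError("frames must have the same number of partials")
        # Numerator: summed squared amplitude difference over all partials.
        diff_energy = sum((y - x) ** 2 for x, y in zip(frame1, frame2))
        # Denominator: summed squared amplitude of the reference frame.
        ref_energy = sum(x * x for x in frame1)
        if ref_energy == 0.0:
            # The formula is undefined here; NaN is a choice of this sketch.
            errors.append(math.nan)
        else:
            errors.append(math.sqrt(diff_energy / ref_energy))
    return errors
```

For instance, a frame whose synthesis matches the reference exactly gives
e(t) = 0, and a synthesis that is silent where the reference is not gives
e(t) = 1.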
Here k is the partial number, t is time, and A1(t,k) and A2(t,k) are the
kth partial amplitudes vs. time for signals s1(t) and s2(t).

Then the average error over time is given by

   e_ave = (1/DUR) SUM(t=0 to DUR) e(t)

The theory is that, given two syntheses of signal s1, namely s2 and s3,
s2 is a better synthesis of s1 than s3 is if e_ave_2 < e_ave_3.

This formulation seems to work fairly well, but it really fails when a
synthesis has weak upper partials not found in the original. The weak upper
partials contribute very little to the error calculation, but make a big
difference in the perceived result. Therefore, it would probably be much
better to add up the amplitudes within critical bands than to give all
frequencies equal weights as we have been doing, and also to use an
amplitude-to-loudness (in sones) translation. (Usually, S = K*A^0.6).

The problem with equalizing the A(t,k) using the Fletcher-Munson curves is
that one doesn't really know the absolute level of a given sound prior to
playing it back, except in a lab testing situation, perhaps. Thus, the
difference result would vary with playback level, an uncomfortable
situation.

From: Richard Parncutt <parncutt(at)sound.music.mcgill.ca>
-------------------------------------------------------------------------

The psychoacoustic distance between two steady state complex sounds (or its
converse, perceived similarity) is influenced by a number of factors,
including similarity of loudness, timbral similarity, and the degree to
which the sounds have pitches in common (where by "pitch" I mean PERCEIVED
pitch in the psychoacoustic sense.)

Terhardt (1972) distinguished two kinds of pitch. Spectral pitches
correspond to individual audible pure-tone components. Virtual pitches
correspond to groups of audible pure-tone components whose frequencies form
an approximately harmonic pattern, suggesting the presence of an (embedded)
harmonic-complex tone. Most pitches perceived in everyday and musical
sounds are virtual pitches.
The relative perceptual salience of pitches may be estimated by the
algorithm of Terhardt et al. (1982).

Parncutt (1989) defined the pitch commonality of two complex sounds as the
extent to which they have perceived pitches in common, depending on the
number and salience of coinciding pitches (by comparison to non-coinciding
pitches). Calculated pitch commonality values correlate well with
similarity judgments of pairs of complex sounds that differ relatively
little in loudness and timbre (Parncutt, 1989, 1993), and with
music-theoretic accounts of the strength of harmonic relationship between
musical tones and chords (Parncutt, 1989).

From: Christopher John Rolfe <rolfe(at)sfu.ca>
-------------------------------------------------------------------------

Metrics have a long tradition in the literature, beginning with Fechner in
the 19th century. Cognitive science, however, points out that perceptual
space may be non-Euclidean. In other words, there is NO simple metric.

______________________________________________________________________________

                            References / Books
______________________________________________________________________________

"Loudness: its definition, measurement, and calculation", Journal of the
Acoustical Society of America, 1933, vol 5, p 9.

Author: Fry R.B.
PhD Dissertation, Duke University
Title: Measurement of Specific Sequence Effects in Loudness Perception
Date: 1981

Author: Lane H.L., Catania A.C., Stevens S.S.
Title: Voice Level: Autophonic Scale, Perceived Loudness, and Effects of
       Sidetone
Journal: JASA
Volume: 33
Number: 2
Page(s): 160-167
Date: 1961

Author: Peterson G.E., McKinney N.P.
Title: The measurement of speech power
Journal: Phonetica
Volume: 7
Page(s): 65-84
Date: 1961

Author: Schlauch R.S., Wier C.C.
Title: A Method for Relating Loudness-Matching and
       Intensity-Discrimination Data
Journal: Journal of Speech and Hearing Research
Volume: 30
Page(s): 13-20
Date: 1987

Author: Small A.M., Brandt J.F., Cox P.G.
Title: [...?]
       function of signal duration
Journal: JASA
Volume: 34
Page(s): 513-514
Date: 1962

Author: Stevens S.S.
Title: Calculation of the Loudness of Complex Noise
Journal: JASA
Volume: 28
Number: 5
Page(s): 807-832
Date: 1956

Handel, S. (1989). "Listening: an introduction to the perception of
auditory events." MIT, Cambridge, MA.

Dooling, R. J. and Hulse, S. H. (ed.) (1989). The comparative psychology of
audition: Perceiving complex sounds. Erlbaum, Hillsdale, NJ.

McAdams, S. and Bigand, E. (ed.) (1993). Thinking in sound: the cognitive
psychology of human audition. Oxford Univ. Press, NY.

Sloboda, J. A. (1985). The musical mind: The cognitive psychology of music.
Clarendon, Oxford.

Proceedings of IEEE, V. 81, No 10, "Signal Compression Based on Models of
Human Perception".

Grey, J.M. "Multidimensional Perceptual Scaling of Musical Timbres".
Journal of the Acoustical Society of America, 63, 1493-1500.

Repp, B.H. (1984). "Categorical perception: Issues, methods, findings".
In N.J. Lass (ed.), Speech and Language: Advances in Basic Research and
Practice. Vol. 10. 1249-1257.

Moore and Glasberg, JASA 74(3) 1983.
"Suggested formulae for calculating auditory-filter bandwidths and
excitation patterns"

Bladon and Lindblom, JASA 69(5) 1981.
"Modeling the judgement of vowel quality differences"

J. R. Pierce, The Science of Musical Sound (Freeman, New York, 1983).

J. G. Roederer, Introduction to the Physics and Psychophysics of Music
(Springer-Verlag, New York, 1975).

S. S. Stevens, "Measurement of Loudness", JASA 27 (1955): 815

S. S. Stevens, "Neural Events and the Psychophysical Law",
_Science 170_ (1970): 1043

E. Zwicker, G. Flottorp, and S. S.
Stevens, "Critical Bandwidth in Loudness Summation", JASA 29 (1957): 548

Author: Hynek Hermansky
Institution: Speech Technology Laboratory, Division of Panasonic
             Technologies, Inc., 3888 State Street, Santa Barbara,
             CA 93105, USA
Title: Perceptual linear predictive (PLP) analysis of speech
Journal: JASA
Year: 1990
Volume: 87
Number: 4
Page(s): 1738-1752

Gersho et al. (Bark Spectral Distance). IEEE Journal on Selected Areas in
Communications, Sept. (?) 1992

Name: "An Introduction to the Physiology of Hearing"
Author: James O. Pickles, Dept. of Physiology, Univ. of Birmingham, England.
Publisher: Academic Press, 1982.
ISBN 0-12-554750-1 (hardback)
ISBN 0-12-554752-8 (paperback).

"An Introduction to the Psychology of Hearing" by B. Moore, 3rd edition.

Terhardt, E. (1972). Zur Tonhoehenwahrnehmung von Klaengen (Perception of
the pitch of complex tones). Acustica, 26, 173-199.

Terhardt, E., Stoll, G., & Seewann, M. (1982). Algorithm for extraction of
pitch and pitch salience from complex tonal signals. Journal of the
Acoustical Society of America, 71, 679-688.

[ The following papers are from Richard Parncutt
  (parncutt(at)sound.music.mcgill.ca) -- AK ]

Bigand, E., Parncutt, R., & Lerdahl, F. (under review). Perception of
musical tension in short chord sequences: The influence of harmonic
function, sensory dissonance, horizontal motion, and musical training.
Perception and Psychophysics.

Parncutt, R. (1993). Pitch properties of chords of octave-spaced tones.
Contemporary Music Review, 9, 35-50.

Parncutt, R. (1989). Harmony: A Psychoacoustical Approach. Springer-Verlag,
Berlin. (Springer Series in Information Sciences, Vol. 19. Eds.: T.S. Huang
& M.R. Schroeder. ISBN 3-540-51279-9. 218 pages, 22 figs.)

Stoll, G., & Parncutt, R. (1987). Harmonic relationship in similarity
judgments of nonsimultaneous complex tones. Acustica, 63, 111-119.

Terhardt, E., Stoll, G., Schermbach, R., & Parncutt, R. (1986).
Tonhoehenmehrdeutigkeit, Tonverwandtschaft und Identifikation von
Sukzessivintervallen (Pitch ambiguity, harmonic relationship, and melodic
interval identification). Acustica, 61, 57-66.

______________________________________________________________________________

--
Argiris A. Kranidiotis              E-mail (Internet):
University Of Athens
Informatics Department              akra(at)zeus.di.uoa.ariadne-t.gr