Re: perceptual evaluation of cochlear models

Dear Francesco Tordini,

Since your discussion has now raised the question of the perceptual aspects of cochlear models, and especially loudness predictions for realistic signals, I would like to draw your attention to some of our own recent work on these topics:

1) Models for time-varying signals should be used for environmental sounds, since these are often non-stationary. There are essentially two approaches (Glasberg and Moore, 2002; Chalupper and Fastl, 2002), which differ slightly in their predictions, see

Rennies J, Verhey JL, Fastl H (2010) Comparison of loudness models for time-varying sounds. Acta Acustica united with Acustica 96, 383-396.

2) None of the models seem to be perfect when predicting loudness of speech materials, see

Rennies J, Holube I, Verhey JL (2013) Loudness of speech and speech-like signals. Acta Acustica united with Acustica 99, 268-282.

3) Realistic cochlear models have rarely been used to predict psychophysical data, although it may be possible, see

Heeren W, Rennies J, Verhey JL (2011) Spectral loudness summation of non-simultaneous tone pulses. J. Acoust. Soc. Am. 130(6), 3905-3915.

4) Spectro-temporal effects (including persistence effects) are more complex than currently modelled, see e.g.,

Heeren W, Rennies J, Verhey JL (2011) Spectral loudness summation of non-simultaneous tone pulses. J. Acoust. Soc. Am. 130(6), 3905-3915.

The latter (4) may partly explain why we still do not have a totally convincing loudness model for environmental sounds.

Kind regards

Jesko

--
Prof. Dr. Jesko Verhey

Department of Experimental Audiology
Otto-von-Guericke University Magdeburg
Leipziger Str. 44
39120 Magdeburg
Germany

++49 391 67-13885 (phone)
++49 391 67-13888 (fax)
Email: jesko.verhey@xxxxxxxxxxx

From: AUDITORY - Research in Auditory Perception [mailto:AUDITORY@xxxxxxxxxxxxxxx] On behalf of ftordini@xxxxxxxxx
Sent: Monday, 8 September 2014 18:44
To: AUDITORY@xxxxxxxxxxxxxxx
Subject: Re: perceptual evaluation of cochlear models

Hello Joshua,
Thank you for the great expansion and for the further reading suggestions.
I may add three more items to the list, hoping to be clear in my formulation.

(1) A perhaps provocative question could be: is there one loudness, or are there several? (Is loudness domain dependent?) Should we continue to tackle loudness as an invariant percept across classes once we move on to the more complex domain of real sounds? Rephrasing: once we define an ecologically valid taxonomy of real-world sounds (e.g. starting from Gaver), can we expect the loudness model we want to improve to be valid across (sound) classes? Hard to say. I would venture 'yes', but allowing for different parameter tuning according to the dominant context (say, speech, music, or environmental sounds). [Hidden question: are we actually, ever, purely "naive" listeners?]

(2) A related question: can we jump from the controlled lab environment into the wild in a single step? I'd say no. The approach followed by EBU/ITU, using long real-world stimuli, is highly relevant to the broadcasting world, but it is hard to distinguish between energetic and informational masking effects using real program material made mostly of speech and music. Sets of less informative sources taken from natural environmental sounds may be a good compromise: a starting point for addressing basic expansions of the current loudness model(s). Such strategies and datasets are missing (to my knowledge).

(3) The role of space. Physiologically driven models (Moore, Patterson) are supported mostly by observations obtained using non-spatialized, or dichotic, scenes, which reveal the mechanisms better by removing the spatial confound. However, while spatial cues are considered to play a secondary role in scene analysis, spatial release from masking is quite important in partial loudness modeling, at least from the energetic masking point of view and especially for complex sources. This is even more relevant for asymmetric source distributions. I feel there is much to do before we can address this aspect with confidence, even limiting the scope to non-moving sources, but more curiosity about spatial variables may be valuable when designing listening experiments with natural sounds.
[If one asks a sound engineer working on a movie soundtrack "where do you start from?", he will start talking about panning, to set the scene using his sources (foley, dialogue, music, ...), and **then** adjust levels/EQ ...]

Best,
--

Francesco Tordini

http://www.cim.mcgill.ca/sre/personnel/
http://ca.linkedin.com/in/ftordini

>----Original message----
>From: joshua.reiss@xxxxxxxxxx
>Date: 06/09/2014 13.43
>To: "ftordini@xxxxxxxxx"<ftordini@xxxxxxxxx>, "AUDITORY@xxxxxxxxxxxxxxx"<AUDITORY@xxxxxxxxxxxxxxx>
>Subject: RE: RE: perceptual evaluation of cochlear models
>
>Hi Francesco (and auditory list in case others are interested),
>I'm glad to hear that you've been following the intelligent mixing research.
>
>I'll rephrase your email as a set of related questions...
>
>1. Should we extend the concepts of loudness and partial loudness to complex material? - Yes, we should. Otherwise, what is it good for? That is, what does it matter if we can accurately predict the perceived loudness of a pure tone, or the just noticeable differences between pedestal increments for white or pink noise, or the partial loudness of a tone in the presence of noise, etc., if we can't predict loudness outside artificial laboratory conditions? I suppose it works as validation of an auditory model, but it's still very limited.
>On the other hand, if we can extend the model to complex sounds like music, conversations, environmental sounds, etc., then we provide robust validation of a general model of human loudness perception. The model can then be applied to metering systems, audio production, broadcast standards, improved hearing aid design and so on.
>
>2. Can we extend the concepts of loudness and partial loudness to complex material? - Yes, I think so. Despite all the issues and complexity, there's a tremendous amount of consistency in perception of loudness, especially when one considers relative rather than absolute perception. Take a TV show and the associated adverts. The soundtracks of both may have dialogue, foley, ambience, music,..., all with levels that vary over time. Yet people can consistently identify when the adverts are louder than the show. The same is true when someone changes radio stations, and in music production, sound engineers are always identifying and dealing with masking when there are multiple simultaneous sources.
>I think that many issues relating to complex material may have a big effect on the perception of timbre or the extraction of meaning or emotion, but only a minor effect on loudness.
>
>3. Can we extend current auditory models of loudness and partial loudness to complex material? - Hard to say. The state-of-the-art models based on a deep understanding of the human hearing system (Glasberg, Moore et al... ; Fastl, Zwicker, et al...) were not developed with complex material in mind, though when they are used with complex material, researchers have reported good but far from great agreement with perception. Modification, while still in agreement with auditory knowledge, shows improvement, but more research is needed.
>On the other hand, we have models based mostly on listening test data, but incorporating little auditory knowledge. I'm thinking here of the EBU/ITU loudness standards. They are based largely on Gilbert Soulodre's excellent listening test results
>(G. Soulodre, Evaluation of Objective Loudness Meters, 116th AES Convention, 2004.), and represent a big improvement on, say, just applying a loudness contour to signal RMS. But they are generally for a fixed listening level, may overfit the data, are difficult to generalise, and rarely give deeper insight into the auditory system. Furthermore, like Moore's model, these have also shown some inadequacies when dealing with a wider range of content (Pestana, Reiss & Barbosa, "Loudness Measurement of Multitrack Audio Content Using Modifications of ITU-R BS.1770," 134th AES Convention, 2013).
>So I think rather than just extend, we may need to modify, improve, and go back to the drawing board on some aspects.
>
>4. How could one develop an auditory model of loudness and partial loudness for complex material?
>- Incorporate the validated aspects from prior models, but reassess any compromises.
>- Use listening test results from a wide range of complex material. Perhaps a metastudy could be performed, taking listening test results from many publications for both model creation and validation.
>- Build in known aspects of loudness perception that were left out of existing models due to resources and the fact that they were built for lab scenarios (pure tones, pink noise, sine sweeps...). In particular, I'm thinking of forward and backward masking.
>
>5. What about JND? - I would steer clear of this. I'm not even aware of anecdotal evidence suggesting consistency in just noticeable differences for, say, a small change in the level of one source in a mix. And I think one can be trained to identify small partial loudness differences. I've had conversations with professional mixing engineers who detect a problem with a mix that I don't notice until they point it out. But the concept of extending JND models to complex material is certainly very interesting.
>
>________________________________________
>From: ftordini@xxxxxxxxx <ftordini@xxxxxxxxx>
>Sent: 04 September 2014 15:45
>To: Joshua Reiss
>Subject: R: RE: perceptual evaluation of cochlear models
>
>Hello Joshua,
>Interesting, indeed. Thank you.
>
>So the question is - to what extent can we stretch the concepts of loudness
>and partial loudness for complex material such as meaningful noise (aka music),
>where attention and preference are likely to play a role, as opposed to beeps and
>sweeps? That is - would you feel comfortable giving a rule of thumb for a
>JND for partial loudness, to safely rule out other factors when mixing?
>
>I was following your intelligent mixing thread - although I've missed the
>recent one you sent me - and my question above relates to the possibility to
>actually "design" the fore-background perception when you do automatic mixing
>using real sounds...
>I would greatly appreciate any comment from your side.
>
>Best wishes,
>Francesco
>
>
>>----Original message----
>>From: joshua.reiss@xxxxxxxxxx
>>Date: 03/09/2014 16.00
>>To: "AUDITORY@xxxxxxxxxxxxxxx"<AUDITORY@xxxxxxxxxxxxxxx>, "Joachim Thiemann"
><joachim.thiemann@xxxxxxxxx>, "ftordini@xxxxxxxxx"<ftordini@xxxxxxxxx>
>>Subject: RE: perceptual evaluation of cochlear models
>>
>>Hi Francesco and Joachim,
>>I collaborated on a paper that involved perceptual evaluation of partial
>loudness with real world audio content, where partial loudness is derived from
>the auditory models of Moore, Glasberg et al. It showed that the predicted
>loudness of tracks in multitrack musical audio disagrees with perception, but
>that minor modifications to a couple of parameters in the model would result in
>a much closer match to perceptual evaluation results. See
>>Z. Ma, J. D. Reiss and D. Black, "Partial loudness in multitrack mixing," AES
>53rd International Conference on Semantic Audio in London, UK, January 27-29,
>2014.
>>
>>And in the following paper, there was some informal evaluation of whether
>Glasberg, Moore et al.'s auditory model of loudness and/or partial loudness
>could be used to mix multitrack musical audio. Though the emphasis was on
>application rather than evaluation, it also noted issues with the model when
>applied to real world content. See
>>D. Ward, J. D. Reiss and C. Athwal, "Multitrack mixing using a model of
>loudness and partial loudness," 133rd AES Convention, San Francisco, Oct. 26-
>29, 2012.
>>
>>These may not be exactly what you're looking for, but hopefully you find them
>interesting.
>>________________________________________
>>From: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxxxxxxxxx>
>on behalf of Joachim Thiemann <joachim.thiemann@xxxxxxxxx>
>>Sent: 03 September 2014 07:12
>>To: AUDITORY@xxxxxxxxxxxxxxx
>>Subject: Re: perceptual evaluation of cochlear models
>>
>>Hello Francesco,
>>
>>McGill alumnus here - I did a bit of study in this direction, you can
>>read about it in my thesis:
>>http://www-mmsp.ece.mcgill.ca/MMSP/Theses/T2011-2013.html#Thiemann
>>
>>My argument was that if you have a good auditory model, you should be
>>able to start from only the model parameters and be able to
>>reconstruct the original signal with perceptual transparency. I was
>>looking at this in the context of perceptual coding - a perceptual
>>coder minus the entropy stage effectively verifies the model. If
>>artefacts do appear, they can (indirectly) tell you what you are
>>missing.
>>
>>I was specifically looking at gammatone filterbank methods, so there
>>is no comparison to other schemas - but I hope it is a bit in the
>>direction you're looking at.
>>
>>Cheers,
>>Joachim.
>>
>>On 2 September 2014 20:39, ftordini@xxxxxxxxx <ftordini@xxxxxxxxx> wrote:
>>>
>>> Dear List members,
>>> I am looking for references on perceptual evaluation of cochlear models -
>>> taken from an analysis-synthesis point of view, like the work introduced in
>>> Hohmann_2002 (Frequency analysis and synthesis using a Gammatone filterbank,
>>> §4.3).
>>>
>>> Are you aware of any study that tried to assess the performance of
>>> gammatone-like filterbanks used as a synthesis model? (AKA, what are the
>>> advantages over MPEG-like schemas?)
>>>
>>> All the best,
>>> Francesco
>>>
>>> http://www.cim.mcgill.ca/sre/personnel/
>>> http://ca.linkedin.com/in/ftordini