
[AUDITORY] Special Issue, Hearing Research - Predicting Speech Intelligibility



Dear Colleagues,

    We are happy to announce that Hearing Research has just published a Special Issue on Predicting Speech Intelligibility (December 2022, Vol. 426).

The authors, titles, abstracts, and links for each paper are included below.
Guest Editors: Torsten Dau and Laurel Carney

 

Satyabrata Parida, Michael G. Heinz  Underlying neural mechanisms of degraded speech intelligibility following noise-induced hearing loss: The importance of distorted tonotopy https://doi.org/10.1016/j.heares.2022.108586
Abstract: Listeners with sensorineural hearing loss (SNHL) have substantial perceptual deficits, especially in noisy environments. Unfortunately, speech-intelligibility models have limited success in predicting the performance of listeners with hearing loss. A better understanding of the various suprathreshold factors that contribute to neural-coding degradations of speech in noisy conditions will facilitate better modeling and clinical outcomes. Here, we highlight the importance of one physiological factor that has received minimal attention to date, termed distorted tonotopy, which refers to a disruption in the mapping between acoustic frequency and cochlear place that is a hallmark of normal hearing. More so than commonly assumed factors (e.g., threshold elevation, reduced frequency selectivity, diminished temporal coding), distorted tonotopy severely degrades the neural representations of speech (particularly in noise) in single- and across-fiber responses in the auditory nerve following noise-induced hearing loss. Key results include: 1) effects of distorted tonotopy depend on stimulus spectral bandwidth and timbre, 2) distorted tonotopy increases across-fiber correlation and thus reduces information capacity to the brain, and 3) its effects vary across etiologies, which may contribute to individual differences. These results motivate the development and testing of noninvasive measures that can assess the severity of distorted tonotopy in human listeners. The development of such noninvasive measures of distorted tonotopy would advance precision-audiological approaches to improving diagnostics and rehabilitation for listeners with SNHL.
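
To give a feel for the across-fiber-correlation point in result 2), here is a toy Python sketch (not the authors' analysis: the shared-drive mixing and the equicorrelation-based channel count are illustrative assumptions) showing how a common drive across fibers inflates pairwise correlations and shrinks the number of effectively independent channels:

    import numpy as np

    rng = np.random.default_rng(0)
    n_fibers, n_samples = 20, 5000
    shared = rng.standard_normal(n_samples)   # common drive mimicking distorted tonotopy

    def population(mix):
        # mix = 0: independent fibers; mix near 1: fibers dominated by the shared drive
        return np.array([mix * shared + (1 - mix) * rng.standard_normal(n_samples)
                         for _ in range(n_fibers)])

    for mix in (0.0, 0.7):
        resp = population(mix)
        r = np.corrcoef(resp)                              # fiber-by-fiber correlation matrix
        mean_r = r[np.triu_indices(n_fibers, k=1)].mean()
        n_eff = n_fibers / (1 + (n_fibers - 1) * mean_r)   # effective independent channels
        print(f"mix={mix:.1f}: mean r={mean_r:.2f}, effective channels ~ {n_eff:.1f}")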

 

Johannes Zaar, Laurel H. Carney  Predicting speech intelligibility in hearing-impaired listeners using a physiologically inspired auditory model   https://doi.org/10.1016/j.heares.2022.108553
Abstract: This study presents a major update and full evaluation of a speech intelligibility (SI) prediction model previously introduced by Scheidiger, Carney, Dau, and Zaar [(2018), Acta Acust. United Ac. 104, 914-917]. The model predicts SI in speech-in-noise conditions via comparison of the noisy speech and the noise-alone reference. The two signals are processed through a physiologically inspired nonlinear model of the auditory periphery, for a range of characteristic frequencies (CFs), followed by a modulation analysis in the range of the fundamental frequency of speech. The decision metric of the model is the mean of a series of short-term, across-CF correlations between population responses to noisy speech and noise alone, with a sensitivity-limitation process imposed. The decision metric is assumed to be inversely related to SI and is converted to a percent-correct score using a single data-based fitting function. The model performance was evaluated in conditions of stationary, fluctuating, and speech-like interferers using sentence-based speech-reception thresholds (SRTs) previously obtained in 5 normal-hearing (NH) and 13 hearing-impaired (HI) listeners. For the NH listener group, the model accurately predicted SRTs across the different acoustic conditions (apart from a slight overestimation of the masking release observed for fluctuating maskers), as well as plausible effects in response to changes in presentation level. For HI listeners, the model was adjusted to account for the individual audiograms using standard assumptions concerning the amount of HI attributed to inner-hair-cell (IHC) and outer-hair-cell (OHC) impairment. HI model results accounted remarkably well for elevated individual SRTs and reduced masking release. Furthermore, plausible predictions of worsened SI were obtained when the relative contribution of IHC impairment to HI was increased. Overall, the present model provides a useful tool to accurately predict speech-in-noise outcomes in NH and HI listeners, and may yield important insights into auditory processes that are crucial for speech understanding.
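
For readers who want a concrete picture of this type of decision metric, the following Python sketch computes a mean short-term correlation between two population responses and maps it to percent correct (toy code: the patch-wise correlation, frame length, and sigmoid constants are placeholders, not the model's actual across-CF computation or fitted function):

    import numpy as np

    def decision_metric(resp_noisy_speech, resp_noise, frame_len=200):
        # Inputs: population responses, shape (n_CF, n_samples).
        # Each short-term frame is correlated over the whole CF-by-time patch,
        # a simplified stand-in for the paper's across-CF correlation.
        n_cf, n = resp_noisy_speech.shape
        corrs = []
        for start in range(0, n - frame_len + 1, frame_len):
            a = resp_noisy_speech[:, start:start + frame_len].ravel()
            b = resp_noise[:, start:start + frame_len].ravel()
            corrs.append(np.corrcoef(a, b)[0, 1])
        return float(np.mean(corrs))   # high correlation -> speech buried in the noise

    def percent_correct(metric, slope=10.0, midpoint=0.6):
        # Hypothetical fitting function: the metric is inversely related to SI.
        return 100.0 / (1.0 + np.exp(slope * (metric - midpoint)))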

 

Helia Relaño-Iborra, Torsten Dau  Speech intelligibility prediction based on modulation frequency-selective processing  https://doi.org/10.1016/j.heares.2022.108610
Abstract: Speech intelligibility models can provide insights regarding the auditory processes involved in human speech perception and communication. One successful approach to modelling speech intelligibility has been based on the analysis of the amplitude modulations present in speech as well as competing interferers. This review covers speech intelligibility models that include a modulation-frequency-selective processing stage, i.e., a modulation filterbank, as part of their front end. The speech-based envelope power spectrum model [sEPSM, Jørgensen and Dau (2011). J. Acoust. Soc. Am. 130(3), 1475-1487], several variants of the sEPSM including modifications with respect to temporal resolution, spectro-temporal processing and binaural processing, as well as the speech-based computational auditory signal processing and perception model [sCASP; Relaño-Iborra et al. (2019). J. Acoust. Soc. Am. 146(5), 3306-3317], which is based on an established auditory signal detection and masking model, are discussed. The key processing stages of these models for the prediction of speech intelligibility across a variety of acoustic conditions are addressed in relation to competing modeling approaches. The strengths and weaknesses of the modulation-based analysis are outlined and perspectives presented, particularly in connection with the challenge of predicting the consequences of individual hearing loss on speech intelligibility.
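
As a rough illustration of the modulation-domain idea behind the sEPSM family, this Python sketch estimates a per-band envelope SNR (SNRenv) from a noisy-speech signal and a noise-alone reference (simplified on purpose: one audio channel, Butterworth modulation filters, and a plain sum across bands stand in for the model's gammatone front end and modulation filterbank):

    import numpy as np
    from scipy.signal import hilbert, butter, sosfiltfilt

    fs = 16000

    def envelope(x):
        return np.abs(hilbert(x))                  # Hilbert envelope

    def mod_band_power(env, f_lo, f_hi, fs):
        # envelope power within one modulation band
        sos = butter(2, [f_lo, f_hi], btype="band", fs=fs, output="sos")
        e = sosfiltfilt(sos, env - env.mean())
        return np.mean(e ** 2)

    def snr_env(noisy_speech, noise, bands=((1, 2), (2, 4), (4, 8), (8, 16))):
        total = 0.0
        for f_lo, f_hi in bands:
            p_sn = mod_band_power(envelope(noisy_speech), f_lo, f_hi, fs)
            p_n = mod_band_power(envelope(noise), f_lo, f_hi, fs)
            total += max(p_sn - p_n, 1e-6) / max(p_n, 1e-6)   # per-band SNRenv
        return total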

 

Amin Edraki, Wai-Yip Chan, Jesper Jensen, Daniel Fogerty  Spectro-temporal modulation glimpsing for speech intelligibility prediction  https://doi.org/10.1016/j.heares.2022.108620  
Abstract: We compare two alternative speech intelligibility prediction algorithms: time-frequency glimpse proportion (GP) and spectro-temporal glimpsing index (STGI). Both algorithms hypothesize that listeners understand speech in challenging acoustic environments by “glimpsing” partially available information from degraded speech. GP defines glimpses as those time-frequency regions whose local signal-to-noise ratio is above a certain threshold and estimates intelligibility as the proportion of the time-frequency regions glimpsed. STGI, on the other hand, applies glimpsing to the spectro-temporal modulation (STM) domain and uses a similarity measure based on the normalized cross-correlation between the STM envelopes of the clean and degraded speech signals to estimate intelligibility as the proportion of the STM channels glimpsed. Our experimental results demonstrate that STGI extends the notion of glimpsing proportion to a wider range of distortions, including non-linear signal processing, and outperforms GP for the additive uncorrelated noise datasets we tested. Furthermore, the results show that spectro-temporal modulation analysis enables STGI to account for the effects of masker type on speech intelligibility, leading to superior performance over GP in modulated noise datasets.
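
The GP idea reduces to a few lines of code. This Python sketch counts the proportion of time-frequency cells whose local SNR exceeds a criterion (simplified: STFT cells instead of an auditory filterbank, and the -5 dB threshold is an illustrative choice; it also assumes separate, equal-length clean and noise signals):

    import numpy as np
    from scipy.signal import stft

    def glimpse_proportion(clean, noise, fs=16000, thresh_db=-5.0):
        # clean and noise must be the same length
        _, _, S = stft(clean, fs=fs, nperseg=256)
        _, _, N = stft(noise, fs=fs, nperseg=256)
        local_snr = 10 * np.log10((np.abs(S) ** 2 + 1e-12) /
                                  (np.abs(N) ** 2 + 1e-12))
        return float(np.mean(local_snr > thresh_db))   # proportion of cells glimpsed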

 

Luna Prud’homme, Mathieu Lavandier, Virginia Best  Investigating the role of harmonic cancellation in speech-on-speech masking   https://doi.org/10.1016/j.heares.2022.108562
Abstract: This study investigated the role of harmonic cancellation in the intelligibility of speech in “cocktail party” situations. While there is evidence that harmonic cancellation plays a role in the segregation of simple harmonic sounds based on fundamental frequency (F0), its utility for mixtures of speech containing non-stationary F0s and unvoiced segments is unclear. Here we focused on the energetic masking of speech targets caused by competing speech maskers. Speech reception thresholds were measured using seven maskers: speech-shaped noise, monotonized and intonated harmonic complexes, monotonized speech, noise-vocoded speech, reversed speech and natural speech. These maskers enabled an estimate of how the masking potential of speech is influenced by harmonic structure, amplitude modulation and variations in F0 over time. Measured speech reception thresholds were compared to the predictions of two computational models, with and without a harmonic cancellation component. Overall, the results suggest a minor role of harmonic cancellation in reducing energetic masking in speech mixtures.
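
As a toy illustration of the harmonic-cancellation principle (not the model tested in the paper), a delay-and-subtract comb filter tuned to one period of the masker F0 nulls the masker's harmonics while largely sparing a target with a different F0:

    import numpy as np

    fs = 16000
    t = np.arange(fs) / fs                       # 1 s of signal

    def harmonic_complex(f0, n_harm=10):
        return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harm + 1))

    masker = harmonic_complex(100.0)             # masker F0 = 100 Hz
    target = harmonic_complex(137.0)             # target F0 = 137 Hz

    T = int(round(fs / 100.0))                   # one masker period, in samples

    def comb_drop_db(x):
        # energy change through the delay-and-subtract comb y[n] = x[n] - x[n-T]
        y = x[T:] - x[:-T]
        return 10 * np.log10(np.mean(y ** 2) / np.mean(x ** 2) + 1e-15)

    print("masker: %+.1f dB, target: %+.1f dB" %
          (comb_drop_db(masker), comb_drop_db(target)))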

 

Luna Prud’homme, Mathieu Lavandier, Virginia Best  A dynamic binaural harmonic-cancellation model to predict speech intelligibility against a harmonic masker varying in intonation, temporal envelope, and location   https://doi.org/10.1016/j.heares.2022.108535
Abstract: The aim of this study was to extend the harmonic-cancellation model proposed by Prud’homme et al. [J. Acoust. Soc. Am. 148 (2020) 3246-3254] to predict speech intelligibility against a harmonic masker, so that it takes into account binaural hearing, amplitude modulations in the masker and variations in masker fundamental frequency (F0) over time. This was done by segmenting the masker signal into time frames and combining the previous long-term harmonic-cancellation model with the binaural model proposed by Vicente and Lavandier [Hear. Res. 390 (2020) 107937]. The new model was tested on the data from two experiments involving harmonic complex maskers that varied in spatial location, temporal envelope and F0 contour. The interactions between the associated effects were accounted for in the model by varying the time frame duration and excluding the binaural unmasking computation when harmonic cancellation is active. Across both experiments, the correlation between data and model predictions was over 0.96, and the mean and largest absolute prediction errors were lower than 0.6 and 1.5 dB, respectively.
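
The frame-based combination logic can be sketched in a few lines of Python (purely schematic: the per-frame gain functions below are crude stand-ins, not the model's computations; only the exclusion rule follows the text):

    import numpy as np

    def hc_gain(frame):
        # stand-in: harmonic cancellation helps only if the masker F0 is
        # stable enough within the frame
        return 3.0 if frame["f0_stable"] else 0.0

    def bu_gain(frame):
        # stand-in: binaural unmasking from interaural differences
        return 2.0 if frame["spatially_separated"] else 0.0

    def predicted_gain(frames):
        gains = []
        for fr in frames:
            g = hc_gain(fr)
            if g == 0.0:
                g = bu_gain(fr)   # binaural unmasking is excluded whenever
                                  # harmonic cancellation is active
            gains.append(g)
        return float(np.mean(gains))

    frames = [{"f0_stable": True, "spatially_separated": True},
              {"f0_stable": False, "spatially_separated": True}]
    print(predicted_gain(frames))   # -> 2.5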

 

David Hülsmeier, Birger Kollmeier   How much individualization is required to predict the individual effect of suprathreshold processing deficits? Assessing Plomp's distortion component with psychoacoustic detection thresholds and FADE  https://doi.org/10.1016/j.heares.2022.108609
Abstract: Plomp introduced an empirical separation of the elevated speech recognition thresholds (SRTs) in listeners with sensorineural hearing loss into an Attenuation (A) component (which can be compensated by amplification) and a non-compensable Distortion (D) component. Our previous research supported this notion with speech recognition models that derive their SRT prediction from the individual audiogram with or without a psychoacoustic measure of suprathreshold processing deficits. To determine how precisely the A and D components can be separated for the individual listener using various individual measures and individualized models, SRTs were obtained in quiet, stationary noise, and fluctuating noise (ICRA 5–250 and babble) for 40 listeners spanning a range of hearing impairment. Both the clinical audiogram and an adaptive, precise sweep audiogram were obtained, as well as tone-in-noise detection thresholds at four frequencies, to characterize the individual hearing impairment. For predicting the SRT, the machine-learning-based FADE model was used with either of the two audiogram procedures and, optionally, the individual tone-in-noise detection thresholds. The results indicate that the precisely measured swept-tone audiogram allows for a more precise prediction of the individual SRT than the clinical audiogram (RMS error of 4.3 dB vs. 6.4 dB, respectively). While estimations from the precise audiogram and FADE performed equally well in predicting the individual A and D components, the further refinement of including the tone-in-noise detection thresholds with FADE led to a slight improvement in prediction accuracy (RMS errors of 3.3 dB, 4.6 dB, and 1.4 dB for the SRT, A, and D components, respectively). Hence, applying FADE is advantageous for scientific purposes, where consistent modeling of different psychoacoustic effects in the same listener with a minimum of assumptions is desirable. For clinical purposes, however, a precisely measured audiogram and an estimation of the expected D component using a linear regression appear to be a satisfactory first step towards precision audiology.
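
In its simplest textbook form, Plomp's split can be computed from two SRTs: the SRT elevation in noise reflects D alone (amplification does not help against noise presented at the same level), while the elevation in quiet reflects A + D. A minimal Python sketch, with hypothetical normal-hearing reference values:

    def plomp_components(srt_quiet, srt_noise, ref_quiet=20.0, ref_noise=-7.0):
        """All values in dB; ref_* are hypothetical normal-hearing SRTs
        (speech level in quiet, SNR in noise)."""
        d = srt_noise - ref_noise          # non-compensable distortion
        a = (srt_quiet - ref_quiet) - d    # compensable attenuation
        return a, d

    a, d = plomp_components(srt_quiet=55.0, srt_noise=-2.0)
    print(f"A = {a:.1f} dB, D = {d:.1f} dB")   # A = 30.0 dB, D = 5.0 dB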

 

Jan Rennies, Saskia Röttges, Rainer Huber, Christopher F. Hauth, Thomas Brand  A joint framework for blind prediction of binaural speech intelligibility and perceived listening effort https://doi.org/10.1016/j.heares.2022.108598
Abstract: Speech perception is strongly affected by noise and reverberation in the listening room, and binaural processing can substantially facilitate speech perception in conditions where target speech and maskers originate from different directions. Most studies and proposed models for predicting spatial unmasking have focused on speech intelligibility. The present study introduces a model framework that predicts both speech intelligibility and perceived listening effort from the same output measure. The framework is based on a combination of a blind binaural processing stage employing a blind equalization-cancellation (EC) mechanism, and a blind backend based on phoneme probability classification. Neither frontend nor backend requires any additional information, such as the source directions, the signal-to-noise ratio (SNR), or the number of sources, allowing for a fully blind perceptual assessment of binaural input signals consisting of target speech mixed with noise. The model is validated against a recent data set in which speech intelligibility and perceived listening effort were measured for a range of acoustic conditions differing in reverberation and binaural cues [Rennies and Kidd (2018), J. Acoust. Soc. Am. 144, 2147-2159]. Predictions of the proposed model are compared with those of a non-blind binaural model consisting of a non-blind EC stage and a backend based on the speech intelligibility index. The analyses indicated that all main trends observed in the experiments were correctly predicted by the blind model. The overall proportion of variance explained by the model (R² = 0.94) for speech intelligibility was slightly lower than for the non-blind model (R² = 0.98). For listening effort predictions, both models showed lower prediction accuracy, but still explained significant proportions of the observed variance (R² = 0.88 and R² = 0.71 for the non-blind and blind model, respectively). Closer inspection showed that the differences between data and predictions were largest for binaural conditions at high SNRs, where the perceived listening effort of human listeners tended to be underestimated by the models, specifically by the blind version.
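
The EC principle at the heart of the frontend can be illustrated with a small (non-blind, toy) Python sketch that searches for the interaural delay and gain minimizing the residual energy after cancellation; the paper's blind EC stage has to estimate this without separate access to target and masker:

    import numpy as np

    def ec_cancel(left, right, max_delay=32):
        """Search the interaural delay (in samples) and gain that minimize
        the energy left after cancellation. np.roll wraps around; a real
        implementation would zero-pad instead."""
        best, best_energy = None, np.inf
        for d in range(-max_delay, max_delay + 1):
            r = np.roll(right, d)
            g = np.dot(left, r) / (np.dot(r, r) + 1e-12)   # least-squares gain
            e = np.mean((left - g * r) ** 2)               # residual energy
            if e < best_energy:
                best, best_energy = (d, g), e
        return best, best_energy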

 

James M. Kates, Kathryn H. Arehart  An overview of the HASPI and HASQI metrics for predicting speech intelligibility and speech quality for normal hearing, hearing loss, and hearing aids   https://doi.org/10.1016/j.heares.2022.108608
Abstract: Alterations of the speech signal, including additive noise and nonlinear distortion, can reduce speech intelligibility and quality. Hearing aids present an especially complicated situation, since these devices may implement nonlinear processing designed to compensate for the hearing loss. Hearing-aid processing is often realized as time-varying multichannel gain adjustments, and may also include frequency reassignment. The challenge in designing metrics for hearing aids and hearing-impaired listeners is to accurately model the perceptual trade-offs between speech audibility and the nonlinear distortion introduced by hearing-aid processing. This paper focuses on the Hearing Aid Speech Perception Index (HASPI) and the Hearing Aid Speech Quality Index (HASQI) as representative metrics for predicting intelligibility and quality. These indices start with a model of the auditory periphery that can be adjusted to represent hearing loss. The peripheral model, the speech features computed from the model outputs, and the procedures used to fit the features to subject data are described. Examples are then presented for using the metrics to measure the effects of additive noise, to evaluate noise-suppression processing, and to measure the differences among commercial hearing aids. Open questions and considerations in using these and related metrics are then discussed.
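
The common skeleton of such intelligibility indices (peripheral model, envelope-based features, a mapping fitted to listener data) can be sketched as follows; every stage here is a crude stand-in (a hypothetical gain rule, a single correlation feature, made-up sigmoid weights), not HASPI's actual processing:

    import numpy as np

    def auditory_model(signal, audiogram_db):
        # stand-in periphery: attenuate according to the (mean) audiogram
        # and return a normalized coarse envelope
        gain = 10 ** (-np.mean(audiogram_db) / 20)
        env = np.abs(signal) * gain
        return env / (env.max() + 1e-12)

    def intelligibility_index(reference, processed, audiogram_db,
                              w=(4.0, -2.0)):           # made-up weights
        ref_env = auditory_model(reference, audiogram_db)
        proc_env = auditory_model(processed, audiogram_db)
        feature = np.corrcoef(ref_env, proc_env)[0, 1]  # one envelope feature
        return 1.0 / (1.0 + np.exp(-(w[0] * feature + w[1])))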

 

Marlies Gillis, Jana Van Canneyt, Tom Francart, Jonas Vanthornhout   Neural tracking as a diagnostic tool to assess the auditory pathway   https://doi.org/10.1016/j.heares.2022.108607
Abstract: When a person listens to sound, the brain time-locks to specific aspects of the sound. This is called neural tracking and it can be investigated by analysing neural responses (e.g., measured by electroencephalography) to continuous natural speech. Measures of neural tracking allow for an objective investigation of a range of auditory and linguistic processes in the brain during natural speech perception. This approach is more ecologically valid than traditional auditory evoked responses and has great potential for research and clinical applications. This article reviews the neural tracking framework and highlights three prominent examples of neural tracking analyses: neural tracking of the fundamental frequency of the voice (f0), of the speech envelope, and of linguistic features. Each of these analyses provides a unique point of view into the human brain’s hierarchical stages of speech processing. F0-tracking assesses the encoding of fine temporal information in the early stages of the auditory pathway, i.e., from the auditory periphery up to early processing in the primary auditory cortex. Envelope tracking reflects bottom-up and top-down speech-related processes in the auditory cortex and is likely necessary but not sufficient for speech intelligibility. Linguistic feature tracking (e.g., word or phoneme surprisal) reflects neural processes more directly related to speech intelligibility. Together, these analyses form a multi-faceted, objective assessment of an individual’s auditory and linguistic processing.
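
Envelope tracking is often quantified with a backward model: reconstruct the speech envelope from multichannel EEG via regularized regression and report the correlation between actual and reconstructed envelopes. A minimal Python sketch with simulated data (real pipelines add time-lagged features and cross-validation):

    import numpy as np

    rng = np.random.default_rng(1)
    n_samples, n_channels = 2000, 32
    envelope = np.abs(rng.standard_normal(n_samples))            # toy speech envelope
    eeg = np.outer(envelope, rng.standard_normal(n_channels))    # envelope in every channel
    eeg += 5.0 * rng.standard_normal((n_samples, n_channels))    # plus background activity

    lam = 1e2                                                    # ridge regularizer
    w = np.linalg.solve(eeg.T @ eeg + lam * np.eye(n_channels), eeg.T @ envelope)
    reconstruction = eeg @ w
    print("neural tracking r = %.2f" % np.corrcoef(envelope, reconstruction)[0, 1])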

 

Mahdie Karbasi, Dorothea Kolossa  ASR-based speech intelligibility prediction: A review https://doi.org/10.1016/j.heares.2022.108606
Abstract: Various methods and approaches are available to predict the intelligibility of speech signals, but many of these still suffer from two major problems: first, the prior knowledge they require, which can limit a method's applicability and lower its objectivity, and second, a low generalization capacity, e.g., across noise types, degradation conditions, and speech material. Automatic speech recognition (ASR) has been suggested as a machine-learning-based component of speech intelligibility prediction (SIP), aiming to ameliorate the shortcomings of other SIP methods. Since their first introduction, ASR-based SIP approaches have developed at an increasingly rapid pace, have been deployed in a range of contexts, and have shown promising performance in many scenarios. Our article provides an overview of this body of research. The main differences between competing methods are highlighted, and their benefits are explained alongside their limitations. We conclude with an outlook on future work and new related directions.
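
The core recipe shared by many ASR-based SIP methods fits in a few lines: run a recognizer on the degraded speech and take its word accuracy as the intelligibility estimate. In this Python sketch the recognizer is a hypothetical callable, and exact-match scoring stands in for a proper Levenshtein alignment:

    def word_accuracy(reference_words, hypothesis_words):
        # simple position-wise scoring; real systems align with edit distance
        hits = sum(r == h for r, h in zip(reference_words, hypothesis_words))
        return hits / max(len(reference_words), 1)

    def predict_intelligibility(recognizer, degraded_audio, reference_text):
        hypothesis = recognizer(degraded_audio)   # hypothetical ASR callable
        return word_accuracy(reference_text.split(), hypothesis.split())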

 

Torsten Dau  tdau@xxxxxx

Laurel H. Carney Laurel_Carney@xxxxxxxxxxxxxxxxxx