Dear all (with apologies for any cross-postings)
We are pleased to announce the third edition of the COG-MHEAR Challenge on audio-visual speech enhancement (AVSEC-3), which we are running as a Satellite Workshop of INTERSPEECH 2024 on 1st September 2024 on Kos Island, Greece (http://challenge.cogmhear.org).
Important Dates:
- 16th February 2024: Release of training and development data
- 22nd March 2024: Release of low-latency baseline system
- 10th April 2024: Evaluation data release
- 10th April 2024: Leaderboard open for submissions
- 6th May 2024: Paper submission opens
- 20th June 2024: Deadline for challenge submissions
- 28th June 2024: Paper submission closes
- 12th July 2024: Acceptance notification
- 26th July 2024: Early release of evaluation results
- 1st August 2024: Camera-ready paper deadline
The AVSEC Challenge sets the first benchmark in the field by providing a carefully crafted dataset and scalable protocol for human listening evaluation of audio-visual speech enhancement systems. The open AVSEC framework aims to foster collaborative research and innovation to facilitate the development and evaluation of next-generation audio-visual speech enhancement and separation systems, including multimodal assistive hearing and communication technologies.
The success of the two previous editions of the Challenge (organized as part of IEEE SLT 2022 and IEEE ASRU 2023) demonstrates a consistent trend of system improvement, yet highlights an enduring intelligibility gap when compared to clean speech. We anticipate that the third edition will further enhance system performance and provide a networking and collaborative platform for deliberating on the scope, challenges and opportunities in co-designing and evaluating future speech and hearing technologies.
To register for the challenge, please follow the guidelines on the website: https://challenge.cogmhear.org/#/getting-started/register
We welcome submissions from participants of the second (AVSEC-2) and third (AVSEC-3) editions of the Challenge, and also invite submissions on related research topics, including but not limited to the following:
- Low-latency approaches to audio-visual speech enhancement and separation
- Human auditory-inspired models of multi-modal speech perception and enhancement
- Energy-efficient audio-visual speech enhancement and separation methods
- Machine learning for diverse target listeners and diverse listening scenarios
- Audio quality and intelligibility assessment of audio-visual speech enhancement systems
- Objective metrics to predict quality and intelligibility from audio-visual stimuli
- Understanding human speech perception in competing-speaker scenarios
- Clinical applications and live demonstrators of audio-visual speech enhancement and separation (e.g. multi-modal hearing assistive technologies for hearing-impaired listeners; speech-enabled communication aids to support autistic people with speech disorders)
- Accessibility and human-centric factors in the design and evaluation of innovative multimodal technologies, including multi-modal corpus development, public perceptions, ethics considerations, standards, and societal, economic and political impacts
Accepted Workshop papers (both short 2-page papers and full-length papers of 4-6 pages) will be published in the ISCA Proceedings. Authors of selected papers (including winners and runners-up of each Challenge Track) will be invited to submit significantly extended papers for consideration in a Special Issue of the IEEE Journal of Selected Topics in Signal Processing (JSTSP) on "Deep Multimodal Speech Enhancement and Separation" (CFP available below and here: https://signalprocessingsociety.org/publications-resources/special-issue-deadlines/ieee-jstsp-special-issue-deep-multimodal-speech-enhancement-and-separation; full manuscript submission deadline: 30 September 2024).
Many thanks in advance,
Prof Amir Hussain
Edinburgh Napier University, Scotland, UK
E-mail: a.hussain@xxxxxxxxxxxx
Manuscript Due: 30 September 2024
Publication Date: May 2025

Scope
Voice is the modality most commonly used by humans to communicate and psychologically blend into society. Recent technological advances have triggered the development of various voice-related applications in the information and communications technology market. However, noise, reverberation, and interfering speech are detrimental to effective communication between humans, and between humans and machines, leading to performance degradation of associated voice-enabled services. To address the formidable speech-in-noise challenge, a range of speech enhancement (SE) and speech separation (SS) techniques are normally employed as important front-end speech processing units that handle distortions in input signals in order to provide more intelligible speech for automatic speech recognition (ASR), synthesis and dialogue systems. Emerging advances in artificial intelligence (AI) and machine learning, particularly deep neural networks, have led to remarkable improvements in SE- and SS-based solutions. A growing number of researchers have explored extensions of these methods that utilise a variety of modalities as auxiliary inputs to the main speech processing task, accessing additional information from heterogeneous signals. In particular, multi-modal SE and SS systems have been shown to deliver enhanced performance in challenging noisy environments by augmenting the conventional speech modality with complementary information from multi-sensory inputs, such as video, noise type, signal-to-noise ratio (SNR), bone-conducted speech (vibrations), speaker, text information, electromyography, and electromagnetic midsagittal articulometer (EMMA) data. Various integration schemes, including early and late fusion, cross-attention mechanisms, and self-supervised learning algorithms, have also been successfully explored.
Topics
This timely special issue aims to collate the latest advances in multi-modal SE and SS systems that exploit both conventional and unconventional modalities to further improve state-of-the-art performance on benchmark problems. We particularly welcome submissions on novel deep neural network based algorithms and architectures, including new feature processing methods for multimodal and cross-modal speech processing. We also encourage submissions that address practical issues related to multimodal data recording, energy-efficient system design, and real-time low-latency solutions, such as for assistive hearing and speech communication applications. Research topics of interest relate to open problems that need to be addressed. These include, but are not limited to, the following.
We encourage submissions that not only propose novel approaches but also substantiate their findings with rigorous evaluations, including on real-world datasets. Studies that provide insights into the challenges involved and the impact of MM-SE and MM-SS systems on end-users are particularly welcome.

Submission Guidelines
Manuscripts must be original and must not have been previously published or be currently under consideration for publication elsewhere. All submissions will be peer-reviewed according to the IEEE Signal Processing Society review process. Authors should prepare their manuscripts according to the Instructions for Authors available from the Signal Processing Society website, and follow the submission instructions given on the IEEE JSTSP webpage.

Important Dates
Manuscript Submission Deadline: 30 September 2024
First Review Due: 15 December 2024
Revised Manuscript Due: 15 January 2025
Second Review Due: 15 February 2025
Final Decision: 28 February 2025

Guest Editors
For further information, please contact the guest editors at:
Amir Hussain, Edinburgh Napier University, UK (Lead GE)
Yu Tsao, Academia Sinica, Taiwan (co-Lead GE)
John H.L. Hansen, University of Texas at Dallas, USA
Naomi Harte, Trinity College Dublin, Ireland
Shinji Watanabe, Carnegie Mellon University, USA
Isabel Trancoso, Instituto Superior Técnico, University of Lisbon, Portugal
Shixiong Zhang, Tencent AI Lab, USA