Call for papers

==== Apologies for cross posting ====

Dear colleagues,

We are happy to share the call for papers for the special session “Representation Learning for Audio, Speech, and Music Processing” at the International Joint Conference on Neural Networks (IJCNN) 2021. All papers for the special session are submitted through the conference submission portal (as regular papers) and undergo the same full-paper peer-review process as any other paper at IJCNN 2021.

Special session website: https://dr-costas.github.io/rlasmp2021-website/
Conference website: https://www.ijcnn.org
Accepted special sessions at IJCNN: https://www.ijcnn.org/accepted-special-sessions

=====================================
Important dates:

Paper submission: 15th of January, 2021
Notification of acceptance: 15th of March, 2021
Camera-ready submission: 30th of March, 2021
=====================================

Scope and topics:

In the last decade, deep learning has revolutionized the research fields of audio and speech signal processing, acoustic scene analysis, and music information retrieval. In these fields, methods relying on deep learning have achieved remarkable performance in various applications and tasks, surpassing legacy methods that rely on the independent use of signal processing operations and machine learning algorithms. The success of deep learning methods stems from their ability to learn representations of sound signals that are useful for various downstream tasks. These representations encapsulate the underlying structure or features of the sound signals, or the latent variables that describe the underlying statistics of the respective signals.

Despite this success, learning representations of audio with deep models remains challenging. For example, the diversity of acoustic noise, the multiplicity of recording devices (e.g., high-end microphones vs. smartphones), and the variability of sources challenge machine learning methods when they are used in realistic environments. In audio event detection, which has recently become a vigorous research field, systems for the automatic detection of multiple overlapping events are still far from reaching human performance. Another major challenge is the design of robust speech processing systems. Speech enhancement technologies have improved significantly in the past years, notably thanks to deep learning methods; however, there is still a large performance gap between controlled environments and real-world situations. As a final example, in the music information retrieval field, modeling high-level semantics based on local and long-term relations in music signals is still a core challenge.

More generally, self-supervised approaches that can leverage large amounts of unlabeled data are very promising for learning models that can serve as a powerful base for many applications and tasks. Thus, it is of great interest for the scientific community to find new methods for representing audio signals using hierarchical models, such as deep neural networks. This will enable novel learning methods to leverage the large amount of information that audio, speech, and music signals convey.

The aim of this session is to establish a venue where engineers, scientists, and practitioners from both academia and industry can present and discuss cutting-edge results in representation learning for audio, speech, and music signal processing.
Driven by the constantly increasing popularity of audio, speech, and music representation learning, the organizing committee of this session is motivated to build, in the long term, a solid reference within the computational intelligence community for the digital audio field. The scope of this special session is representation learning, focused on audio, speech, and music. Since representation learning is a core aspect of neural networks, the scope of the session is well aligned with that of IJCNN.

The topics of the special session include, but are not limited to:

• Audio, speech, and music signal generative models and methods
• Single- and multi-channel methods for separation, enhancement, and denoising
• Spatial analysis, modification, and synthesis for augmented and virtual reality
• Detection, localization, and tracking of audio sources/events
• Style transfer, voice conversion, digital effects, and personalization
• Adversarial attacks and real/synthetic discrimination methods
• Information retrieval and classification methods
• Multi- and inter-modal models and methods
• Self-supervised/metric learning methods
• Domain adaptation, transfer learning, knowledge distillation, and K-shot approaches
• Methods based on differentiable signal processing
• Privacy-preserving methods
• Interpretability and explainability in deep models for audio
• Context- and structure-aware approaches

On behalf of the organizing committee,

Konstantinos Drossos, PhD
Senior researcher
Audio Research Group
Tampere University, Finland
Office: TF309
Address: Korkeakoulunkatu 10, FI-33720
mail: konstantinos.drossos@xxxxxxx