
[AUDITORY] [PhD position] Multiple sound source tracking with deep learning



Apologies for cross-posting:

 

PhD position in Multiple sound source tracking with deep learning, M/F

 

Ref : 2022-10750 | 31 Mar 2022

4 rue du clos Courtel 35510 CESSON SEVIGNE - France

 

 

About the role

Your role is to develop neural-network-based methods for accurate, causal, low-latency tracking, capable of simultaneously estimating the trajectories of multiple speech sources.

Owing to progress in deep learning, speech recognition has gained great momentum in recent years. Nevertheless, the accuracy of speech recognition engines degrades in adverse acoustic conditions due to noise, reverberation, and interfering speech sources, which calls for a pre-processing speech enhancement stage. This is usually achieved by combining a microphone array with beamforming, which aims to preserve the sound coming from the target source direction while attenuating the rest. Knowing the Direction of Arrival (DoA) of the useful speech signal with respect to the microphone array is therefore a prerequisite. In practice, the DoA of a target source is estimated by a source localization algorithm, which provides only instantaneous, noisy, and unlabeled DoA estimates. Such observations are inadequate for beamforming, which requires source trajectories, i.e., the positions of each target source over time. The goal of the tracking algorithm is to assemble these trajectories from the raw DoA observations.
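Purely as an illustration of the problem (this is not code from the project), assembling frame-wise DoA estimates into labeled trajectories can be sketched as a causal, greedy nearest-neighbor association; the gating threshold and frame structure below are illustrative assumptions:

```python
import numpy as np

def track_doas(frames, gate_deg=15.0):
    """Causally assemble frame-wise DoA estimates into labeled trajectories.

    frames: list of per-frame azimuth estimates in degrees (unlabeled, possibly
    a different count per frame). Returns {track_id: [(frame_idx, azimuth), ...]}.
    """
    tracks = {}    # track_id -> list of (frame_idx, azimuth)
    last_pos = {}  # track_id -> last associated azimuth
    next_id = 0
    for t, doas in enumerate(frames):
        unmatched = list(doas)
        # greedy nearest-neighbor association, with angles wrapped on the circle
        for tid, prev in list(last_pos.items()):
            if not unmatched:
                break
            dists = [abs((d - prev + 180.0) % 360.0 - 180.0) for d in unmatched]
            j = int(np.argmin(dists))
            if dists[j] <= gate_deg:  # only associate plausible continuations
                d = unmatched.pop(j)
                tracks[tid].append((t, d))
                last_pos[tid] = d
        # remaining observations spawn new tracks
        for d in unmatched:
            tracks[next_id] = [(t, d)]
            last_pos[next_id] = d
            next_id += 1
    return tracks
```

For example, two sources near 10° and 90° are separated into two labeled trajectories even when one of them briefly yields no observation. Real systems replace the greedy step with probabilistic data association or, as in this thesis, with learned models.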

Tracking multiple speech sources in real-life environments is a notoriously challenging problem, for several reasons. First, the acoustics: noise and reflections off the surfaces of the environment (walls, floor, furniture, …) can bias the DoA estimates and even produce false observations. Second, the observations are usually intermittent, whether due to the intermittent nature of the source itself (speech, for instance) or because, during certain periods, a source may be masked by a stronger one. Recovering such a trajectory amounts to "re-identification" of a speech source. Finally, there are application requirements: since the speech recognition system often runs in real time, the tracker needs to be causal, i.e., it should exploit only present and past information.
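To make these three difficulties concrete, a classical (non-neural) baseline handles them with a causal Kalman filter per source: a gating test rejects improbable observations (e.g. reflections), and the track coasts on its own prediction when the source is silent. The sketch below is illustrative only; the state model and noise parameters are assumptions, not part of the offer:

```python
import numpy as np

class AzimuthKalman:
    """Causal constant-velocity Kalman filter for one source's azimuth (degrees).

    step() uses only present and past observations (causality); a gating test
    discards outliers, and a call with z=None coasts on the prediction
    (intermittent source).
    """
    def __init__(self, az0, q=0.5, r=4.0):
        self.x = np.array([az0, 0.0])                # state: [azimuth, angular velocity]
        self.P = np.diag([r, 1.0])                   # state covariance
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])  # constant-velocity transition
        self.Q = q * np.eye(2)                       # process noise
        self.r = r                                   # measurement noise variance

    def step(self, z=None, gate=3.0):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        if z is not None:
            innov = (z - self.x[0] + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
            S = self.P[0, 0] + self.r                        # innovation variance
            if abs(innov) <= gate * np.sqrt(S):              # gate out false observations
                K = self.P[:, 0] / S                         # Kalman gain
                self.x = self.x + K * innov
                self.P = self.P - np.outer(K, self.P[0, :])
        return self.x[0]
```

Such a filter smooths noisy DoA estimates and survives a spurious reflection or a short silence, but it cannot re-identify a source after a long gap; that is precisely where the speaker embeddings and learned tracking targeted by this thesis come in.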

The goal of this PhD thesis is to devise a multi-source tracking system based entirely, or in part, on deep learning methods. The tracker would integrate the speaker counts and DoA estimates obtained from highly efficient detection and localization modules already developed in a recently defended PhD thesis. Furthermore, these features will be augmented with neural speaker embeddings, with the goal of improving the tracking of intermittent sources. Finally, the complete system could be unified into a single neural network architecture, enabling end-to-end training.

 

About you

Skills and qualities required:

·    Research Master's degree or engineering school diploma

·    Background in signal processing and/or machine learning applied to audio and acoustics

·    Appetite for audio processing

·    In-depth knowledge of Python and Bash

·    Hands-on experience with deep learning toolkits such as PyTorch, TensorFlow, or Kaldi

·    Scientific rigor and creativity

 

Additional information

The practical outcome of the thesis would be a full processing chain comprising counting, localization, and tracking modules running in real time on a standard PC (or an embedded device). The subject of this thesis, at the boundary of microphone array processing and speech recognition, combined with the rapidly progressing deep learning approach, guarantees that the work will be recognized by both academic and industrial communities.

To achieve this, the candidate will have access to dedicated equipment to create databases and test their developments: a room equipped with 30 loudspeakers and the ICARE software to simulate moving sources immersed in realistic and complex acoustic environments.

Concerning the implementation, the PhD student may interact with the team in charge of developing prototypes and integrating localization algorithms into VST plugins for real-time visualization.

 

Department

Orange Innovation brings together the research and innovation activities and expertise of the Group's entities and countries. We work every day to ensure that Orange is recognized by its customers as an innovative operator, and we create value for the Group and the Brand in each of our projects. With 720 researchers and thousands of marketers, developers, designers, and data analysts, it is the expertise of our 6,000 employees that fuels this ambition every day.

Orange Innovation anticipates technological breakthroughs and supports the Group's countries and entities in making the best technological choices to meet the needs of our consumer and business customers.

The CVA (Content Audio Video) team comprises around 20 people, mostly permanent researchers and PhD students, focused on signal processing and machine learning technologies for audio and video. The audio group focuses on microphone array processing, 3D audio rendering, and multichannel compression, in connection with the definition and implementation of international standards in the domain (MPEG, 3GPP).


Apply via Orange Jobs website: https://orange.jobs/jobs/offer.do?joid=111920&lang=EN

 

_________________________________________________________________________________________________________________________

This message and its attachments may contain confidential or privileged information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified.
Thank you.