
[AUDITORY] PhD position on neural voice conversion @ Ircam - Paris, France



Please find below the description of a fully funded PhD position on "Neural voice conversion from disentangled attribute representation".

Regards,

Nicolas Obin

Context
The aim of the EVA ("Explicit Voice Attributes") project is to decipher the codes of human voices by learning explicit, structured representations of voice attributes. Achieving this objective will have a strong scientific and technological impact in at least two fields of application: in speech analysis, it will help untangle the complex web of characteristics that make up a human voice; in voice generation, it will feed a wide range of applications for creating a voice with the desired attributes, enabling the design of what is known as a vocal personality.

The set of attributes will be defined by human expertise or discovered from the data using lightly supervised or unsupervised neural networks, for instance through disentanglement learning or self-supervised representations that automatically discover salient attributes in multi-speaker datasets. It will include a detailed and explicit description of timbre, voice quality, and phonation; speaker-specific traits such as particular pronunciations or speech disorders (e.g., lisping) and regional or non-native accents; and paralinguistic elements such as emotion or style. Ideally, each attribute could be controlled in synthesis and conversion by a degree of intensity, enabling it to be amplified or erased from the voice as part of a structured representation.

The main industrial results expected concern two use cases for voice transformation. The first is voice anonymization: to enable GDPR-compliant voice recordings, voice conversion systems could be configured to remove attributes strongly associated with a speaker's identity, while leaving other attributes unchanged to preserve the intelligibility, naturalness, and expressiveness of the manipulated voice. The second is voice creation: new voices could be sculpted from a set of desired attributes to serve the creative industries.


Scientific objectives
The aim of this thesis is to design, implement, and train neural voice conversion algorithms based on attribute representations. The attributes considered range from "low-level" parameters of a source/filter model of the speech signal, such as F0 and intensity, to "high-level" attributes such as age/gender, accent, or emotion. Attribute representations will either be given directly as input to conversion training or learned jointly with the conversion model. The work carried out should make contributions to one or more of the following issues (a toy code sketch follows the list):
- The design of efficient disentangled representation learning strategies (e.g., information bottleneck, adversarial strategies, mutual information) for the manipulation of speech attributes, starting for instance with acoustic attributes such as F0, intensity, and even voice quality;
- The implementation of expressive neural conversion algorithms capable of converting the speech signal from arbitrary attribute representations given as input during conversion training;
- The design and implementation of neural conversion algorithms capable of conditioning generation on attribute intensity, whether to add, subtract, or amplify/attenuate an attribute in the speech signal.
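
For concreteness, here is a minimal PyTorch sketch of how two of these ideas might be combined: a gradient-reversal adversary that discourages the content code from carrying speaker identity (an adversarial disentanglement strategy), and a FiLM-style scale-and-shift that conditions the decoder on a scalar attribute intensity. Every name, dimension, and architectural choice below is a hypothetical illustration, not the project's actual design.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient in the backward
    # pass, so the encoder is trained to fool the attribute adversary.
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None

class Disentangler(nn.Module):
    def __init__(self, n_mels=80, d_content=128, n_speakers=100):
        super().__init__()
        # Content encoder: mel frames -> (ideally) attribute-free content code.
        self.encoder = nn.GRU(n_mels, d_content, batch_first=True)
        # Adversary predicts speaker identity from the content code; gradient
        # reversal pushes the encoder to remove that information.
        self.speaker_adv = nn.Linear(d_content, n_speakers)
        # FiLM-style conditioning: a scalar intensity yields a per-channel
        # scale (gamma) and shift (beta) applied to the content code.
        self.film = nn.Linear(1, 2 * d_content)
        # Decoder maps the conditioned code back to mel frames.
        self.decoder = nn.GRU(d_content, n_mels, batch_first=True)

    def forward(self, mels, intensity, lamb=1.0):
        content, _ = self.encoder(mels)                    # (B, T, d_content)
        adv_logits = self.speaker_adv(GradReverse.apply(content, lamb))
        gamma, beta = self.film(intensity.unsqueeze(-1)).chunk(2, dim=-1)
        conditioned = gamma.unsqueeze(1) * content + beta.unsqueeze(1)
        out, _ = self.decoder(conditioned)
        return out, adv_logits

# Toy usage: reconstruct mels while the adversary probes for speaker identity.
model = Disentangler()
mels = torch.randn(4, 200, 80)             # batch of mel-spectrograms
speaker_ids = torch.randint(0, 100, (4,))  # speaker labels for the adversary
intensity = torch.rand(4)                  # desired attribute intensity in [0, 1]
recon, adv_logits = model(mels, intensity)
recon_loss = nn.functional.l1_loss(recon, mels)
adv_loss = nn.functional.cross_entropy(adv_logits.mean(dim=1), speaker_ids)
(recon_loss + adv_loss).backward()         # reversal makes adv_loss adversarial

In a real system the decoder would of course be a full synthesis model and an adversary would be needed for each attribute to disentangle; the sketch only shows the shape of the training signal.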
 
All the work carried out will be evaluated according to standard voice identity conversion protocols, and also in conjunction with the project partners to measure the performance of speaker authentication/detection systems under the envisaged scenarios. The advances made will be integrated into a functional prototype that can be evaluated by audio professionals and used in artistic contexts.

Research environment
The work will be carried out at Ircam within the Sound Analysis and Synthesis team, which specializes in voice synthesis and transformation, under the supervision of Nicolas Obin and Axel Roebel. Ircam is a non-profit organization, associated with the Centre National d'Art et de Culture Georges Pompidou, whose missions include research, creation, and teaching around 20th-century music and its relationship with science and technology. Within the joint research unit UMR 9912 STMS (Sciences et Technologies de la Musique et du Son), shared by Ircam, Sorbonne University, the CNRS, and the French Ministry of Culture and Communication, specialized teams carry out research and development in acoustics, sound signal processing, cognitive science, interaction technologies, computer music, and musicology. Ircam is located in the center of Paris, near the Centre Georges Pompidou, at 1 Place Stravinsky, 75004 Paris.

Experience and expected skills
The ideal candidate will have:
- Solid expertise in machine learning, in particular deep neural networks;
- Good knowledge of and experience in automatic speech processing, preferably in the field of speech generation;
- Mastery of digital audio signal processing;
- Excellent command of the Python programming language, the TensorFlow and/or PyTorch environments, and distributed computing on GPU servers;
- An excellent level of written and spoken English;
- Autonomy, teamwork, productivity, rigor, and methodology.
A prior Master's internship in a related field such as neural speech generation will be greatly appreciated.

Application
A CV and a motivation letter must be sent to Axel Roebel and Nicolas Obin at FirstName.LastName@xxxxxxxx

The application deadline is January 15th, 2024.