

Subject: [AUDITORY] PhD position on neural voice conversion @xxxxxxxx Ircam - Paris, France
From:    Nicolas Obin  <Nicolas.Obin@xxxxxxxx>
Date:    Mon, 11 Dec 2023 10:26:07 +0100

Please find below the description of a fully funded PhD position on "Neural voice conversion from disentangled attribute representation".

Regards,

Nicolas Obin


Context

The aim of the EVA ("Explicit Voice Attributes") project is to decipher the codes of human voices by learning explicit, structured representations of voice attributes. Achieving this objective will have a strong scientific and technological impact in at least two fields of application: first, in speech analysis, it will help us understand the complex tangle of characteristics that make up a human voice; second, in voice generation, it will feed a wide range of applications for creating a voice with the desired attributes, enabling the design of what is known as a vocal personality. The set of attributes will be defined by human expertise or discovered from the data using lightly supervised or unsupervised neural networks. It will include a detailed and explicit description of timbre, voice quality, phonation, and speaker-specific traits such as particular pronunciations or speech disorders (e.g., lisping), regional or non-native accents, and paralinguistic elements such as emotion or style. Ideally, each attribute could be controlled in synthesis and conversion by a degree of intensity, enabling it to be amplified or erased from the voice as part of a structured representation. These attributes could be defined by experts, or discovered automatically in multi-speaker datasets by neural methods such as disentangled representation learning or self-supervised representations. The main industrial results expected concern two use cases for voice transformation. The first is voice anonymization: to enable GDPR-compliant (RGPD) voice recordings, voice conversion systems could be configured to remove attributes strongly associated with a speaker's identity, while leaving other attributes unchanged to preserve the intelligibility, naturalness, and expressiveness of the manipulated voice. The second is voice creation: new voices could be sculpted from a set of desired attributes to serve the creative industry.


Scientific objectives

The aim of this thesis is to design, implement, and train neural speech conversion algorithms based on attribute representations. The attributes considered range from "low-level" parameters of a source/filter model of the speech signal, such as F0 and intensity, to "high-level" parameters such as age/gender, accent, or emotion. Attribute representations will either be given directly as input to neural conversion training, or learned jointly with the conversion. The work carried out should contribute to one or more of the following issues (a toy sketch of the first and third items follows the list):
- The design of efficient disentangled representation learning strategies (e.g., information bottleneck, adversarial strategies, mutual information) for the manipulation of speech attributes, starting for instance with acoustic attributes such as F0, intensity, and even voice quality;
- The implementation of expressive neural conversion algorithms capable of converting the speech signal from arbitrary attribute representations given as input to conversion training;
- The design and implementation of neural conversion algorithms capable of conditioning generation on attribute intensity, whether to add, subtract, or amplify/attenuate an attribute in the speech signal.
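
Purely as an illustration (not part of the official announcement), here is a minimal PyTorch sketch of the kind of mechanisms the first and third items point to: an adversarial branch behind a gradient reversal layer that pushes attribute information out of the content code, plus a scalar attribute intensity that conditions the decoder. Every module name, dimension, and loss below is a hypothetical assumption, not the project's actual design.

# Minimal, hypothetical PyTorch sketch: adversarial disentanglement of a
# scalar attribute (e.g., a normalized F0 level) from a content code, with
# an intensity scalar conditioning the decoder. Illustrative only.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; reverses and scales gradients in the
    # backward pass, so the encoder is trained to *remove* attribute
    # information while the classifier tries to recover it.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class AttributeConverter(nn.Module):
    def __init__(self, n_mels=80, d=256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, d, batch_first=True)  # content encoder
        self.attr_clf = nn.Linear(d, 1)   # adversary: predicts the attribute from the content code
        self.decoder = nn.GRU(d + 1, d, batch_first=True)   # +1 input for the intensity scalar
        self.out = nn.Linear(d, n_mels)

    def forward(self, mel, intensity, lam=1.0):
        h, _ = self.encoder(mel)                              # (B, T, d) content code
        adv = self.attr_clf(GradReverse.apply(h, lam))        # adversarial attribute prediction
        cond = intensity[:, None, None].expand(-1, h.size(1), 1)
        y, _ = self.decoder(torch.cat([h, cond], dim=-1))     # intensity-conditioned decoding
        return self.out(y), adv

# One toy training step on random tensors standing in for mel-spectrograms.
model = AttributeConverter()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
mel = torch.randn(4, 100, 80)   # batch of 4 utterances, 100 frames, 80 mel bins
attr = torch.rand(4)            # per-utterance attribute value in [0, 1]
opt.zero_grad()
recon, adv = model(mel, intensity=attr)
loss = nn.functional.l1_loss(recon, mel) \
     + nn.functional.mse_loss(adv.mean(dim=1).squeeze(-1), attr)
loss.backward()
opt.step()

At conversion time, the same decoder could in principle be driven with an intensity value different from the one measured on the input, amplifying or attenuating the attribute in the output, which is the behavior the third item asks for.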

All the work carried out will be evaluated according to standard voice identity conversion protocols, and also in conjunction with the project partners, to measure the performance of authentication/detection systems under the envisaged scenarios. The advances made will be integrated into a functional prototype that could be evaluated by audio professionals and used in artistic contexts.


Research environment

The work will be carried out at Ircam within the Sound Analysis and Synthesis team, which specializes in voice synthesis and transformation, under the supervision of Nicolas Obin and Axel Roebel. Ircam is a non-profit organization, associated with the Centre National d'Art et de Culture Georges Pompidou, whose missions include research, creation, and teaching activities around 20th-century music and its relationship with science and technology. Within the joint research unit UMR 9912 STMS (Sciences et Technologies de la Musique et du Son), shared by Ircam, Sorbonne University, CNRS, and the French Ministry of Culture and Communication, specialized teams carry out research and development in acoustics, sound signal processing, cognitive science, interaction technologies, computer music, and musicology. Ircam is located in the center of Paris, near the Centre Georges Pompidou, at 1, Place Stravinsky, 75004 Paris.


Experience and expected skills

The ideal candidate will have:
- Solid expertise in machine learning, in particular deep neural networks;
- Good knowledge of and experience in automatic speech processing, preferably in the field of speech generation;
- Mastery of digital audio signal processing;
- Excellent command of the Python programming language, the TensorFlow and/or PyTorch environments, and distributed computing on GPU servers;
- An excellent level of written and spoken English;
- Autonomy, teamwork, productivity, rigor, and methodology.

A preliminary Master's internship in the related field of neural speech generation will be greatly appreciated.


Application

A CV and a motivation letter must be sent to Axel Roebel and Nicolas Obin at FirstName.LastName@xxxxxxxx

The application deadline is January 15, 2024.

