Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic?

Subject: Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic?

From: PIerre DIVENYI <pdivenyi@xxxxxxxxxxxxxxxxxx>

Date: Tue, 26 Jun 2018 08:41:13 -0700

Approved-by: pdivenyi@xxxxxxxxxxxxxxxxxx

Comments: To: Samer Hijazi <hijazi@xxxxxxxxxxxxxx>, René Carré <recarre@xxxxxxxxx>, mmrayati@xxxxxxxxx

Reply-to: PIerre DIVENYI <pdivenyi@xxxxxxxxxxxxxxxxxx>

Sender: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxxxxxxxxx>

As an early proponent of a combined approach to speech understanding and separation, I agree with most of what has been said here. However, I would like to add that we could reach our ultimate goal more easily if the dynamics, the changes, were dealt with in a more explicit fashion. So, unashamedly, I would like to recommend that all of you take a look at our book “Speech: A dynamic process” (Carré, Divenyi, and Mrayati; de Gruyter 2017).

Pierre Divenyi

Sent from my autocorrecting iPad

On Jun 25, 2018, at 10:09 PM, Samer Hijazi <hijazi@xxxxxxxxxxxxxx> wrote:

Hi Phil,

Thanks for your insightful response and for pointing me to your publication on this topic from 2003.
I am particularly intrigued by your comment:
"It would be wrong to start with clean speech, add noise, use that as input and clean speech + text as training targets, because in real life speech & other sound sources don't combine like that."

There are many recent publications on speech enhancement that use a simple additive noise model, and sometimes a room impulse response (RIR) simulator, and they report impressive results. Is there a need to incorporate anything beyond RIR to generalize the training dataset, so that the resulting solution works properly in the real world?
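For readers unfamiliar with the simulation recipe in question, here is a minimal sketch of it in Python with NumPy: clean speech is optionally convolved with an RIR, and noise is scaled to a chosen SNR and added. The function name `mix_at_snr` and its parameters are hypothetical, chosen for illustration; this is not taken from any of the papers mentioned, and it shows exactly the simplified model whose realism is being questioned in this thread.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rir=None):
    """Return (noisy, target): target speech plus noise scaled to snr_db.

    Hypothetical helper illustrating the simple additive model only.
    Assumes the noise signal is not all zeros.
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if rir is not None:
        # Reverberation: convolve with the room impulse response,
        # then truncate to the original speech length.
        clean = np.convolve(clean, rir)[: len(clean)]
    # Tile or truncate the noise so it matches the speech length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    speech_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that yields the requested speech-to-noise power ratio.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise, clean
```

The pair `(noisy, target)` is what such studies typically use as network input and enhancement target. Note that the speech itself is unchanged by the presence of the noise in this model.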

Regards,

Samer
On Mon, Jun 25, 2018 at 9:13 PM Phil Green <p.green@xxxxxxxxxxxxxxx> wrote:
On 25/06/2018 17:00, Samer Hijazi wrote:

Thanks Laszlo and Phil,
I am not speaking about doing ASR in two steps; I am speaking about doing the ASR and speech enhancement jointly in a multi-objective learning process.

Ah, you mean multitask learning. That didn't come over at all in your first mail.

There are many papers showing that if you use related objectives to train your network, you get better results on both objectives than you would if you trained for each one separately.

An early paper on this, probably the first application to ASR, was

Parveen & Green, Multitask Learning in Connectionist Robust ASR using Recurrent Neural Networks, Eurospeech 2003.

And it seems obvious that if we use speech content (i.e. text) and the perfect speech waveform as two independent but correlated targets, we will end up with better text recognition and better speech enhancement; am I missing something?
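As a rough illustration of the two-target idea described above, the sketch below combines an enhancement loss and a recognition loss into a single training objective. It is an assumption-laden toy: the name `multitask_loss`, the weight `alpha`, and the use of frame-level class labels with cross-entropy are all hypothetical choices made here, not the setup of any paper cited in this thread (a real ASR system would more likely use a sequence-level loss).

```python
import numpy as np

def multitask_loss(enhanced, clean, logits, labels, alpha=0.5):
    """Weighted sum of an enhancement loss (MSE) and a recognition loss.

    enhanced, clean: arrays of the same shape (e.g. frames x features).
    logits: frames x classes scores; labels: integer class per frame.
    alpha is a hypothetical trade-off weight between the two tasks.
    """
    enhanced = np.asarray(enhanced, dtype=float)
    clean = np.asarray(clean, dtype=float)
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Enhancement term: mean squared error against the clean target.
    mse = np.mean((enhanced - clean) ** 2)
    # Recognition term: cross-entropy via a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(labels)), labels])
    return alpha * mse + (1.0 - alpha) * ce, mse, ce
```

In a multitask network both terms would be computed from outputs that share hidden layers, so the gradient of the combined loss shapes a common representation.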

It would be wrong to start with clean speech, add noise, use that as input and clean speech + text as training targets, because in real life speech & other sound sources don't combine like that. That's why the spectacular results in the Parveen/Green paper are misleading.

HTH
-- 
*** note email is now p.green@xxxxxxxxxx ***
Professor Phil Green
SPandH
Dept of Computer Science
University of Sheffield
*** note email is now p.green@xxxxxxxxxx ***