Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic?

Subject: Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic?

From: Phil Green <p.green@xxxxxxxxxxxxxxx>

Date: Mon, 25 Jun 2018 18:25:35 +0100

Comments: To: Samer Hijazi <hijazi@xxxxxxxxxxxxxx>

Reply-to: Phil Green <p.green@xxxxxxxxxxxxxxx>

Sender: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxxxxxxxxx>

On 25/06/2018 17:00, Samer Hijazi wrote:

Thanks Laszlo and Phil,
I am not speaking about doing ASR in two steps; I am speaking about doing ASR and speech enhancement jointly in a multi-objective learning process.

Ah, you mean multitask learning. That didn't come over at all in your first mail.

There are many papers showing that if you use related objectives to train your network, you will get better results on both objectives than you would get if you trained for each one separately.

An early paper on this, probably the first application to ASR, was

Parveen & Green, Multitask Learning in Connectionist Robust ASR using Recurrent Neural Networks, Eurospeech 2003.
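
For readers unfamiliar with the term, here is a minimal sketch of what a multitask (multi-objective) training loss could look like for this pair of tasks: one shared network produces both an enhanced signal and per-frame class scores, and the two errors are summed. This is not the method of the paper cited above; the function names, the choice of mean squared error and softmax cross-entropy, and the `weight` parameter are all assumptions made for illustration.

```python
import math

def mse(predicted, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def cross_entropy(logits, label_index):
    """Softmax cross-entropy for one frame, computed stably via log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label_index]

def multitask_loss(enhanced, clean, logits, label_index, weight=0.5):
    """Weighted sum of an enhancement loss and a recognition loss.

    `weight` is a hypothetical trade-off parameter, not a value from the thread.
    """
    enhancement_term = mse(enhanced, clean)
    recognition_term = cross_entropy(logits, label_index)
    return weight * enhancement_term + (1.0 - weight) * recognition_term

# Toy values: a two-sample "waveform" and three class scores for one frame.
loss = multitask_loss([0.1, 0.4], [0.0, 0.5], [2.0, 0.5, -1.0], label_index=0)
```

In a real system both terms would be back-propagated through the shared layers, which is where any benefit of the second objective would have to come from.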

And it seems obvious that if we used the speech content (i.e. text) and the perfect speech waveform as two independent but correlated targets, we would end up with better text recognition and better speech enhancement; am I missing something?

It would be wrong to start with clean speech, add noise, use that as input and clean speech + text as training targets, because in real life speech & other sound sources don't combine like that. That's why the spectacular results in the Parveen/Green paper are misleading.
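
To make the criticized construction concrete, the sketch below builds a training input by scaling a noise signal to a chosen signal-to-noise ratio and adding it sample by sample to clean speech. This is the idealized additive model that the paragraph above warns against, shown only so it is clear what is being assumed; the helper names and the SNR-based scaling are illustrative choices, and the sketch ignores everything else about a real recording, such as room acoustics.

```python
import math

def power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio is `snr_db`, then add it."""
    target_ratio = 10 ** (snr_db / 10)
    gain = math.sqrt(power(clean) / (power(noise) * target_ratio))
    return [c + gain * n for c, n in zip(clean, noise)]

# Toy signals standing in for clean speech and a noise recording.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(clean, noise, snr_db=10.0)  # network input; `clean` is the target
```

A network trained on such pairs sees noise that is exactly separable from the target by subtraction, which is one reason results on this kind of data can look better than they would on real recordings.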

HTH

-- 
*** note email is now p.green@xxxxxxxxxx ***
Professor Phil Green
SPandH
Dept of Computer Science
University of Sheffield