Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic? (Pierre Divenyi)


Subject: Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic?
From:    Pierre Divenyi <pdivenyi@xxxxxxxx>
Date:    Tue, 26 Jun 2018 08:41:13 -0700
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

As an early proponent of a combined approach to speech understanding and separation, I agree with most that has been said here. However, I would like to add that we could get to our ultimate goal more easily if the dynamics, the changes, were dealt with in a more explicit fashion. So, unashamedly, I would like to recommend that all of you take a look at our book “Speech: A Dynamic Process” (Carré, Divenyi, and Mrayati; de Gruyter 2017).

Pierre Divenyi

Sent from my autocorrecting iPad

> On Jun 25, 2018, at 10:09 PM, Samer Hijazi <hijazi@xxxxxxxx> wrote:
>
> Hi Phil,
>
> Thanks for your insightful response and for pointing me to your publication on this topic from 2003.
> I am particularly intrigued by your comment:
> "It would be wrong to start with clean speech, add noise, use that as input and clean speech + text as training targets, because in real life speech & other sound sources don't combine like that."
>
> There are many recent publications on speech enhancement that use a simple additive noise model, and sometimes an RIR simulator, and they report impressive results. Is there a need to incorporate anything beyond RIRs to generalize the training dataset, so that the resulting solution would work properly in the real world?
>
> Regards,
>
> Samer
>
>> On Mon, Jun 25, 2018 at 9:13 PM Phil Green <p.green@xxxxxxxx> wrote:
>>
>>> On 25/06/2018 17:00, Samer Hijazi wrote:
>>> Thanks Laszlo and Phil,
>>> I am not speaking about doing ASR in two steps; I am speaking about doing the ASR and speech enhancement jointly, in a multi-objective learning process.
>>
>> Ah, you mean multitask learning. That didn't come over at all in your first mail.
>>
>>> There are many papers showing that if you use related objectives to train your network, you will get better results on both objectives than you would get if you trained for each one separately.
>>
>> An early paper on this, probably the first application to ASR, was
>>
>> Parveen & Green, Multitask Learning in Connectionist Robust ASR using Recurrent Neural Networks, Eurospeech 2003.
>>
>>> And it seems obvious that if we used the speech content (i.e. the text) and the perfect speech waveform as two independent but correlated targets, we would end up with better text recognition and better speech enhancement; am I missing something?
>>
>> It would be wrong to start with clean speech, add noise, use that as input and clean speech + text as training targets, because in real life speech & other sound sources don't combine like that. That's why the spectacular results in the Parveen/Green paper are misleading.
>>
>> HTH
>> --
>> *** note email is now p.green@xxxxxxxx ***
>> Professor Phil Green
>> SPandH
>> Dept of Computer Science
>> University of Sheffield
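
For concreteness, a minimal sketch of the multitask setup discussed in the thread: a shared encoder feeding an enhancement head (regressing clean features) and an ASR head (trained with CTC against the transcript). This assumes PyTorch, and the architecture, loss weighting, and every name in it are illustrative assumptions, not something specified by the posters.

import torch
import torch.nn as nn

class JointEnhanceASR(nn.Module):
    """Shared encoder with an enhancement head and a CTC-based ASR head (hypothetical sketch)."""
    def __init__(self, n_feats=80, n_tokens=32, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.enhance_head = nn.Linear(2 * hidden, n_feats)  # regresses clean features
        self.asr_head = nn.Linear(2 * hidden, n_tokens)     # token logits for CTC

    def forward(self, noisy_feats):                         # noisy_feats: (B, T, n_feats)
        shared, _ = self.encoder(noisy_feats)               # (B, T, 2*hidden)
        return self.enhance_head(shared), self.asr_head(shared)

def joint_loss(model, noisy, clean, tokens, feat_lens, token_lens, alpha=0.5):
    """Weighted sum of the two task losses; alpha trades enhancement against ASR.
    tokens is a (B, S_max) padded target tensor; feat_lens/token_lens are per-utterance lengths."""
    enhanced, logits = model(noisy)
    mse = nn.functional.mse_loss(enhanced, clean)
    log_probs = logits.log_softmax(-1).transpose(0, 1)      # CTC expects (T, B, n_tokens)
    ctc = nn.functional.ctc_loss(log_probs, tokens, feat_lens, token_lens, blank=0)
    return alpha * mse + (1.0 - alpha) * ctc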
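And a sketch of the "additive noise + RIR" simulation pipeline Samer refers to, assuming NumPy/SciPy (the function and its parameters are hypothetical). Phil's caveat is precisely that real acoustic scenes are not assembled this way, so models trained only on such mixtures may not generalize:

import numpy as np
from scipy.signal import fftconvolve

def simulate_mixture(clean, noise, rir, snr_db):
    """Convolve clean speech with a room impulse response, then add noise at a target SNR."""
    reverberant = fftconvolve(clean, rir)[:len(clean)]
    noise = noise[:len(reverberant)]
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12               # guard against silent noise clips
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + gain * noise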

