Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic? (Jonathan Le Roux)


Subject: Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic?
From:    Jonathan Le Roux  <Jonathan.Le-Roux@xxxxxxxx>
Date:    Wed, 27 Jun 2018 22:53:32 +0900
List-Archive: <http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Hi all,

As our group has worked on related topics, please allow me to mention some plugs, I mean, references.

- Using features derived from an ASR output (itself computed on the output of a first speech enhancement network) as input to a second enhancement system: Hakan Erdogan, John R. Hershey, Shinji Watanabe, Jonathan Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2015. http://www.jonathanleroux.org/pdf/Erdogan2015ICASSP04.pdf

- Jointly training a speech separation network and an end-to-end ASR network for multi-speaker ASR: Shane Settle, Jonathan Le Roux, Takaaki Hori, Shinji Watanabe, John R. Hershey, "End-to-End Multi-Speaker Speech Recognition," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2018. http://www.jonathanleroux.org/pdf/Settle2018ICASSP04.pdf

There has also been a lot of work in multi-channel settings, around the CHiME-3 and CHiME-4 workshops. In particular, training beamforming and ASR networks jointly has recently been proposed:

- Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey, "Multichannel End-to-End Speech Recognition," International Conference on Machine Learning (ICML), Aug. 2017.

- Tsubasa Ochiai, Shinji Watanabe, Shigeru Katagiri, "Does speech enhancement work with end-to-end ASR objectives?: Experimental analysis of multichannel end-to-end ASR," IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Oct. 2017.

Best,
Jonathan

Jonathan Le Roux <Jonathan.Le-Roux@xxxxxxxx>
Senior Principal Research Scientist, Speech & Audio Team Leader
MERL - Mitsubishi Electric Research Laboratories
201 Broadway, 8th Floor, Cambridge, MA 02139
Tel.: +1-617-621-7547   Fax: +1-617-621-7550
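To make concrete what "training jointly" means in the references above, here is a minimal sketch of a single network trained with a combined enhancement + ASR objective. It is only an illustration under stated assumptions: the architecture, layer sizes, frame-level token targets, and the 0.5 loss weight are placeholders, not the configuration of any of the systems cited.

    # Minimal PyTorch-style sketch of joint enhancement + ASR training.
    # All choices (sizes, loss weight alpha, frame-level token targets) are
    # illustrative assumptions, not the setups of the papers cited above.
    import torch
    import torch.nn as nn

    class JointEnhanceASR(nn.Module):
        def __init__(self, n_bins=257, n_hidden=256, n_tokens=30):
            super().__init__()
            self.encoder = nn.LSTM(n_bins, n_hidden, batch_first=True)
            self.mask_head = nn.Linear(n_hidden, n_bins)   # enhancement: T-F mask
            self.asr_head = nn.Linear(n_hidden, n_tokens)  # ASR: per-frame logits

        def forward(self, noisy_spec):
            h, _ = self.encoder(noisy_spec)            # (B, T, n_hidden)
            mask = torch.sigmoid(self.mask_head(h))    # (B, T, n_bins)
            enhanced = mask * noisy_spec               # masked magnitude spectrogram
            logits = self.asr_head(h)                  # (B, T, n_tokens)
            return enhanced, logits

    model = JointEnhanceASR()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    alpha = 0.5  # relative weight of the two objectives (assumption)

    # Dummy batch: noisy/clean magnitude spectrograms and frame-level tokens.
    noisy = torch.rand(4, 100, 257)
    clean = torch.rand(4, 100, 257)
    tokens = torch.randint(0, 30, (4, 100))

    enhanced, logits = model(noisy)
    loss_enh = nn.functional.mse_loss(enhanced, clean)
    loss_asr = nn.functional.cross_entropy(logits.reshape(-1, 30), tokens.reshape(-1))
    loss = alpha * loss_enh + (1 - alpha) * loss_asr   # single joint objective
    loss.backward()
    opt.step()

Because both heads share the encoder, gradients from the ASR loss shape the representation used for enhancement and vice versa, which is the mechanism behind the joint and multitask approaches discussed in this thread.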
On Wed, Jun 27, 2018 at 1:23 PM Pierre DIVENYI <pdivenyi@xxxxxxxx> wrote:

> As an early proponent of a combined approach to speech understanding and
> separation, I agree with most that has been said here. However, I would
> like to add that we could reach our ultimate goal more easily if the
> dynamics, the changes, were dealt with in a more explicit fashion. So,
> unashamedly, I would like to recommend that all of you take a look at our
> book "Speech: A dynamic process" (Carré, Divenyi, and Mrayati; de Gruyter
> 2017).
>
> Pierre Divenyi
>
> Sent from my autocorrecting iPad
>
> On Jun 25, 2018, at 10:09 PM, Samer Hijazi <hijazi@xxxxxxxx> wrote:
>
> Hi Phil,
>
> Thanks for your insightful response and for pointing me to your
> publication on this topic from 2003.
>
> I am particularly intrigued by your comment:
> "It would be wrong to start with clean speech, add noise, use that as
> input and clean speech + text as training targets, because in real life
> speech & other sound sources don't combine like that."
>
> There are many recent publications on speech enhancement that use a
> simple additive noise model, and sometimes an RIR simulator, and they
> report impressive results. Is there a need to incorporate anything beyond
> RIRs to generalize the training dataset and create a solution that would
> work properly in the real world?
>
> Regards,
>
> Samer
>
> On Mon, Jun 25, 2018 at 9:13 PM Phil Green <p.green@xxxxxxxx> wrote:
>
>> On 25/06/2018 17:00, Samer Hijazi wrote:
>>
>> Thanks Laszlo and Phil,
>> I am not speaking about doing ASR in two steps, I am speaking about
>> doing the ASR and speech enhancement jointly in a multi-objective
>> learning process.
>>
>> Ah, you mean multitask learning. That didn't come over at all in your
>> first mail.
>>
>> There are many papers showing that if you use related objectives to
>> train your network, you will get better results on both objectives than
>> you would get if you trained for each one separately.
>>
>> An early paper on this, probably the first application to ASR, was
>>
>> *Parveen & Green, Multitask Learning in Connectionist Robust ASR using
>> Recurrent Neural Networks, Eurospeech 2003.*
>>
>> And it seems obvious that if we used the speech content (i.e., text) and
>> the perfect speech waveform as two independent but correlated targets,
>> we would end up with better text recognition and better speech
>> enhancement; am I missing something?
>>
>> It would be wrong to start with clean speech, add noise, use that as
>> input and clean speech + text as training targets, because in real life
>> speech & other sound sources don't combine like that. That's why the
>> spectacular results in the Parveen/Green paper are misleading.
>>
>> HTH
>>
>> --
>> *** note email is now p.green@xxxxxxxx ***
>> Professor Phil Green
>> SPandH
>> Dept of Computer Science
>> University of Sheffield
>> *** note email is now p.green@xxxxxxxx ***
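The "simple additive noise model, and sometimes RIR simulator" that Samer asks about above usually amounts to the recipe sketched below: convolve clean speech with a room impulse response, then add noise scaled to a target SNR. This is a minimal illustration only; the signals and the 5 dB SNR are placeholders, and it is exactly the simplified mixing model whose real-world realism Phil's comment questions.

    # Minimal sketch of the common "RIR + additive noise" simulation recipe:
    # x = (s * rir) + noise scaled to a chosen SNR.
    # The signals below are random placeholders; in practice s, rir, and noise
    # come from speech corpora, measured/simulated RIRs, and noise recordings.
    import numpy as np
    from scipy.signal import fftconvolve

    rng = np.random.default_rng(0)
    fs = 16000
    s = rng.standard_normal(3 * fs)      # stand-in for a clean speech waveform
    rir = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800.0)  # toy decaying RIR
    noise = rng.standard_normal(3 * fs)  # stand-in for a noise recording

    def mix_at_snr(speech, noise, snr_db):
        """Scale noise so that 10*log10(P_speech / P_noise) == snr_db, then add."""
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
        return speech + scale * noise

    reverberant = fftconvolve(s, rir)[: len(s)]      # apply the room impulse response
    noisy = mix_at_snr(reverberant, noise, snr_db=5.0)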

