Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic? (Ricard Marxer )


Subject: Re: [AUDITORY] Why is it that joint speech-enhancement with ASR is not a popular research topic?
From:    Ricard Marxer  <ricardmp@xxxxxxxx>
Date:    Tue, 26 Jun 2018 14:32:39 +0200
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

On Tue, Jun 26, 2018 at 7:31 AM Samer Hijazi <hijazi@xxxxxxxx> wrote:
> Hi Phil,
>
> Thanks for your insightful response and for pointing me to your
> publication on this topic from 2003.
> I am particularly intrigued by your comment:
> "It would be wrong to start with clean speech, add noise, use that as
> input and clean speech + text as training targets, because in real life
> speech & other sound sources don't combine like that."
>
> Many recent publications on speech enhancement use a simple additive
> noise model, sometimes with an RIR simulator, and they report
> impressive results. Is there a need to incorporate anything beyond RIRs
> to generalize the training dataset and create a solution that would
> work properly in the real world?

In addition to what Phil said, you should take into consideration the
changes in speech due to environmental noise (e.g. the Lombard effect).
See:
https://www.sciencedirect.com/science/article/pii/S0167639317302674?via%3Dihub

In relation to your initial query, we have also done some preliminary
work on ASR-based speech enhancement:
https://www.isca-speech.org/archive/Interspeech_2017/abstracts/1257.html

Best regards,
Ricard

> Regards,
>
> Samer
>
> On Mon, Jun 25, 2018 at 9:13 PM Phil Green <p.green@xxxxxxxx> wrote:
>
>> On 25/06/2018 17:00, Samer Hijazi wrote:
>>
>>> Thanks Laszlo and Phil,
>>> I am not speaking about doing ASR in two steps; I am speaking about
>>> doing ASR and speech enhancement jointly in a multi-objective
>>> learning process.
>>
>> Ah, you mean multitask learning. That didn't come over at all in your
>> first mail.
>>
>>> There are many papers showing that if you use related objectives to
>>> train your network, you will get better results on both objectives
>>> than if you train for each one separately.
>>
>> An early paper on this, probably the first application to ASR, was
>>
>> Parveen & Green, "Multitask Learning in Connectionist Robust ASR using
>> Recurrent Neural Networks", Eurospeech 2003.
>>
>>> And it seems obvious that if we used the speech content (i.e. text)
>>> and the clean speech waveform as two independent but correlated
>>> targets, we would end up with better text recognition and better
>>> speech enhancement; am I missing something?
>>
>> It would be wrong to start with clean speech, add noise, use that as
>> input and clean speech + text as training targets, because in real
>> life speech & other sound sources don't combine like that. That's why
>> the spectacular results in the Parveen/Green paper are misleading.
>>
>> HTH
>>
>> --
>> *** note email is now p.green@xxxxxxxx ***
>> Professor Phil Green
>> SPandH
>> Dept of Computer Science
>> University of Sheffield

--
ricard
http://twitter.com/ricardmp
http://www.ricardmarxer.com
http://www.caligraft.com
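[Archive editor's note: the "additive noise + RIR" simulation recipe discussed in the thread can be sketched in a few lines. Everything below (the helper name, the exponential-decay toy RIR) is an illustrative assumption, not anyone's actual pipeline.]

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(speech, noise, rir, snr_db):
    """Toy 'additive noise + RIR' simulation: reverberate the speech,
    then add noise scaled so the speech-to-noise power ratio equals
    snr_db. Hypothetical helper, for illustration only."""
    reverb = np.convolve(speech, rir)[: len(speech)]
    p_s = np.mean(reverb ** 2)                      # reverberant speech power
    p_n = np.mean(noise ** 2)                       # raw noise power
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return reverb + gain * noise[: len(reverb)]

speech = rng.standard_normal(16000)                 # stand-in for 1 s of speech
noise = rng.standard_normal(16000)                  # stand-in for noise
rir = np.exp(-np.linspace(0.0, 8.0, 800))           # toy exponential-decay "room"
mixture = mix_at_snr(speech, noise, rir, snr_db=5.0)
```

As the thread points out, this model ignores how real sources combine, and in particular that talkers change their speech in noise (Lombard effect), so data simulated this way may not transfer to the real world.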
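[Archive editor's note: the multitask setup discussed above (one shared representation with an enhancement target and a text target, trained on a weighted joint loss) can be sketched as a minimal forward pass. All shapes, weights, and the weighting `alpha` below are made-up illustrations, not the Parveen & Green architecture.]

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy shared representation: one layer feeding two task-specific heads.
x = rng.standard_normal((4, 16))            # batch of noisy feature frames
W_shared = 0.1 * rng.standard_normal((16, 32))
h = np.tanh(x @ W_shared)                   # shared hidden representation

# Head 1: enhancement (regress clean features). Head 2: ASR (classify phones).
W_enh = 0.1 * rng.standard_normal((32, 16))
W_asr = 0.1 * rng.standard_normal((32, 10))
clean_target = rng.standard_normal((4, 16)) # stand-in for clean-speech features
phone_target = np.array([0, 3, 3, 7])       # stand-in for phone labels

enh_out = h @ W_enh
logits = h @ W_asr
logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

loss_enh = np.mean((enh_out - clean_target) ** 2)        # enhancement MSE
loss_asr = -np.mean(logp[np.arange(4), phone_target])    # ASR cross-entropy

alpha = 0.5                                 # task weight, a tunable choice
loss = alpha * loss_asr + (1 - alpha) * loss_enh
```

Training on the gradient of `loss` updates `W_shared` with signal from both tasks, which is the mechanism by which each objective can regularize the other.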


This message came from the mail archive
src/postings/2018/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University