Re: [AUDITORY] Summary: How we talk to machines that listen ("Richard F. Lyon")


Subject: Re: [AUDITORY] Summary: How we talk to machines that listen
From:    "Richard F. Lyon"  <dicklyon@xxxxxxxx>
Date:    Sun, 24 Feb 2019 11:11:51 -0800
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

"Listening" seems more purposeful, intelligently directed, and ecological than what our current technologies do, which is one reason I've preferred "machine hearing", which sounds a little more passive, or "bottom-up". But both are fine.

My book subtitle "Extracting Meaning from Sound" is a bit more on the active side, but I carefully explain that by "meaning" I mean information that helps in doing a task or application, not necessarily an intelligent or ecological or cultural one. We'll advance towards those, too, of course.

I don't understand why Kent says "The core mechanisms of ASA (integration, segmentation, sequential group, etc) can only exist in embodied, conscious organisms."

Dick

On Sat, Feb 23, 2019 at 9:10 PM Valeriy Shafiro <firosha@xxxxxxxx> wrote:
> Dear Kent,
>
> Thank you for these ideas. I agree these are worth considering, even if I do not necessarily agree with your specific take on things. Based on my own reading of Gibson and other ecological psychologists, I don't necessarily think "machine listening" is as much of a terminological or metaphysical trap as it may seem from your post. We can argue for a long time about whether using "remembering" is better than "memory", but at some point it may be better to move on with the terms which, however imperfect, can still be useful for the problems you are trying to solve. Similarly, I used "listening" in my post as a shorthand for machines that detect and respond to speech and that people can interact with through speech. Of course, as we move on with new technologies, there is a lot of room for ambiguity when we talk about machines that can perform more and more human-like actions. However, most people find it acceptable to talk about sunrise and sunset, while knowing that both are due to the rotation of the Earth rather than the Sun going up and down, or to say "the weather has a mind of its own" without really thinking that the weather is conscious. So, in this particular case, I am not convinced that talking about listening machines and related speech production changes represents either a symptom of "a time when numbers and funding supersedes scientific curiosity" or "an effort to imbue machines with the qualities of 'life', 'consciousness' and 'humanity' which we know they do not and cannot have." That said, how we talk *about* machines we talk to is a very interesting subject in its own right.
>
> Best,
>
> Valeriy
>
> On Sat, Feb 23, 2019 at 4:06 PM Kent Walker <kent.walker@xxxxxxxx> wrote:
>
>> From the perspective of auditory scene analysis (ASA), notions of "machine listening" and "computational auditory scene analysis" are premature. The core mechanisms of ASA (integration, segmentation, sequential group, etc) can only exist in embodied, conscious organisms.
>>
>> I think Al Bregman would agree that there is no listening without ecology, and there is no listening without culture. His work is both an homage to and a rebellion against Gibson's post-behaviourist ecological psychology. The use of "machine listening" and "computational auditory scene analysis" from this scientific perspective is disingenuous and r.
>>
>> There are many leading researchers in the areas of embodied cognition and ecological psychoacoustics on this list.
>> That they remain quiet when words which are important to ASA are misused signals to me that we have entered a period of time similar to that when behaviourism was king: a time when numbers and funding supersedes scientific curiosity.
>>
>> From: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxx> on behalf of Valeriy Shafiro <firosha@xxxxxxxx>
>> Reply-To: Valeriy Shafiro <firosha@xxxxxxxx>
>> Date: Wednesday, 20 February, 2019 1:50 PM
>> To: "AUDITORY@xxxxxxxx" <AUDITORY@xxxxxxxx>
>> Subject: Summary: How we talk to machines that listen
>>
>> Hello List,
>>
>> I recently asked on this list about how we talk to machines that listen. Specifically, whether adjustments which people make in their speech when talking to machines are in fact optimal for improving recognition. Here is a summary of replies to my post.
>>
>> Several people commented on the importance of expectations about machine listening that affect speakers' production, in terms of trying to optimize recognition based on what one believes the machine "needs" to hear.
>>
>> Elle O'Brien observed that she tends to use clear speech in order to maximize information in the acoustic signal:
>> "I believe I talk in clear speech to my devices because I anticipate that automatic speech recognition can barely make use of contextual/semantic/visual clues, and so I need to maximize the information present in the acoustic signal. Because I really expect that's all the machine can go on. I think of it more as the kind of speech I'd use in an adverse listening environment, or with a non-native speaker, than a dog or a child."
>>
>> Sam Fisher shared a developer's perspective based on his experience programming Alexa and Google Assistant:
>> "I have so far experienced that much of a typical user's training for how to act with a conversational agent is contextual to dealing with forms and computers more so than dogs or people. However, I intuit that there is a "nearest neighbor" effect at play, in which a user draws on their most closely analogous experience in order to inform their production. Therefore the reference point may vary geographically and generationally.
>>
>> Most of the users I program for have had to deal with annoying touchtone phone systems that present us with a discrete set of outcomes and navigate us through a "flowchart." I believe the user's common assumptions come from a rough understanding that they are being pathed through such a flowchart and just need to say the right thing to get what they want.
>>
>> In the design of next-generation conversational interfaces that are a bit more capable of keeping up with a natural flow of dialogue, we try to disrupt the user's expectations from the voice interfaces of the past that were unable to really use all the advantages of "conversation" as a means to efficient, cooperative, and frictionless information exchange. The user is trained to become comfortable speaking to it in a somewhat "easier" manner, putting more of the burden of comprehension on the machine conversant."
>>
>> Leon van Noorden, though not speaking about computer-directed speech specifically, also highlighted the point about speaker expectations about the listener that can lead to maladaptive adjustments in speech production.
>> "My wife is deaf since birth and an extremely good lipreader. She notice= s >> quite often that, when she says she is deaf, people go into an exaggerat= ed >> speech mode, that does not help her to understand them." >> >> As these example illustrate =E2=80=93 the adjustments in speech producti= on that >> people make to improve speech recognition, may or may not be optimal. >> While some adjustments may intuitively seem beneficial, like speaking >> louder to improve SNR, separating words and speaking slower to aid >> segmentation, or hyperarticulating specific sounds, there are some reaso= ns >> to believe, that it may not always help. Or rather that it depends. Ju= st >> like with other clear speech models, speakers also may differ considerab= ly >> in what adjustments they make. >> >> Bill Whitmer wrote: >> =E2=80=9CThere=E2=80=99s been some recent work on ASR and vocal effort b= y Ricard Marxer >> while he was at Sheffield (currently at Marseilles) on this. As your >> question suggests, they did find that when trained on normal effort spee= ch, >> the ASR fails with Lombard speech. But then when trained on material >> matched for vocal effort, the Lombard speech is better recognised. There >> was a whole lotta variance based on speaker, though =E2=80=93 not too su= rprising. >> >> https://www.sciencedirect.com/science/article/pii/S0167639317302674 >> >> I=E2=80=99d suspect that the Googles and Amazons have been trying to ta= ckle this >> for awhile, but perhaps the variance keeps them from announcing some gra= nd >> solution. Your point about people adopting non-optimal strategies is mos= t >> likely true. It also might have an age factor. There=E2=80=99s been rece= nt work >> from Valerie Hazan and Outi Tuomainen at UCL about changes in speech >> production during a multi-person task. Hazan gave a talk I recall where = the >> younger participants adopted multiple different alternate speaking style= s >> when prompted to repeat, whereas the older participants had only one >> alternate. Can=E2=80=99t find that result (maybe I=E2=80=99m mis-remembe= ring =E2=80=93 this memory >> ain=E2=80=99t what it used to be), but these might be a good start >> >> https://www.sciencedirect.com/science/article/pii/S0378595517305166 >> https://asa.scitation.org/doi/10.1121/1.5053218 >> >> Also, there was some work showing how commercial ASR systems (chatbots) >> were attempting to use emotion detection, and changing their sequence if >> they detected anger/frustration in the news years back. But while the >> machines attempt new strategies, the human strategies seem less studied, >> unless it=E2=80=99s all being kept under lock-and-key by companies.=E2= =80=9D >> >> Sarah Ferguson also spoke to these points and offered additional >> references: >> =E2=80=9CSharon Oviatt had a couple of papers looking at something like = this >> (references below) =E2=80=93 talkers go into a =E2=80=9Chyperspeech=E2= =80=9D when a computer >> mis-recognizes what they say. >> >> Oviatt, S., Levow, G. A., Moreton, E., & MacEachern, M. (1998). Modeling >> global and focal hyperarticulation during human-computer error resolutio= n. >> Journal of the Acoustical Society of America, 104, 3080-3098. >> doi:10.1121/1.423888 >> >> Oviatt, S., MacEachern, M., & Levow, G. A. (1998). Predicting >> hyperarticulate speech during human-computer error resolution. Speech >> Communication, 24, 87-110. 
>>
>> I don't know if the idea came from these papers or if I heard it somewhere else, but speech recognizers are (or used to be) trained up on citation-style speech, so hyperarticulation should make speech recognition worse. I've been surprised by how well Siri does and will sometimes try to "mess with" it to see how it behaves."
>>
>> And for those who are interested in learning more about this topic, Phil Green sent information about an upcoming workshop in Glasgow, http://speech-interaction.org/chi2019/ , while Dana Urbanski suggested searching under "computer-directed speech" and Katherine Marcoux referenced a comprehensive review article that includes machine-directed speech in the context of other speech modifications (infant-directed speech, Lombard speech, clear speech, etc.):
>>
>> Cooke, M., King, S., Garnier, M., & Aubanel, V. (2014). The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language, 28(2), 543-571.
>>
>> David Pisoni recommended another useful perspective paper by Liz Shriberg:
>>
>> Shriberg, E. (2005). Spontaneous speech: How people really talk and why engineers should care. INTERSPEECH. https://web.stanford.edu/class/cs424p/shriberg05.pdf
>>
>> Finally, adjustments in speech directed at machines can also reflect emotions and feelings that speakers experience during the interaction.
>>
>> Elle O'Brien also suggested this interesting paper "about how children express politeness and frustration at machines (in this study, the computer is actually controlled by an experimenter, but kids think it is voice-controlled). http://www-bcf.usc.edu/~dbyrd/euro01-chaimkids_ed.pdf"
>>
>> So it appears that speech production adjustments people make when speaking to machines are a combination of efforts to improve recognition by machines (which may or may not be optimal), as well as possible affect-related production changes that arise dynamically in response to errors, the history of interactions with a specific device, and one's appraisal of the machine listener's capabilities (this SNL Alexa skit captures some of these challenges: https://www.youtube.com/watch?v=YvT_gqs5ETk )... At least while we are working out the kinks in computer speech recognition and finding better ways to talk to machines, we are guaranteed a good and reliable supply of amusement and entertainment.
>>
>> Thanks to everyone who responded!
>>
>> Valeriy
>>
>> ---original post-----------
>>
>> On Sun, Feb 3, 2019 at 11:16 AM Valeriy Shafiro <firosha@xxxxxxxx> wrote:
>>
>>> Dear list,
>>>
>>> I am wondering if anyone has any references or suggestions about this question. These days I hear more and more people talking to machines, e.g. Siri, Google, Alexa, etc., and doing it in more and more places. Automatic speech recognition has improved tremendously, but still it seems to me that when people talk to machines they often switch into a different production mode. At times it may sound like talking to a (large) dog, and sometimes like talking to a customer service agent in a land far away who is diligently trying to follow a script rather than listen to what you are saying.
>>> And I wonder whether adjustments that people make in their speech production when talking with machines in that mode are in fact optimal for improving recognition accuracy. Since machines are not processing speech in the same way as humans, I wonder if changes in speech production that make speech more recognizable for other people (or even pets) are always the same as they are for machines. In other words, do people tend to make the most optimal adjustments to make their speech more recognizable to machines? Or is it more like falling back on clear speech modes that work with other kinds of listeners (children, nonnative speakers, pets), or something in between?
>>>
>>> I realize there is a lot to this question, but perhaps people have started looking into it. I am happy to collate references and replies and send them to the list.
>>>
>>> Best,
>>>
>>> Valeriy


This message came from the mail archive
src/postings/2019/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University