Subject: [AUDITORY] Summary: How we talk to machines that listen
From: Valeriy Shafiro <firosha@xxxxxxxx>
Date: Wed, 20 Feb 2019 14:50:42 -0600
List-Archive: <http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Hello List,

I recently asked on this list about how we talk to machines that listen. Specifically, whether the adjustments people make in their speech when talking to machines are in fact optimal for improving recognition. Here is a summary of the replies to my post.

Several people commented on the importance of expectations about machine listening that affect speakers' production, in terms of trying to optimize recognition based on what one believes the machine "needs" to hear.

Elle O'Brien observed that she tends to use clear speech in order to maximize information in the acoustic signal:
"I believe I talk in clear speech to my devices because I anticipate that automatic speech recognition can barely make use of contextual/semantic/visual clues, and so I need to maximize the information present in the acoustic signal. Because I really expect that's all the machine can go on. I think of it more as the kind of speech I'd use in an adverse listening environment, or with a non-native speaker, than a dog or a child."

Sam Fisher shared a developer's perspective based on his experience programming Alexa and Google Assistant:
"I have so far experienced that much of a typical user's training for how to act with a conversational agent is contextual to dealing with forms and computers more so than dogs or people. However, I intuit that there is a "nearest neighbor" effect at play, in which a user draws on their most closely analogous experience in order to inform their production. Therefore the reference point may vary geographically and generationally.

Most of the users I program for have had to deal with annoying touchtone phone systems that present us with a discrete set of outcomes and navigate us through a "flowchart." I believe the user's common assumptions come from a rough understanding that they are being pathed through such a flowchart and just need to say the right thing to get what they want.

In the design of next-generation conversational interfaces that are a bit more capable of keeping up with a natural flow of dialogue, we try to disrupt the user's expectations from the voice interfaces of the past, which were unable to really use all the advantages of "conversation" as a means to efficient, cooperative, and frictionless information exchange. The user is trained to become comfortable speaking to it in a somewhat "easier" manner, putting more of the burden of comprehension on the machine conversant."

Leon van Noorden, though not speaking about computer-directed speech specifically, also highlighted how speakers' expectations about the listener can lead to maladaptive adjustments in speech production:
"My wife is deaf since birth and an extremely good lipreader. She notices quite often that, when she says she is deaf, people go into an exaggerated speech mode that does not help her to understand them."

As these examples illustrate, the adjustments in speech production that people make to improve speech recognition may or may not be optimal.
While some adjustments may intuitively seem beneficial, like speaking louder to improve SNR, separating words and speaking slower to aid segmentation, or hyperarticulating specific sounds, there are reasons to believe they may not always help. Or rather, that it depends. Just as with other clear speech modes, speakers may also differ considerably in what adjustments they make.

Bill Whitmer wrote:
"There's been some recent work on ASR and vocal effort by Ricard Marxer while he was at Sheffield (currently at Marseilles) on this. As your question suggests, they did find that when trained on normal-effort speech, the ASR fails with Lombard speech. But then when trained on material matched for vocal effort, the Lombard speech is better recognised. There was a whole lotta variance based on speaker, though - not too surprising.

https://www.sciencedirect.com/science/article/pii/S0167639317302674

I'd suspect that the Googles and Amazons have been trying to tackle this for a while, but perhaps the variance keeps them from announcing some grand solution. Your point about people adopting non-optimal strategies is most likely true. It also might have an age factor. There's been recent work from Valerie Hazan and Outi Tuomainen at UCL about changes in speech production during a multi-person task. Hazan gave a talk I recall where the younger participants adopted multiple different alternate speaking styles when prompted to repeat, whereas the older participants had only one alternate. Can't find that result (maybe I'm mis-remembering - this memory ain't what it used to be), but these might be a good start:

https://www.sciencedirect.com/science/article/pii/S0378595517305166
https://asa.scitation.org/doi/10.1121/1.5053218

Also, there was some work in the news a few years back showing how commercial ASR systems (chatbots) were attempting to use emotion detection and changing their sequence if they detected anger/frustration. But while the machines attempt new strategies, the human strategies seem less studied, unless it's all being kept under lock and key by companies."
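(As a side note for anyone who wants to poke at this kind of training/test mismatch themselves: a rough first step is simply to score one recognizer on matched sentences produced with normal effort and in Lombard conditions, and compare word error rates. The sketch below is only an illustration of that idea, not the setup used in the study Bill cites. It assumes you already have paired WAV files and reference transcripts; the file names are invented, the free Google recognizer in the SpeechRecognition package stands in for "the ASR", and jiwer does the WER scoring.)

# A minimal sketch, not anyone's actual pipeline: compare word error rate (WER)
# for matched sentences recorded as normal-effort vs. Lombard speech.
# Assumptions: the WAV paths and reference transcripts below are invented
# examples; SpeechRecognition's Google web recognizer is just a stand-in.

import jiwer                      # WER scoring
import speech_recognition as sr   # generic front end to several recognizers

recognizer = sr.Recognizer()

def transcribe(wav_path):
    """Recognize one WAV file; return an empty string if nothing is recognized."""
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""

def corpus_wer(pairs):
    """pairs: list of (wav_path, reference_transcript) tuples."""
    references = [ref for _, ref in pairs]
    hypotheses = [transcribe(wav) for wav, _ in pairs]
    return jiwer.wer(references, hypotheses)

# Hypothetical test sets: the same sentences produced in quiet (normal effort)
# and in noise (Lombard speech).
normal_set = [("normal/utt01.wav", "turn on the kitchen lights")]
lombard_set = [("lombard/utt01.wav", "turn on the kitchen lights")]

print("WER, normal effort :", corpus_wer(normal_set))
print("WER, Lombard speech:", corpus_wer(lombard_set))

If the recognizer was trained mostly on normal-effort speech, one would expect the Lombard set to come back with the higher error rate, in line with the matched-training result above; whether that holds for any particular system is of course an empirical question.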
Sarah Ferguson also spoke to these points and offered additional references:
"Sharon Oviatt had a couple of papers looking at something like this (references below) - talkers go into a "hyperspeech" when a computer mis-recognizes what they say.

Oviatt, S., Levow, G. A., Moreton, E., & MacEachern, M. (1998). Modeling global and focal hyperarticulation during human-computer error resolution. Journal of the Acoustical Society of America, 104, 3080-3098. doi:10.1121/1.423888

Oviatt, S., MacEachern, M., & Levow, G. A. (1998). Predicting hyperarticulate speech during human-computer error resolution. Speech Communication, 24, 87-110. doi:10.1016/S0167-6393(98)00005-3

I don't know if the idea came from these papers or if I heard it somewhere else, but speech recognizers are (or used to be) trained up on citation-style speech, so hyperarticulation should make speech recognition worse. I've been surprised by how well Siri does and will sometimes try to "mess with" it to see how it behaves."

And for those interested in learning more about this topic, Phil Green sent information about an upcoming workshop in Glasgow (http://speech-interaction.org/chi2019/), Dana Urbanski suggested searching under "computer-directed speech", and Katherine Marcoux referenced a comprehensive review article that covers machine-directed speech alongside other speech modifications (infant-directed speech, Lombard speech, clear speech, etc.):

Cooke, M., King, S., Garnier, M., & Aubanel, V. (2014). The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language, 28(2), 543-571.

David Pisoni recommended another useful perspective paper by Liz Shriberg:

Shriberg, E. (2005). Spontaneous speech: How people really talk and why engineers should care. INTERSPEECH. https://web.stanford.edu/class/cs424p/shriberg05.pdf

Finally, adjustments in speech directed at machines can also reflect emotions and feelings that speakers experience during the interaction.

Elle O'Brien also suggested this interesting paper "about how children express politeness and frustration at machines (in this study, the computer is actually controlled by an experimenter, but kids think it is voice-controlled)": http://www-bcf.usc.edu/~dbyrd/euro01-chaimkids_ed.pdf

So it appears that the speech production adjustments people make when speaking to machines are a combination of efforts to improve recognition by machines (which may or may not be optimal) and affect-related production changes that arise dynamically in response to errors, the history of interactions with a specific device, and one's appraisal of the machine listener's capabilities (this SNL Alexa skit captures some of these challenges: https://www.youtube.com/watch?v=YvT_gqs5ETk). At least while we are working out the kinks in computer speech recognition and finding better ways to talk to machines, we are guaranteed a good and reliable supply of amusement and entertainment.

Thanks to everyone who responded!

Valeriy

---original post-----------

On Sun, Feb 3, 2019 at 11:16 AM Valeriy Shafiro <firosha@xxxxxxxx> wrote:

> Dear list,
>
> I am wondering if anyone has any references or suggestions about this
> question. These days I hear more and more people talking to machines,
> e.g. Siri, Google, Alexa, etc., and doing it in more and more places.
> Automatic speech recognition has improved tremendously, but still it seems
> to me that when people talk to machines they often switch into a different
> production mode. At times it may sound like talking to a (large) dog,
> and sometimes like talking to a customer service agent in a land far away
> who is diligently trying to follow a script rather than listen to what you
> are saying. And I wonder whether adjustments that people make in their
> speech production when talking with machines in that mode are in fact
> optimal for improving recognition accuracy. Since machines are not
> processing speech in the same way as humans, I wonder if changes in speech
> production that make speech more recognizable for other people (or even
> pets) are always the same as they are for machines. In other words, do
> people tend to make the most optimal adjustments to make their speech more
> recognizable to machines?
> Or is it more like falling back on clear speech
> modes that work with other kinds of listeners (children, non-native
> speakers, pets), or something in between?
>
> I realize there is a lot to this question, but perhaps people have started
> looking into it. I am happy to collate references and replies and send to
> the list.
>
> Best,
>
> Valeriy