Re: [AUDITORY] Summary: How we talk to machines that listen (Valeriy Shafiro)


Subject: Re: [AUDITORY] Summary: How we talk to machines that listen
From:    Valeriy Shafiro  <firosha@xxxxxxxx>
Date:    Sat, 23 Feb 2019 17:39:05 -0600
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Dear Kent,

Thank you for these ideas. I agree these are worth considering, even if I do not necessarily agree with your specific take on things. Based on my own reading of Gibson and other ecological psychologists, I don't necessarily think "machine listening" is as much of a terminological or metaphysical trap as it may seem from your post. We can argue for a long time about whether using "remembering" is better than "memory", but at some point it may be better to move on with the terms which, however imperfect, can still be useful for the problems you are trying to solve. Similarly, I used "listening" in my post as a shorthand for machines that detect and respond to speech and that people can interact with through speech. Of course, as we move on with new technologies, there is a lot of room for ambiguity when we talk about machines that can perform more and more human-like actions. However, most people find it acceptable to talk about sunrise and sunset, while knowing that both are due to the rotation of the Earth rather than the Sun going up and down, or to say "the weather has a mind of its own" without really thinking that the weather is conscious. So, in this particular case I am not convinced that talking about listening machines, and the related speech production changes, represents either a symptom of "a time when numbers and funding supersedes scientific curiosity" or "an effort to imbue machines with the qualities of 'life', 'consciousness' and 'humanity' which we know they do not and cannot have."

That said, how we talk *about* machines we talk to is a very interesting subject in its own right.

Best,

Valeriy

On Sat, Feb 23, 2019 at 4:06 PM Kent Walker <kent.walker@xxxxxxxx> wrote:

> From the perspective of auditory scene analysis (ASA), notions of
> "machine listening" and "computational auditory scene analysis" are
> premature.
> The core mechanisms of ASA (integration, segmentation,
> sequential grouping, etc.) can only exist in embodied, conscious organisms.
>
> I think Al Bregman would agree that there is no listening without
> ecology, and there is no listening without culture. His work is both
> homage-to and a rebellion-against Gibson's post-behaviourist ecological
> psychology. The use of "machine listening" and "computational auditory
> scene analysis" from this scientific perspective is disingenuous.
>
> There are many leading researchers in the areas of embodied cognition and
> ecological psychoacoustics on this list. That they remain quiet when words
> which are important to ASA are misused signals to me that we have entered a
> period of time similar to that when behaviourism was king: a time when
> numbers and funding supersedes scientific curiosity.
>
> From: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxx>
> on behalf of Valeriy Shafiro <firosha@xxxxxxxx>
> Reply-To: Valeriy Shafiro <firosha@xxxxxxxx>
> Date: Wednesday, 20 February, 2019 1:50 PM
> To: "AUDITORY@xxxxxxxx" <AUDITORY@xxxxxxxx>
> Subject: Summary: How we talk to machines that listen
>
> Hello List,
>
> I recently asked on this list about how we talk to machines that listen.
> Specifically, whether the adjustments which people make in their speech when
> talking to machines are in fact optimal for improving recognition. Here is
> a summary of replies to my post.
>
> Several people commented on the importance of expectations about machine
> listening that affect speakers' production, in terms of trying to optimize
> recognition based on what one believes the machine "needs" to hear.
>
> Elle O'Brien observed that she tends to use clear speech in order to
> maximize information in the acoustic signal.
> "I believe I talk in clear speech to my devices because I anticipate that
> automatic speech recognition can barely make use of
> contextual/semantic/visual clues, and so I need to maximize the information
> present in the acoustic signal. Because I really expect that's all the
> machine can go on. I think of it more as the kind of speech I'd use in an
> adverse listening environment, or with a non-native speaker, than a dog or
> a child."
>
> Sam Fisher shared a developer's perspective based on his experience
> programming Alexa and Google Assistant:
> "I have so far experienced that much of a typical user's training for how
> to act with a conversational agent is contextual to dealing with forms and
> computers more so than dogs or people. However, I intuit that there is a
> "nearest neighbor" effect at play, in which a user draws on their most
> closely analogous experience in order to inform their production. Therefore
> the reference point may vary geographically and generationally.
>
> Most of the users I program for have had to deal with annoying touchtone
> phone systems that present us with a discrete set of outcomes and navigate
> us through a "flowchart." I believe the user's common assumptions come from
> a rough understanding that they are being pathed through such a flowchart
> and just need to say the right thing to get what they want.
>
> In the design of next-generation conversational interfaces that are a bit
> more capable of keeping up with a natural flow of dialogue, we try to
> disrupt the user's expectations from the voice interfaces of the past that
> were unable to really use all the advantages of "conversation" as a means
> to efficient, cooperative, and frictionless information exchange.
> The user
> is trained to become comfortable speaking to it in a somewhat "easier"
> manner, putting more burden of comprehension on the machine conversant."
>
> Leon van Noorden, though not speaking about computer-directed speech
> specifically, also highlighted the point about speaker expectations about
> the listener that can lead to maladaptive adjustments in speech production.
> "My wife has been deaf since birth and is an extremely good lipreader. She notices
> quite often that, when she says she is deaf, people go into an exaggerated
> speech mode that does not help her to understand them."
>
> As these examples illustrate, the adjustments in speech production that
> people make to improve speech recognition may or may not be optimal.
> While some adjustments may intuitively seem beneficial, like speaking
> louder to improve SNR, separating words and speaking slower to aid
> segmentation, or hyperarticulating specific sounds, there are some reasons
> to believe that it may not always help. Or rather, that it depends. Just
> like with other clear speech models, speakers also may differ considerably
> in what adjustments they make.
>
> Bill Whitmer wrote:
> "There's been some recent work on ASR and vocal effort by Ricard Marxer
> while he was at Sheffield (currently at Marseilles) on this. As your
> question suggests, they did find that when trained on normal-effort speech,
> the ASR fails with Lombard speech. But then when trained on material
> matched for vocal effort, the Lombard speech is better recognised. There
> was a whole lotta variance based on speaker, though; not too surprising.
>
> https://www.sciencedirect.com/science/article/pii/S0167639317302674
>
> I'd suspect that the Googles and Amazons have been trying to tackle this
> for a while, but perhaps the variance keeps them from announcing some grand
> solution.
> Your point about people adopting non-optimal strategies is most
> likely true. It also might have an age factor. There's been recent work
> from Valerie Hazan and Outi Tuomainen at UCL about changes in speech
> production during a multi-person task. Hazan gave a talk I recall where the
> younger participants adopted multiple different alternate speaking styles
> when prompted to repeat, whereas the older participants had only one
> alternate. Can't find that result (maybe I'm mis-remembering; this memory
> ain't what it used to be), but these might be a good start:
>
> https://www.sciencedirect.com/science/article/pii/S0378595517305166
> https://asa.scitation.org/doi/10.1121/1.5053218
>
> Also, there was some work in the news a few years back showing how commercial
> ASR systems (chatbots) were attempting to use emotion detection, and changing
> their sequence if they detected anger/frustration. But while the
> machines attempt new strategies, the human strategies seem less studied,
> unless it's all being kept under lock-and-key by companies."
>
> Sarah Ferguson also spoke to these points and offered additional
> references:
> "Sharon Oviatt had a couple of papers looking at something like this
> (references below): talkers go into a "hyperspeech" when a computer
> mis-recognizes what they say.
>
> Oviatt, S., Levow, G. A., Moreton, E., & MacEachern, M. (1998). Modeling
> global and focal hyperarticulation during human-computer error resolution.
> Journal of the Acoustical Society of America, 104, 3080-3098.
> doi:10.1121/1.423888
>
> Oviatt, S., MacEachern, M., & Levow, G. A. (1998). Predicting
> hyperarticulate speech during human-computer error resolution. Speech
> Communication, 24, 87-110.
> doi:10.1016/S0167-6393(98)00005-3
>
> I don't know if the idea came from these papers or if I heard it somewhere
> else, but speech recognizers are (or used to be) trained up on
> citation-style speech, so hyperarticulation should make speech recognition
> worse. I've been surprised by how well Siri does and will sometimes try to
> "mess with" it to see how it behaves."
>
> And for those who are interested to learn more about this topic, Phil
> Green sent information about an upcoming workshop in Glasgow
> http://speech-interaction.org/chi2019/ , while Dana Urbanski suggested
> searching under "computer-directed speech" and Katherine Marcoux referenced
> a comprehensive review article that includes machine-directed speech in the
> context of "other speech modifications (infant-directed speech, Lombard
> speech, clear speech, etc.)":
>
> Cooke, M., King, S., Garnier, M., & Aubanel, V. (2014). The listening
> talker: A review of human and algorithmic context-induced modifications of
> speech. Computer Speech & Language, 28(2), 543-571.
>
> David Pisoni recommended another useful perspective paper by Liz Shriberg:
>
> Shriberg, E. (2005). Spontaneous speech: How people really talk and why
> engineers should care. INTERSPEECH.
> https://web.stanford.edu/class/cs424p/shriberg05.pdf
>
> Finally, adjustments in speech directed at machines can also reflect
> emotions and feelings that speakers experience during the interaction.
>
> Elle O'Brien also suggested this interesting paper "about how children
> express politeness and frustration at machines (in this study, the computer
> is actually controlled by an experimenter, but kids think it is
> voice-controlled).
> http://www-bcf.usc.edu/~dbyrd/euro01-chaimkids_ed.pdf"
>
> So it appears that the speech production adjustments people make when speaking
> to machines are a combination of efforts to improve recognition by machines
> (which may or may not be optimal), as well as possible affect-related
> production changes that arise dynamically in response to errors, the
> history of interactions with a specific device, and one's appraisal of the
> machine listener's capabilities (this SNL Alexa skit captures some of these
> challenges: https://www.youtube.com/watch?v=YvT_gqs5ETk )... At least
> while we are working out the kinks in computer speech recognition and
> finding better ways to talk to machines, we are guaranteed a good and
> reliable supply of amusement and entertainment.
>
> Thanks to everyone who responded!
>
> Valeriy
>
> ---original post-----------
>
> On Sun, Feb 3, 2019 at 11:16 AM Valeriy Shafiro <firosha@xxxxxxxx> wrote:
>
>> Dear list,
>>
>> I am wondering if anyone has any references or suggestions about this
>> question. These days I hear more and more people talking to machines,
>> e.g. Siri, Google, Alexa, etc., and doing it in more and more places.
>> Automatic speech recognition has improved tremendously, but still it seems
>> to me that when people talk to machines they often switch into a different
>> production mode. At times it may sound like talking to a (large) dog,
>> and sometimes like talking to a customer service agent in a land far away
>> who is diligently trying to follow a script rather than listen to what you
>> are saying. And I wonder whether the adjustments that people make in their
>> speech production when talking with machines in that mode are in fact
>> optimal for improving recognition accuracy.
>> Since machines are not
>> processing speech in the same way as humans, I wonder if changes in speech
>> production that make speech more recognizable for other people (or even
>> pets) are always the same as they are for machines. In other words, do
>> people tend to make the most optimal adjustments to make their speech more
>> recognizable to machines? Or is it more like falling back on clear speech
>> modes that work with other kinds of listeners (children, nonnative
>> speakers, pets), or something in between?
>>
>> I realize there is a lot to this question, but perhaps people have
>> started looking into it. I am happy to collate references and replies and
>> send them to the list.
>>
>> Best,
>>
>> Valeriy


This message came from the mail archive
src/postings/2019/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University