Re: [AUDITORY] Summary: How we talk to machines that listen (Kent Walker)


Subject: Re: [AUDITORY] Summary: How we talk to machines that listen
From:    Kent Walker  <kent.walker@xxxxxxxx>
Date:    Sat, 23 Feb 2019 22:06:53 +0000
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

From the perspective of auditory scene analysis (ASA), notions of "machine listening" and "computational auditory scene analysis" are premature. The core mechanisms of ASA (integration, segmentation, sequential grouping, etc.) can only exist in embodied, conscious organisms.

I think Al Bregman would agree that there is no listening without ecology, and there is no listening without culture. His work is both an homage to and a rebellion against Gibson's post-behaviourist ecological psychology. The use of "machine listening" and "computational auditory scene analysis" from this scientific perspective is disingenuous and represents an effort to imbue machines with the qualities of "life", "consciousness" and "humanity" which we know they do not and cannot have.

There are many leading researchers in the areas of embodied cognition and ecological psychoacoustics on this list. That they remain quiet when words which are important to ASA are misused signals to me that we have entered a period similar to that when behaviourism was king: a time when numbers and funding supersede scientific curiosity.

From: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxx> on behalf of Valeriy Shafiro <firosha@xxxxxxxx>
Reply-To: Valeriy Shafiro <firosha@xxxxxxxx>
Date: Wednesday, 20 February, 2019 1:50 PM
To: "AUDITORY@xxxxxxxx" <AUDITORY@xxxxxxxx>
Subject: Summary: How we talk to machines that listen

Hello List,

I recently asked on this list about how we talk to machines that listen; specifically, whether the adjustments people make in their speech when talking to machines are in fact optimal for improving recognition. Here is a summary of replies to my post.

Several people commented on the importance of expectations about machine listening that affect speakers' production, in terms of trying to optimize recognition based on what one believes the machine "needs" to hear.

Elle O'Brien observed that she tends to use clear speech in order to maximize information in the acoustic signal:

"I believe I talk in clear speech to my devices because I anticipate that automatic speech recognition can barely make use of contextual/semantic/visual clues, and so I need to maximize the information present in the acoustic signal. Because I really expect that's all the machine can go on. I think of it more as the kind of speech I'd use in an adverse listening environment, or with a non-native speaker, than a dog or a child."

Sam Fisher shared a developer's perspective based on his experience programming Alexa and Google Assistant:

"I have so far experienced that much of a typical user's training for how to act with a conversational agent is contextual to dealing with forms and computers more so than dogs or people. However, I intuit that there is a 'nearest neighbor' effect at play, in which a user draws on their most closely analogous experience in order to inform their production. Therefore the reference point may vary geographically and generationally.

Most of the users I program for have had to deal with annoying touchtone phone systems that present us with a discrete set of outcomes and navigate us through a 'flowchart.' I believe the user's common assumptions come from a rough understanding that they are being pathed through such a flowchart and just need to say the right thing to get what they want.

In the design of next-generation conversational interfaces that are a bit more capable of keeping up with a natural flow of dialogue, we try to disrupt the user's expectations from the voice interfaces of the past, which were unable to really use all the advantages of 'conversation' as a means to efficient, cooperative, and frictionless information exchange. The user is trained to become comfortable speaking to it in a somewhat 'easier' manner, putting more burden of comprehension on the machine conversant."
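As a minimal sketch, the "flowchart" interaction style Sam describes amounts to a small state machine: each state accepts only a fixed set of keywords and rejects everything else, so recognition succeeds only when the user's production collapses onto the handful of phrases the current state expects. The Python below is purely illustrative; the state names, prompts, and keywords are hypothetical and are not taken from Alexa, Google Assistant, or any other system mentioned here.

    # A toy "flowchart" voice menu: each state accepts only a fixed set of
    # keywords and rejects everything else. All state names, prompts, and
    # keywords are hypothetical, for illustration only.
    FLOWCHART = {
        "start":   {"prompt": "Say 'billing' or 'support'.",
                    "billing": "billing", "support": "support"},
        "billing": {"prompt": "Say 'balance' or 'agent'.",
                    "balance": "done", "agent": "done"},
        "support": {"prompt": "Say 'reset password' or 'agent'.",
                    "reset password": "done", "agent": "done"},
    }

    def route(utterances):
        """Walk the flowchart; off-script utterances are simply rejected."""
        state = "start"
        for said in utterances:
            node = FLOWCHART[state]
            print(node["prompt"], "->", said)
            next_state = node.get(said.lower().strip())
            if next_state is None:
                # The user must rephrase to one of the expected keywords.
                print("Sorry, I didn't get that.")
                continue
            state = next_state
            if state == "done":
                print("Transferring you now.")
                break

    route(["billing", "what's my balance?", "balance"])

The point of the toy is only that the rigidity users learn from touchtone systems comes from this structure: natural phrasings like "what's my balance?" fall outside the expected set and get rejected, while the bare keyword goes through.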
Leon van Noorden, though not speaking about computer-directed speech specifically, also highlighted the point that speaker expectations about the listener can lead to maladaptive adjustments in speech production:

"My wife has been deaf since birth and is an extremely good lipreader. She notices quite often that, when she says she is deaf, people go into an exaggerated speech mode that does not help her to understand them."

As these examples illustrate, the adjustments in speech production that people make to improve speech recognition may or may not be optimal. While some adjustments may intuitively seem beneficial, like speaking louder to improve SNR, separating words and speaking more slowly to aid segmentation, or hyperarticulating specific sounds, there are reasons to believe that they may not always help. Or rather, that it depends. Just as with other clear-speech models, speakers may also differ considerably in what adjustments they make.

Bill Whitmer wrote:

"There's been some recent work on ASR and vocal effort by Ricard Marxer while he was at Sheffield (currently at Marseilles) on this. As your question suggests, they did find that when trained on normal-effort speech, the ASR fails with Lombard speech. But then when trained on material matched for vocal effort, the Lombard speech is better recognised. There was a whole lotta variance based on speaker, though – not too surprising.

https://www.sciencedirect.com/science/article/pii/S0167639317302674

I'd suspect that the Googles and Amazons have been trying to tackle this for a while, but perhaps the variance keeps them from announcing some grand solution. Your point about people adopting non-optimal strategies is most likely true. It also might have an age factor. There's been recent work from Valerie Hazan and Outi Tuomainen at UCL about changes in speech production during a multi-person task. Hazan gave a talk I recall where the younger participants adopted multiple different alternate speaking styles when prompted to repeat, whereas the older participants had only one alternate. Can't find that result (maybe I'm mis-remembering – this memory ain't what it used to be), but these might be a good start:

https://www.sciencedirect.com/science/article/pii/S0378595517305166
https://asa.scitation.org/doi/10.1121/1.5053218

Also, some years back there was work in the news showing how commercial ASR systems (chatbots) were attempting to use emotion detection and changing their sequence if they detected anger/frustration. But while the machines attempt new strategies, the human strategies seem less studied, unless it's all being kept under lock-and-key by companies."
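To make the matched/mismatched comparison Bill describes concrete, a minimal sketch might score recognizer output separately for normal-effort and Lombard test material and compare word error rates, here using the jiwer package (pip install jiwer). The transcripts below are invented placeholders, not data from Marxer et al. or any study cited here; in practice the hypotheses would come from an ASR system trained on either matched or mismatched vocal-effort material.

    # Compare word error rate (WER) for a normal-effort vs. a Lombard test set.
    # Reference and hypothesis transcripts are invented placeholders.
    import jiwer

    references = {
        "normal":  ["turn on the kitchen light", "set a timer for ten minutes"],
        "lombard": ["turn on the kitchen light", "set a timer for ten minutes"],
    }
    hypotheses = {
        "normal":  ["turn on the kitchen light", "set a timer for ten minutes"],
        "lombard": ["turn on the kitten light", "set a time for ten minutes"],
    }

    for condition in ("normal", "lombard"):
        error_rate = jiwer.wer(references[condition], hypotheses[condition])
        print(f"{condition:8s} WER = {error_rate:.2f}")

Whether effort-matched training actually closes the gap between the two conditions is exactly the empirical question addressed in the paper Bill links above.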
Sarah Ferguson also spoke to these points and offered additional references:

"Sharon Oviatt had a couple of papers looking at something like this (references below) – talkers go into a 'hyperspeech' when a computer mis-recognizes what they say.

Oviatt, S., Levow, G. A., Moreton, E., & MacEachern, M. (1998). Modeling global and focal hyperarticulation during human-computer error resolution. Journal of the Acoustical Society of America, 104, 3080-3098. doi:10.1121/1.423888

Oviatt, S., MacEachern, M., & Levow, G. A. (1998). Predicting hyperarticulate speech during human-computer error resolution. Speech Communication, 24, 87-110. doi:10.1016/S0167-6393(98)00005-3

I don't know if the idea came from these papers or if I heard it somewhere else, but speech recognizers are (or used to be) trained up on citation-style speech, so hyperarticulation should make speech recognition worse. I've been surprised by how well Siri does and will sometimes try to 'mess with' it to see how it behaves."

And for those who are interested to learn more about this topic, Phil Green sent information about an upcoming workshop in Glasgow (http://speech-interaction.org/chi2019/), Dana Urbanski suggested searching under "computer-directed speech", and Katherine Marcoux referenced a comprehensive review article that covers machine-directed speech in the context of other speech modifications (infant-directed speech, Lombard speech, clear speech, etc.):

Cooke, M., King, S., Garnier, M., & Aubanel, V. (2014). The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language, 28(2), 543-571.

David Pisoni recommended another useful perspective paper by Liz Shriberg:

Shriberg, E. (2005). Spontaneous speech: How people really talk and why engineers should care. INTERSPEECH. https://web.stanford.edu/class/cs424p/shriberg05.pdf

Finally, adjustments in speech directed at machines can also reflect emotions and feelings that speakers experience during the interaction.

Elle O'Brien also suggested this interesting paper "about how children express politeness and frustration at machines (in this study, the computer is actually controlled by an experimenter, but kids think it is voice-controlled)": http://www-bcf.usc.edu/~dbyrd/euro01-chaimkids_ed.pdf

So it appears that the speech production adjustments people make when speaking to machines are a combination of efforts to improve recognition by machines (which may or may not be optimal) and possible affect-related production changes that arise dynamically in response to errors, the history of interactions with a specific device, and one's appraisal of the machine listener's capabilities (this SNL Alexa skit captures some of these challenges: https://www.youtube.com/watch?v=YvT_gqs5ETk )... At least while we are working out the kinks in computer speech recognition and finding better ways to talk to machines, we are guaranteed a good and reliable supply of amusement and entertainment.

Thanks to everyone who responded!

Valeriy

---original post-----------

On Sun, Feb 3, 2019 at 11:16 AM Valeriy Shafiro <firosha@xxxxxxxx> wrote:

Dear list,

I am wondering if anyone has any references or suggestions about this question. These days I hear more and more people talking to machines, e.g. Siri, Google, Alexa, etc., and doing it in more and more places. Automatic speech recognition has improved tremendously, but it still seems to me that when people talk to machines they often switch into a different production mode. At times it may sound like talking to a (large) dog, and sometimes like talking to a customer service agent in a land far away who is diligently trying to follow a script rather than listen to what you are saying. And I wonder whether the adjustments that people make in their speech production when talking with machines in that mode are in fact optimal for improving recognition accuracy. Since machines do not process speech in the same way as humans, I wonder if the changes in speech production that make speech more recognizable for other people (or even pets) are always the same as they are for machines. In other words, do people tend to make the most optimal adjustments to make their speech more recognizable to machines? Or is it more like falling back on clear speech modes that work with other kinds of listeners (children, nonnative speakers, pets), or something in between?

I realize there is a lot to this question, but perhaps people have started looking into it. I am happy to collate references and replies and send to the list.

Best,

Valeriy

