Re: [AUDITORY] Summary: How we talk to machines that listen (Peter Lennox)


Subject: Re: [AUDITORY] Summary: How we talk to machines that listen
From:    Peter Lennox  <0000009461c1dbf1-dmarc-request@xxxxxxxx>
Date:    Wed, 27 Feb 2019 14:29:12 +0000
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Hi Richard

You said: I don't understand why Kent says "The core mechanisms of ASA (integration, segmentation, sequential grouping, etc.) can only exist in embodied, conscious organisms."

I would guess that he's taking an 'embodied perception' stance. So 'listening' is a proactive activity, rather than a passive reception of signals and subsequent signal processing (Clark and Chalmers called us 'proactive predictavores').

Here's Karl Popper, 1978:

"...However, I do not think that we shall be able to construct conscious beings without first constructing living organisms; and this seems to be difficult enough. Consciousness has a biological function in animals. It does not seem to me at all likely that a machine can be conscious unless it needs consciousness. Even we ourselves fall asleep when our consciousness has no function to fulfil. Thus unless we succeed in creating life artificially, life aiming at long-term survival; and more than that, artificial self-moving animals that require a kind of pilot, I do not think that conscious artificial intelligence will become a reality..."

"Three Worlds", Karl Popper, The Tanner Lecture on Human Values, delivered at The University of Michigan, April 7, 1978, p. 165

Of course that's an (admittedly eminent) opinion, and Popper uses 'consciousness' as the yardstick. But the question of whether 'consciousness' is needed to have 'perception' is still open. For myself, I tend to think of consciousness as merely the functional extension of perception, a kind of 'hyper-perception'.

So, you have listening if you have perception, and you have perception if you have a perceiving entity. Experience to date tells us that perceiving entities embody intention and the means to continue (which is kind of what perception is).

So if we ever conclude that Alexa is actually listening, we'd have to concede that Alexa is an entity; Skynet come true 😉

cheers
ppl

Dr. Peter Lennox SFHEA
Senior Lecturer in Perception
College of Arts, Humanities and Education
School of Arts
e: p.lennox@xxxxxxxx
t: 01332 593155
https://derby.academia.edu/peterlennox
https://www.researchgate.net/profile/Peter_Lennox
University of Derby, Kedleston Road, Derby, DE22 1GB, UK

________________________________
From: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxx> on behalf of Richard F. Lyon <dicklyon@xxxxxxxx>
Sent: 24 February 2019 19:11:51
To: AUDITORY@xxxxxxxx
Subject: Re: Summary: How we talk to machines that listen

"Listening" seems more purposeful, intelligently directed, and ecological than what our current technologies do, which is one reason I've preferred "machine hearing", which sounds a little more passive, or "bottom-up". But both are fine.

My book subtitle "Extracting Meaning from Sound" is a bit more on the active side, but I carefully explain that by "meaning" I mean information that helps in doing a task or application, not necessarily an intelligent or ecological or cultural one. We'll advance towards those, too, of course.

I don't understand why Kent says "The core mechanisms of ASA (integration, segmentation, sequential grouping, etc.) can only exist in embodied, conscious organisms."

Dick

On Sat, Feb 23, 2019 at 9:10 PM Valeriy Shafiro <firosha@xxxxxxxx> wrote:

Dear Kent,

Thank you for these ideas.
I agree these are worth considering, even if I do not necessarily agree with your specific take on things. Based on my own reading of Gibson and other ecological psychologists, I don't necessarily think "machine listening" is as much of a terminological or metaphysical trap as it may seem from your post. We can argue for a long time about whether using "remembering" is better than "memory", but at some point it may be better to move on with the terms which, however imperfect, can still be useful for the problems you are trying to solve. Similarly, I used listening in my post as a shorthand for machines that detect and respond to speech and that people can interact with through speech. Of course, as we move on with new technologies, there is a lot of room for ambiguity when we talk about machines that can perform more and more human-like actions. However, most people find it acceptable to talk about sunrise and sunset, while knowing that both are due to the rotation of the Earth rather than the Sun going up and down, or to say "the weather has a mind of its own" without really thinking that the weather is conscious. So, in this particular case I am not convinced that talking about listening machines, and related speech production changes, represents either a symptom of "a time when numbers and funding supersede scientific curiosity" or "an effort to imbue machines with the qualities of 'life', 'consciousness' and 'humanity' which we know they do not and cannot have". That said, how we talk about machines we talk to is a very interesting subject in its own right.

Best,

Valeriy

On Sat, Feb 23, 2019 at 4:06 PM Kent Walker <kent.walker@xxxxxxxx> wrote:

From the perspective of auditory scene analysis (ASA), notions of "machine listening" and "computational auditory scene analysis" are premature. The core mechanisms of ASA (integration, segmentation, sequential grouping, etc.) can only exist in embodied, conscious organisms.

I think Al Bregman would agree that there is no listening without ecology, and there is no listening without culture. His work is both a homage to and a rebellion against Gibson's post-behaviourist ecological psychology. The use of "machine listening" and "computational auditory scene analysis" from this scientific perspective is disingenuous.

There are many leading researchers in the areas of embodied cognition and ecological psychoacoustics on this list. That they remain quiet when words which are important to ASA are misused signals to me that we have entered a period similar to that when behaviourism was king: a time when numbers and funding supersede scientific curiosity.

From: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxx> on behalf of Valeriy Shafiro <firosha@xxxxxxxx>
Reply-To: Valeriy Shafiro <firosha@xxxxxxxx>
Date: Wednesday, 20 February, 2019 1:50 PM
To: "AUDITORY@xxxxxxxx" <AUDITORY@xxxxxxxx>
Subject: Summary: How we talk to machines that listen

Hello List,

I recently asked on this list about how we talk to machines that listen. Specifically, whether adjustments which people make in their speech when talking to machines are in fact optimal for improving recognition. Here is a summary of replies to my post.
Several people commented on the importance of expectations about machine listening that affect speakers' production, in terms of trying to optimize recognition based on what one believes the machine "needs" to hear.

Elle O'Brien observed that she tends to use clear speech in order to maximize information in the acoustic signal:

"I believe I talk in clear speech to my devices because I anticipate that automatic speech recognition can barely make use of contextual/semantic/visual clues, and so I need to maximize the information present in the acoustic signal. Because I really expect that's all the machine can go on. I think of it more as the kind of speech I'd use in an adverse listening environment, or with a non-native speaker, than a dog or a child."

Sam Fisher shared a developer's perspective based on his experience programming Alexa and Google Assistant:

"I have so far experienced that much of a typical user's training for how to act with a conversational agent is contextual to dealing with forms and computers more so than dogs or people. However, I intuit that there is a 'nearest neighbor' effect at play, in which a user draws on their most closely analogous experience in order to inform their production. Therefore the reference point may vary geographically and generationally.

Most of the users I program for have had to deal with annoying touchtone phone systems that present us with a discrete set of outcomes and navigate us through a 'flowchart.' I believe the user's common assumptions come from a rough understanding that they are being pathed through such a flowchart and just need to say the right thing to get what they want.

In the design of next-generation conversational interfaces that are a bit more capable of keeping up with a natural flow of dialogue, we try to disrupt the user's expectations from the voice interfaces of the past that were unable to really use all the advantages of 'conversation' as a means to efficient, cooperative, and frictionless information exchange. The user is trained to become comfortable speaking to it in a somewhat 'easier' manner, putting more burden of comprehension on the machine conversant."

Leon van Noorden, though not speaking about computer-directed speech specifically, also highlighted the point about speaker expectations about the listener, which can lead to maladaptive adjustments in speech production:

"My wife has been deaf since birth and is an extremely good lipreader. She notices quite often that, when she says she is deaf, people go into an exaggerated speech mode that does not help her to understand them."

As these examples illustrate, the adjustments in speech production that people make to improve speech recognition may or may not be optimal. While some adjustments may intuitively seem beneficial, like speaking louder to improve SNR, separating words and speaking slower to aid segmentation, or hyperarticulating specific sounds, there are some reasons to believe that it may not always help. Or rather, that it depends. Just as with other clear speech models, speakers may also differ considerably in what adjustments they make.

Bill Whitmer wrote:

"There's been some recent work on ASR and vocal effort by Ricard Marxer while he was at Sheffield (currently at Marseilles) on this. As your question suggests, they did find that when trained on normal-effort speech, the ASR fails with Lombard speech.
But then when trained on material matched for vocal effort, the Lombard speech is better recognised. There was a whole lotta variance based on speaker, though – not too surprising.

https://www.sciencedirect.com/science/article/pii/S0167639317302674

I'd suspect that the Googles and Amazons have been trying to tackle this for a while, but perhaps the variance keeps them from announcing some grand solution. Your point about people adopting non-optimal strategies is most likely true. It also might have an age factor. There's been recent work from Valerie Hazan and Outi Tuomainen at UCL about changes in speech production during a multi-person task. Hazan gave a talk I recall where the younger participants adopted multiple different alternate speaking styles when prompted to repeat, whereas the older participants had only one alternate. Can't find that result (maybe I'm mis-remembering – this memory ain't what it used to be), but these might be a good start:

https://www.sciencedirect.com/science/article/pii/S0378595517305166
https://asa.scitation.org/doi/10.1121/1.5053218

Also, there was some work in the news a few years back showing how commercial ASR systems (chatbots) were attempting to use emotion detection, and changing their sequence if they detected anger/frustration. But while the machines attempt new strategies, the human strategies seem less studied, unless it's all being kept under lock and key by companies."

Sarah Ferguson also spoke to these points and offered additional references:

"Sharon Oviatt had a couple of papers looking at something like this (references below) – talkers go into a 'hyperspeech' when a computer mis-recognizes what they say.

Oviatt, S., Levow, G. A., Moreton, E., & MacEachern, M. (1998). Modeling global and focal hyperarticulation during human-computer error resolution. Journal of the Acoustical Society of America, 104, 3080-3098. doi:10.1121/1.423888

Oviatt, S., MacEachern, M., & Levow, G. A. (1998). Predicting hyperarticulate speech during human-computer error resolution. Speech Communication, 24, 87-110. doi:10.1016/S0167-6393(98)00005-3

I don't know if the idea came from these papers or if I heard it somewhere else, but speech recognizers are (or used to be) trained up on citation-style speech, so hyperarticulation should make speech recognition worse. I've been surprised by how well Siri does and will sometimes try to 'mess with' it to see how it behaves."

And for those who are interested to learn more about this topic, Phil Green sent information about an upcoming workshop in Glasgow (http://speech-interaction.org/chi2019/), while Dana Urbanski suggested searching under "computer-directed speech" and Katherine Marcoux referenced a comprehensive review article that includes machine-directed speech in the context of other speech modifications (infant-directed speech, Lombard speech, clear speech, etc.):

Cooke, M., King, S., Garnier, M., & Aubanel, V. (2014). The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language, 28(2), 543-571.

David Pisoni recommended another useful perspective paper by Liz Shriberg:

Shriberg, E. (2005). Spontaneous speech: How people really talk and why engineers should care. INTERSPEECH.
https://web.stanford.edu/class/cs424p/shriberg05.pdf

Finally, adjustments in speech directed at machines can also reflect emotions and feelings that speakers experience during the interaction.

Elle O'Brien also suggested this interesting paper "about how children express politeness and frustration at machines (in this study, the computer is actually controlled by an experimenter, but kids think it is voice-controlled)": http://www-bcf.usc.edu/~dbyrd/euro01-chaimkids_ed.pdf

So it appears that the speech production adjustments people make when speaking to machines are a combination of efforts to improve recognition by machines (which may or may not be optimal) and possible affect-related production changes that arise dynamically in response to errors, the history of interactions with a specific device, and one's appraisal of the machine listener's capabilities (this SNL Alexa skit captures some of these challenges: https://www.youtube.com/watch?v=YvT_gqs5ETk )... At least while we are working out the kinks in computer speech recognition and finding better ways to talk to machines, we are guaranteed a good and reliable supply of amusement and entertainment.

Thanks to everyone who responded!

Valeriy

---original post-----------

On Sun, Feb 3, 2019 at 11:16 AM Valeriy Shafiro <firosha@xxxxxxxx> wrote:

Dear list,

I am wondering if anyone has any references or suggestions about this question. These days I hear more and more people talking to machines, e.g. Siri, Google, Alexa, etc., and doing it in more and more places. Automatic speech recognition has improved tremendously, but it still seems to me that when people talk to machines they often switch into a different production mode. At times it may sound like talking to a (large) dog, and sometimes like talking to a customer service agent in a land far away who is diligently trying to follow a script rather than listen to what you are saying. And I wonder whether the adjustments that people make in their speech production when talking with machines in that mode are in fact optimal for improving recognition accuracy. Since machines are not processing speech in the same way as humans, I wonder if changes in speech production that make speech more recognizable for other people (or even pets) are always the same as they are for machines. In other words, do people tend to make the most optimal adjustments to make their speech more recognizable to machines? Or is it more like falling back on clear speech modes that work with other kinds of listeners (children, non-native speakers, pets), or something in between?

I realize there is a lot to this question, but perhaps people have started looking into it. I am happy to collate references and replies and send them to the list.

Best,

Valeriy

