Re: [AUDITORY] Summary: How we talk to machines that listen

"Listening" seems more purposeful, intelligently directed, and ecological than what our current technologies do, which is one reason I've preferred "machine hearing", which sounds a little more passive, or "bottom-up". But both are fine.

My book subtitle "Extracting Meaning from Sound" is a bit more on the active side, but I carefully explain that by "meaning" I mean information that helps in doing a task or application, not necessarily an intelligent or ecological or cultural one. We'll advance towards those, too, of course.

I don't understand why Kent says "The core mechanisms of ASA (integration, segmentation, sequential group, etc) can only exist in embodied, conscious organisms."

Dick

On Sat, Feb 23, 2019 at 9:10 PM Valeriy Shafiro <firosha@xxxxxxxxx> wrote:

Dear Kent,

Thank you for these ideas. I agree these are worth considering, even if I do not necessarily agree with your specific take on things. Based on my own reading of Gibson and other ecological psychologists, I don't necessarily think "machine listening" is as much of a terminological or metaphysical trap as it may seem from your post. We can argue for a long time about whether using "remembering" is better than "memory", but at some point it may be better to move on with the terms which, however imperfect, can still be useful for the problems you are trying to solve. Similarly, I used "listening" in my post as a shorthand for machines that detect and respond to speech and that people can interact with through speech. Of course, as we move on with new technologies, there is a lot of room for ambiguity when we talk about machines that can perform more and more human-like actions. However, most people find it acceptable to talk about sunrise and sunset, while knowing that both are due to the rotation of the Earth rather than the Sun going up and down, or to say "the weather has a mind of its own" without really thinking that the weather is conscious. So, in this particular case I am not convinced that talking about listening machines and related speech production changes represents either a symptom of "a time when numbers and funding supersedes scientific curiosity" or "represents an effort to imbue machines with the qualities of "life", "consciousness" and "humanity" which we know they do and cannot not have". That said, how we talk about machines we talk to is a very interesting subject in its own right.

Best,

Valeriy

On Sat, Feb 23, 2019 at 4:06 PM Kent Walker <kent.walker@xxxxxxxxxxxxxx> wrote:

From the perspective of auditory scene analysis (ASA), notions of "machine listening" and "computational auditory scene analysis" are premature. The core mechanisms of ASA (integration, segmentation, sequential group, etc) can only exist in embodied, conscious organisms.

I think Al Bregman would agree that there is no listening without ecology, and there is no listening without culture. His work is both homage-to and a rebellion-against Gibson's post-behaviourist ecological psychology. The use of "machine listening" and "computational auditory scene analysis" from this scientific perspective is disingenuous and r.

There are many leading researchers in the areas of embodied cognition and ecological psychoacoustics on this list. That they remain quiet when words which are important to ASA are misused signals to me that we have entered a period of time similar to that when behaviourism was king: a time when numbers and funding supersedes scientific curiosity.

From: AUDITORY - Research in Auditory Perception <AUDITORY@xxxxxxxxxxxxxxx> on behalf of Valeriy Shafiro <firosha@xxxxxxxxx>
Reply-To: Valeriy Shafiro <firosha@xxxxxxxxx>
Date: Wednesday, 20 February, 2019 1:50 PM
To: "AUDITORY@xxxxxxxxxxxxxxx" <AUDITORY@xxxxxxxxxxxxxxx>
Subject: Summary: How we talk to machines that listen

Hello List,

I recently asked on this list about how we talk to machines that listen. Specifically, I asked whether the adjustments people make in their speech when talking to machines are in fact optimal for improving recognition. Here is a summary of replies to my post.

Several people commented on the importance of speakers' expectations about machine listening, which affect speech production as speakers try to optimize recognition based on what they believe the machine “needs” to hear.

Elle O’Brien observed that she tends to use clear speech in order to maximize information in the acoustic signal.

"I believe I talk in clear speech to my devices because I anticipate that automatic speech recognition can barely make use of contextual/semantic/visual clues, and so I need to maximize the information present in the acoustic signal. Because I really expect that's all the machine can go on. I think of it more as the kind of speech I'd use in an adverse listening environment, or with a non-native speaker, than a dog or a child. "

Sam Fisher shared a developer’s perspective based on his experience programming Alexa and Google Assistant:

“I have so far experienced that much of a typical user’s training for how to act with a conversational agent is contextual to dealing with forms and computers moreso than dogs or people. However, I intuit that there is a “nearest neighbor” effect at play, in which a user draws on their most closely analogous experience in order to inform their production. Therefore the reference point may vary geographically and generationally.

Most of the users I program for have had to deal with annoying touchtone phone systems that present us with a discrete set of outcomes and navigate us through a “flowchart.” I believe the user’s common assumptions come from a rough understanding that they are being pathed through such a flowchart and just need to say the right thing to get what they want.

In the design of next generation conversational interfaces that are a bit more capable of keeping up with a natural flow of dialogue, we try to disrupt the user’s expectations from the voice interfaces of the past that were unable to really use all the advantages of “conversation” as a means to efficient, cooperative, and frictionless information exchange. The user is trained to become comfortable speaking to it in a somewhat “easier” manner, putting more burden of comprehension on the machine conversant.”

Leon van Noorden, though not speaking about computer-directed speech specifically, also highlighted the point about speaker expectations about the listener that can lead to maladaptive adjustments in speech production.

"My wife is deaf since birth and an extremely good lipreader. She notices quite often that, when she says she is deaf, people go into an exaggerated speech mode, that does not help her to understand them."

As these examples illustrate, the adjustments in speech production that people make to improve speech recognition may or may not be optimal. While some adjustments may intuitively seem beneficial, like speaking louder to improve SNR, separating words and speaking slower to aid segmentation, or hyperarticulating specific sounds, there are some reasons to believe that they may not always help, or rather that it depends. Just as with other clear speech models, speakers may also differ considerably in what adjustments they make.

Bill Whitmer wrote:

“There’s been some recent work on ASR and vocal effort by Ricard Marxer while he was at Sheffield (currently at Marseilles) on this. As your question suggests, they did find that when trained on normal effort speech, the ASR fails with Lombard speech. But then when trained on material matched for vocal effort, the Lombard speech is better recognised. There was a whole lotta variance based on speaker, though – not too surprising.

https://www.sciencedirect.com/science/article/pii/S0167639317302674

I’d suspect that the Googles and Amazons have been trying to tackle this for a while, but perhaps the variance keeps them from announcing some grand solution. Your point about people adopting non-optimal strategies is most likely true. It also might have an age factor. There’s been recent work from Valerie Hazan and Outi Tuomainen at UCL about changes in speech production during a multi-person task. Hazan gave a talk I recall where the younger participants adopted multiple different alternate speaking styles when prompted to repeat, whereas the older participants had only one alternate. Can’t find that result (maybe I’m mis-remembering – this memory ain’t what it used to be), but these might be a good start:

https://www.sciencedirect.com/science/article/pii/S0378595517305166

https://asa.scitation.org/doi/10.1121/1.5053218

Also, there was some work showing how commercial ASR systems (chatbots) were attempting to use emotion detection, and changing their sequence if they detected anger/frustration in the news years back. But while the machines attempt new strategies, the human strategies seem less studied, unless it’s all being kept under lock-and-key by companies.”

Sarah Ferguson also spoke to these points and offered additional references:

“Sharon Oviatt had a couple of papers looking at something like this (references below) – talkers go into a “hyperspeech” when a computer mis-recognizes what they say.

Oviatt, S., Levow, G. A., Moreton, E., & MacEachern, M. (1998). Modeling global and focal hyperarticulation during human-computer error resolution. Journal of the Acoustical Society of America, 104, 3080-3098. doi:10.1121/1.423888

Oviatt, S., MacEachern, M., & Levow, G. A. (1998). Predicting hyperarticulate speech during human-computer error resolution. Speech Communication, 24, 87-110. doi:10.1016/S0167-6393(98)00005-3

I don’t know if the idea came from these papers or if I heard it somewhere else, but speech recognizers are (or used to be) trained up on citation-style speech, so hyperarticulation should make speech recognition worse. I’ve been surprised by how well Siri does and will sometimes try to “mess with” it to see how it behaves.”

And for those who are interested in learning more about this topic, Phil Green sent information about an upcoming workshop in Glasgow http://speech-interaction.org/chi2019/ , while Dana Urbanski suggested searching under “computer-directed speech” and Katherine Marcoux referenced a comprehensive review article that includes machine-directed speech in the context of “other speech modifications (infant-directed speech, Lombard speech, clear speech, etc.)”:

Cooke, M., King, S., Garnier, M., & Aubanel, V. (2014). The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language, 28(2), 543-571.

David Pisoni recommended another useful perspective paper by Liz Shriberg:

Shriberg, E. (2005). Spontaneous speech: How people really talk and why engineers should care. INTERSPEECH. https://web.stanford.edu/class/cs424p/shriberg05.pdf

Finally, adjustments in speech directed at machines can also reflect emotions and feelings that speakers experience during the interaction.

Elle O’Brien also suggested this interesting paper “about how children express politeness and frustration at machines (in this study, the computer is actually controlled by an experimenter, but kids think it is voice-controlled). http://www-bcf.usc.edu/~dbyrd/euro01-chaimkids_ed.pdf”

So it appears that the speech production adjustments people make when speaking to machines are a combination of efforts to improve recognition by machines (which may or may not be optimal) and possible affect-related production changes that arise dynamically in response to errors, the history of interactions with a specific device, and one’s appraisal of the machine listener's capabilities (this SNL Alexa skit captures some of these challenges: https://www.youtube.com/watch?v=YvT_gqs5ETk ). At least while we are working out the kinks in computer speech recognition and finding better ways to talk to machines, we are guaranteed a good and reliable supply of amusement and entertainment.

Thanks to everyone who responded!

Valeriy

---original post-----------

On Sun, Feb 3, 2019 at 11:16 AM Valeriy Shafiro <firosha@xxxxxxxxx> wrote:

Dear list,

I am wondering if anyone has any references or suggestions about this question. These days I hear more and more people talking to machines, e.g. Siri, Google, Alexa, and doing it in more and more places. Automatic speech recognition has improved tremendously, but still it seems to me that when people talk to machines they often switch into a different production mode. At times it may sound like talking to a (large) dog and sometimes like talking to a customer service agent in a land far away who is diligently trying to follow a script rather than listen to what you are saying. And I wonder whether adjustments that people make in their speech production when talking with machines in that mode are in fact optimal for improving recognition accuracy. Since machines are not processing speech in the same way as humans, I wonder if changes in speech production that make speech more recognizable for other people (or even pets) are always the same as they are for machines. In other words, do people tend to make optimal adjustments to make their speech more recognizable to machines? Or is it more like falling back on clear speech modes that work with other kinds of listeners (children, nonnative speakers, pets), or something in between?

I realize there is a lot to this question, but perhaps people have started looking into it. I am happy to collate references and replies and send them to the list.

Best,

Valeriy