
[AUDITORY] Summary: How we talk to machines that listen



Hello List, 

I recently asked on this list about how we talk to machines that listen: specifically, whether the adjustments people make in their speech when talking to machines are in fact optimal for improving recognition. Here is a summary of the replies to my post.

Several people commented on the importance of speakers' expectations about machine listening, which shape production as speakers try to optimize recognition based on what they believe the machine "needs" to hear.

Elle O’Brien observed that she tends to use clear speech in order to maximize information in the acoustic signal. 
"I believe I talk in clear speech to my devices because I anticipate that automatic speech recognition can barely make use of contextual/semantic/visual clues, and so I need to maximize the information present in the acoustic signal. Because I really expect that's all the machine can go on. I think of it more as the kind of speech I'd use in an adverse listening environment, or with a non-native speaker, than a dog or a child. "

Sam Fisher shared a developer's perspective based on his experience programming Alexa and Google Assistant:
“I have so far experienced that much of a typical user’s training for how to act with a conversational agent is contextual to dealing with forms and computers more so than dogs or people. However, I intuit that there is a “nearest neighbor” effect at play, in which a user draws on their most closely analogous experience in order to inform their production. Therefore the reference point may vary geographically and generationally.

Most of the users I program for have had to deal with annoying touchtone phone systems that present us with a discrete set of outcomes and navigate us through a “flowchart.” I believe the user’s common assumptions come from a rough understanding that they are being pathed through such a flowchart and just need to say the right thing to get what they want.

In the design of next generation conversational interfaces that are a bit more capable of keeping up with a natural flow of dialogue, we try to disrupt the user’s expectations from the voice interfaces of the past that were unable to really use all the advantages of “conversation” as a means to efficient, cooperative, and frictionless information exchange. The user is trained to become comfortable speaking to it in a somewhat “easier” manner, putting more burden of comprehension on the machine conversant.”

Leon van Noorden, though not speaking about computer-directed speech specifically, also highlighted how speaker expectations about the listener can lead to maladaptive adjustments in speech production.
"My wife is deaf since birth and an extremely good lipreader. She notices quite often that, when she says she is deaf, people go into an exaggerated speech mode, that does not help her to understand them."

As these examples illustrate, the adjustments in speech production that people make to improve speech recognition may or may not be optimal. While some adjustments may intuitively seem beneficial, such as speaking louder to improve SNR, pausing between words and speaking more slowly to aid segmentation, or hyperarticulating specific sounds, there is reason to believe that they do not always help. Or rather, that it depends. As with other clear speech models, speakers may also differ considerably in what adjustments they make.

Bill Whitmer wrote: 
“There’s been some recent work on ASR and vocal effort by Ricard Marxer while he was at Sheffield (currently at Marseilles) on this. As your question suggests, they did find that when trained on normal effort speech, the ASR fails with Lombard speech. But then when trained on material matched for vocal effort, the Lombard speech is better recognised. There was a whole lotta variance based on speaker, though – not too surprising.


 I’d suspect that the Googles and Amazons have been trying to tackle this for a while, but perhaps the variance keeps them from announcing some grand solution. Your point about people adopting non-optimal strategies is most likely true. It also might have an age factor. There’s been recent work from Valerie Hazan and Outi Tuomainen at UCL about changes in speech production during a multi-person task. Hazan gave a talk I recall where the younger participants adopted multiple different alternate speaking styles when prompted to repeat, whereas the older participants had only one alternate. Can’t find that result (maybe I’m mis-remembering – this memory ain’t what it used to be), but these might be a good start.


 Also, there was some work showing how commercial ASR systems (chatbots) were attempting to use emotion detection, and changing their sequence if they detected anger/frustration in the news years back. But while the machines attempt new strategies, the human strategies seem less studied, unless it’s all being kept under lock-and-key by companies.”

Sarah Ferguson also spoke to these points and offered additional references:
“Sharon Oviatt had a couple of papers looking at something like this (references below) – talkers go into a “hyperspeech” when a computer mis-recognizes what they say.
 
Oviatt, S., Levow, G. A., Moreton, E., & MacEachern, M. (1998). Modeling global and focal hyperarticulation during human-computer error resolution. Journal of the Acoustical Society of America, 104, 3080-3098. doi:10.1121/1.423888
 
Oviatt, S., MacEachern, M., & Levow, G. A. (1998). Predicting hyperarticulate speech during human-computer error resolution. Speech Communication, 24, 87-110. doi:10.1016/S0167-6393(98)00005-3
 
I don’t know if the idea came from these papers or if I heard it somewhere else, but speech recognizers are (or used to be) trained up on citation-style speech, so hyperarticulation should make speech recognition worse. I’ve been surprised by how well Siri does and will sometimes try to “mess with” it to see how it behaves.”

And for those who are interested in learning more about this topic, Phil Green sent information about an upcoming workshop in Glasgow (http://speech-interaction.org/chi2019/), Dana Urbanski suggested searching under “computer-directed speech,” and Katherine Marcoux referenced a comprehensive review article that discusses machine-directed speech in the context of other speech modifications (infant-directed speech, Lombard speech, clear speech, etc.):

Cooke, M., King, S., Garnier, M., & Aubanel, V. (2014). The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language, 28(2), 543-571.

David Pisoni recommended another useful perspective paper by Liz Shriberg:

Shriberg, E. (2005). Spontaneous speech: How people really talk and why engineers should care. Proceedings of INTERSPEECH 2005. https://web.stanford.edu/class/cs424p/shriberg05.pdf


Finally, adjustments in speech directed at machines can also reflect the emotions and feelings that speakers experience during the interaction.

Elle O’Brien also suggested an interesting paper about how children express politeness and frustration toward machines (in this study, the computer is actually controlled by an experimenter, but the children believe it is voice-controlled): http://www-bcf.usc.edu/~dbyrd/euro01-chaimkids_ed.pdf

So it appears that the speech production adjustments people make when speaking to machines combine efforts to improve machine recognition (which may or may not be optimal) with affect-related changes in production that arise dynamically in response to errors, the history of interactions with a specific device, and one’s appraisal of the machine listener’s capabilities (this SNL Alexa skit captures some of these challenges: https://www.youtube.com/watch?v=YvT_gqs5ETk). At least while we work out the kinks in computer speech recognition and find better ways to talk to machines, we are guaranteed a good and reliable supply of amusement and entertainment.
 
Thanks to everyone who responded!

Valeriy

---original post-----------


On Sun, Feb 3, 2019 at 11:16 AM Valeriy Shafiro <firosha@xxxxxxxxx> wrote:

Dear list,

I am wondering if anyone has any references or suggestions about this question. These days I hear more and more people talking to machines, e.g. Siri, Google, Alexa, etc., and doing it in more and more places. Automatic speech recognition has improved tremendously, but it still seems to me that when people talk to machines they often switch into a different production mode. At times it may sound like talking to a (large) dog, and sometimes like talking to a customer service agent in a land far away who is diligently trying to follow a script rather than listen to what you are saying. And I wonder whether the adjustments that people make in their speech production when talking with machines in that mode are in fact optimal for improving recognition accuracy. Since machines do not process speech in the same way as humans, I wonder if the changes in speech production that make speech more recognizable for other people (or even pets) are always the same as those for machines. In other words, do people tend to make the most optimal adjustments to make their speech more recognizable to machines? Or is it more like falling back on clear speech modes that work with other kinds of listeners (children, nonnative speakers, pets), or something in between?

I realize there is a lot to this question, but perhaps people have started looking into it.  I am happy to collate references and replies and send to the list. 

Best,

 

Valeriy