Hello List,
I recently asked on this list about how we talk to machines that listen; specifically, whether the adjustments people make in their speech when talking to machines are in fact optimal for improving recognition. Here is a summary of the replies to my post.
Several people commented on how expectations about machine listening affect speakers' production, as speakers try to optimize recognition based on what they believe the machine “needs” to hear.
Elle O’Brien observed that she tends to use clear speech in order to maximize information in the acoustic signal.
"I believe I talk in clear speech to my devices because I anticipate that automatic speech recognition can barely make use of contextual/semantic/visual clues, and so I need to maximize the information present in the acoustic signal. Because I really expect that's all the machine can go on. I think of it more as the kind of speech I'd use in an adverse listening environment, or with a non-native speaker, than a dog or a child. "
Sam Fisher shared a developer’s perspective based on his experience programming Alexa and Google Assistant:
“I have so far experienced that much of a typical user’s training for how to act with a conversational agent is contextual to dealing with forms and computers moreso than dogs or people. However, I intuit that there is a “nearest neighbor” effect at play, in which a user draws on their most closely analogous experience in order to inform their production. Therefore the reference point may vary geographically and generationally.
Most of the users I program for have had to deal with annoying touchtone phone systems that present us with a discrete set of outcomes and navigate us through a “flowchart.” I believe the user’s common assumptions come from a rough understanding that they are being pathed through such a flowchart and just need to say the right thing to get what they want.
In the design of next generation conversational interfaces that are a bit more capable of keeping up with a natural flow of dialogue, we try to disrupt the user’s expectations from the voice interfaces of the past that were unable to really use all the advantages of “conversation” as a means to efficient, cooperative, and frictionless information exchange. The user is trained to become comfortable speaking to it in a somewhat “easier” manner, putting more burden of comprehension on the machine conversant.”
Leon van Noorden, though not speaking about computer-directed speech specifically, also highlighted how speakers' expectations about the listener can lead to maladaptive adjustments in speech production.
"My wife is deaf since birth and an extremely good lipreader. She notices quite often that, when she says she is deaf, people go into an exaggerated speech mode, that does not help her to understand them."
As these examples illustrate, the adjustments in speech production that people make to improve speech recognition may or may not be optimal. While some adjustments may intuitively seem beneficial, like speaking louder to improve SNR, separating words and speaking slower to aid segmentation, or hyperarticulating specific sounds, there are reasons to believe that they may not always help; rather, it depends. As with other clear speech models, speakers may also differ considerably in the adjustments they make.
Bill Whitmer wrote:
“There’s been some recent work on ASR and vocal effort by Ricard Marxer while he was at Sheffield (currently at Marseilles) on this. As your question suggests, they did find that when trained on normal effort speech, the ASR fails with Lombard speech. But then when trained on material matched for vocal effort, the Lombard speech is better recognised. There was a whole lotta variance based on speaker, though – not too surprising.
I’d suspect that the Googles and Amazons have been trying to tackle this for awhile, but perhaps the variance keeps them from announcing some grand solution. Your point about people adopting non-optimal strategies is most likely true. It also might have an age factor. There’s been recent work from Valerie Hazan and Outi Tuomainen at UCL about changes in speech production during a multi-person task. Hazan gave a talk I recall where the younger participants adopted multiple different alternate speaking styles when prompted to repeat, whereas the older participants had only one alternate. Can’t find that result (maybe I’m mis-remembering – this memory ain’t what it used to be), but these might be a good start
Also, there was some work showing how commercial ASR systems (chatbots) were attempting to use emotion detection, and changing their sequence if they detected anger/frustration in the news years back. But while the machines attempt new strategies, the human strategies seem less studied, unless it’s all being kept under lock-and-key by companies.”
Sarah Ferguson also spoke to these points and offered additional references:
“Sharon Oviatt had a couple of papers looking at something like this (references below) – talkers go into a “hyperspeech” when a computer mis-recognizes what they say.
Oviatt, S., Levow, G. A., Moreton, E., & MacEachern, M. (1998). Modeling global and focal hyperarticulation during human-computer error resolution. Journal of the Acoustical Society of America, 104, 3080-3098. doi:10.1121/1.423888
Oviatt, S., MacEachern, M., & Levow, G. A. (1998). Predicting hyperarticulate speech during human-computer error resolution. Speech Communication, 24, 87-110. doi:10.1016/S0167-6393(98)00005-3
I don’t know if the idea came from these papers or if I heard it somewhere else, but speech recognizers are (or used to be) trained up on citation-style speech, so hyperarticulation should make speech recognition worse. I’ve been surprised by how well Siri does and will sometimes try to “mess with” it to see how it behaves.”
For those interested in learning more about this topic, Phil Green sent information about an upcoming workshop in Glasgow
http://speech-interaction.org/chi2019/ , while Dana Urbanski suggested searching under “computer-directed speech” and Katherine Marcoux referenced a comprehensive review article that includes machine-directed speech in the context of “other speech modifications (infant direct speech, Lombard speech, clear speech, etc.)
Cooke, M., King, S., Garnier, M., & Aubanel, V. (2014). The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language, 28(2), 543-571."
David Pisoni recommended another useful perspective paper by Liz Shriberg:
Finally, adjustments in speech directed at machines can also reflect the emotions and feelings that speakers experience during the interaction.
Elle O’Brien also suggested this interesting paper “about how children express politeness and frustration at machines (in this study, the computer is actually controlled by an experimenter, but kids think it is voice-controlled).
http://www-bcf.usc.edu/~dbyrd/euro01-chaimkids_ed.pdf”
So it appears that the speech production adjustments people make when speaking to machines combine efforts to improve machine recognition (which may or may not be optimal) with possible affect-related production changes that arise dynamically in response to errors, the history of interactions with a specific device, and one’s appraisal of the machine listener's capabilities (this SNL Alexa skit captures some of these challenges:
https://www.youtube.com/watch?v=YvT_gqs5ETk ). At least while we are working out the kinks in computer speech recognition and finding better ways to talk to machines, we are guaranteed a reliable supply of amusement and entertainment.
Thanks to everyone who responded!
Valeriy
---original post-----------