I must admit to being surprised by the surprise engendered by this video. Anyone who was around during the early days of text-to-speech synthesis is very aware of the danger of presenting the text in advance of, or simultaneously with, the generated speech. The intelligibility of the resulting synthesis could be zero without the 'prior' and 100% with the visual cue.
So, given that we know perception involves the integration of top-down expectations with bottom-up evidence (going right back to Richard Warren's work on the 'phoneme restoration effect'), why is this TikTok demo surprising? Or maybe I'm missing something?
Best wishes
Roger
--------------------------------------------------------------------------------------------
Prof ROGER K MOORE* BA(Hons) MSc PhD FIOA FISCA MIET
Chair of Spoken Language Processing
Vocal Interactivity Lab (VILab), Sheffield Robotics
Speech & Hearing Research Group (SPandH)
Department of Computer Science, UNIVERSITY OF SHEFFIELD
Regent Court, 211 Portobello, Sheffield, S1 4DP, UK
--------------------------------------------------------------------------------------------