Abstract:
There is a long-standing debate over whether speech is perceived through steady states or transitions [D. Kewley-Port and D. Pisoni, J. Acoust. Soc. Am. 73, 1779--1793 (1983)]. Furui [S. Furui, J. Acoust. Soc. Am. 80, 1016--1025 (1986)], gating Japanese monomoraic syllables, found that listeners identified syllables correctly when they heard the point of maximal spectral transition (maximum D, a measure based on cepstral coefficients). This paper extends that work to English and to a wider variety of transitions. A total of 128 English words, representing transitions between all manners of articulation, were gated at 20-ms intervals. VC, CC, and VV transitions were used in addition to Furui's CV and CjV syllables. The results show that for almost all transitions (diphones), there is a clear point at which listeners' identification of the second segment becomes highly accurate. For many diphones (54%), this point is within 15 ms of the maximum D. However, for other diphones (especially those with far-reaching perceptual cues), the point of recognition is farther from the maximum D. This paper will discuss reasons for the discrepancies between Furui's experiment and the current one, including language-specific factors, and will address implications for the dynamic theory of speech perception.