Re: [AUDITORY] Tool for automatic syllable segmentation (Sarah Hawkins)


Subject: Re: [AUDITORY] Tool for automatic syllable segmentation
From:    Sarah Hawkins  <0000033950e3b3ae-dmarc-request@xxxxxxxx>
Date:    Tue, 24 Sep 2024 17:39:29 +0100

Hi Léo,

The following comments and references may help. Apologies for another long email. It's because I'm trying to write unambiguously, and to identify what may be hidden assumptions that can influence data and theory. (And because I'm unsure exactly what you mean by "speech rate context cues, not 'directly informative' cues".)

Production
For production only (i.e. measuring acoustic properties regardless of their perceptual salience), the duration of spread depends on the language, dialect, rate and style of speech, the particular speech sound/phoneme, and the part of the syllable it is in (broadly, onset vs coda/end).
  E.g. 1) For language: English sounds tend to spread further than Spanish.
       2) For particular speech sounds: English /s/ tends to be quite localised, whereas cues to stop voicing at the end of a syllable (lab vs lap) spread across the entire syllable, and /r/ has been measured up to 1000 ms before the acoustic nucleus of the corresponding phoneme. (That's /r/ the approximant, not the trill found in some English dialects.)
  A comprehensive study for standard southern British English:
    Coleman, J. S. (2003). Discovering the acoustic correlates of phonological contrasts. Journal of Phonetics, 31, 351-372. https://doi.org/10.1016/j.wocn.2003.10.001

Perception
Of course, whether these measurable acoustic correlates of phonemes are perceptually salient depends on all the above factors, as well as other very powerful ones. E.g.
  1) Whether the word is recognizable without the phoneme in question (essentially, predictability from context, broadly defined); see Richard Warren's phoneme restoration work ('legislature').
  2) Listener competence and expectations:
    Heinrich, A., Flory, Y., & Hawkins, S. (2010). Influence of English r-resonances on intelligibility of speech in noise for native English and German listeners. Speech Communication, 52, 1038-1055. https://doi.org/10.1016/j.specom.2010.09.009
  3) The particular type of speech and/or its position in the syllable and word (whose composition also matters). Many examples exist. Laura Dilley and colleagues' work is neat (though it would only work for the types of sounds she uses - unlikely for /s/ in those types of context - and may be what you wanted to exclude as 'speech rate context cues'):
    Morrill, T. H., Dilley, L. C., McAuley, J. D., & Pitt, M. A. (2014). Distal rhythm influences whether or not listeners hear a word in continuous speech: Support for a perceptual grouping hypothesis. Cognition, 131, 69-74. https://doi.org/10.1016/j.cognition.2013.12.006
    Morrill, T. H., Heffner, C. C., & Dilley, L. C. (2015). Interactions between distal speech rate, linguistic knowledge, and speech environment. Psychonomic Bulletin & Review, 22, 1451-1457. https://doi.org/10.3758/s13423-015-0820-9

The above references all use meaningful speech. Alain has mentioned misperceptions when listening to speech in a language you don't know so well (see also the Heinrich paper above). There are many studies of this type of thing. Fun ones include so-called mondegreen perceptions of song lyrics, common in both native and cross-linguistic contexts, e.g. English 'all my feelings grow' heard as German 'Oma fiel ins Klo'. You can find serious research on what influences these effects.

Finally, a comment on Alain's email: yes, segmentation involves some weird assumptions, but it can still be valid for many questions, especially if what you are doing involves relatively simple speech composition (obstruent-vowel sequences in clear speech being the best example), or compares like with like in different contexts. For this, there is usually more than one justifiable segmentation criterion. My own observations show that the particular criterion for segmentation rarely affects conclusions, as long as the chosen criterion is used reliably.

An exception to this rule is measurement of English vowel duration following phonologically voiced vs voiceless syllable-initial stops (e.g. b vs p). For 50 or more years from the 1960s or 70s, we standardly measured the end of the stop/beginning of the vowel from the stop burst (as Pierre wrote in this thread). Nowadays many people measure from the onset of phonation, on the mistaken assumption that all vowels are always voiced/phonated, and that any aspiration belongs to the consonant. (It belongs to both.) The consequence is that vowels following heavily aspirated voiceless stops (English syllable-initial p t k) measure as much shorter than those following the (unaspirated) voiced stops, b d g.

There is no 'right' segmentation criterion for this situation: what you do depends on why you are measuring, and what else you are measuring. The stop burst is better if your focus is articulation or you want to pool durations regardless of initial stop voicing (it reduces the variance). Onset of periodicity/voicing may be better if your interest is in rhythmic influences on perception.

This also illustrates Alain's and my wider point that segmentation criteria need to reflect the type of speech and the purpose of the segmentation.

I hope also that these types of comment illustrate the theoretical constraints that follow from using averages summed over wildly different phonological units and then generalised to natural connected speech, such as the claim that syllable duration is about 250 ms.

all the best
Sarah
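To make the size of the discrepancy concrete, here is a minimal sketch with invented landmark times for a single aspirated-/p/ token (not measurements from any real data), showing how far the two criteria Sarah contrasts can diverge:

```python
# Hypothetical landmark times (seconds) for one token of "pat",
# purely to illustrate the two segmentation criteria.
burst = 0.100           # stop release burst
voicing_onset = 0.165   # onset of periodicity after ~65 ms of aspiration
vowel_end = 0.320       # offset of the vowel

dur_from_burst = vowel_end - burst            # 0.220 s
dur_from_voicing = vowel_end - voicing_onset  # 0.155 s

# Under the phonation-onset criterion the vowel after aspirated /p/
# comes out ~65 ms shorter than it would after unaspirated /b/,
# where burst and voicing onset nearly coincide.
print(dur_from_burst, dur_from_voicing)
```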
On 2024-09-23 11:43, Léo Varnet wrote:
>
> Dear all,
>
> I'd like to take Alain's response as a starting point for another sub-thread of discussion.
>
> Alain, I assume that you are referring to the research on automatic phoneme classification based on temporal patterns, which typically uses a [-500 ms; +500 ms] window. I'm curious about the maximum distance a phonetic cue can be from the nucleus of the corresponding phoneme. Does anybody on the List have insights on this? In my own experiments I have observed that in some cases cues as far as 800 ms before the target sound can influence phoneme categorization -- but these were speech rate context cues, not "directly informative" cues.
>
> Best
>
> Léo
>
> Léo Varnet - Chercheur CNRS
> Laboratoire des Systèmes Perceptifs, UMR 8248
> École Normale Supérieure
> 29, rue d'Ulm - 75005 Paris Cedex 05
> Tél. : (+33)6 33 93 29 34
> https://lsp.dec.ens.fr/en/member/1066/leo-varnet
> https://dbao.leo-varnet.fr/
> Le 21/09/2024 à 11:51, Alain de Cheveigne a écrit :
>> Curiously, no one yet pointed out that "segmentation" itself is ill-defined. Syllables, like phonemes, are defined at the phonological level, which is abstract and distinct from the acoustics.
>>
>> Acoustic cues to a phoneme (or syllable) may come from an extended interval of sound that overlaps with the cues that signal the next phoneme. I seem to recall papers by Hynek Hermansky that found a support duration on the order of 1 s. If so, the "segmentation boundary" between phonemes/syllables is necessarily ill-defined.
>>
>> In practice, it may be possible to define a segmentation that "concentrates" information pertinent to each phoneme within a segment and minimizes "spillover" to adjacent phonemes, but we should not be surprised if it works less well for some boundaries, or if different methods give different results.
>>
>> When listening to Japanese practice tapes, I remember noticing that the word "futatsu" (two) sounded rather like "uftatsu", suggesting that the acoustic-to-phonological mapping (based on my native phonological system) could be loose enough to allow for a swap.
>>
>> Alain
>>
>>> On 20 Sep 2024, at 11:29, Jan Schnupp <000000e042a1ec30-dmarc-request@xxxxxxxx> wrote:
>>>
>>> Dear Remy,
>>>
>>> it might be useful for us to know where your meaningless CV syllable stimuli come from.
>>> But in any event, if you are any good at coding you are likely better off computing parameters of the recording waveforms directly and applying criteria to those. CV syllables have an "energy arc" such that the V is invariably louder than the C. In speech there are rarely silent gaps between syllables, so you may be looking at a CVCVCVCV... stream where the only "easy" handle on the syllable boundary is likely to be the end of the vowel, which should be recognizable by a marked decline in acoustic energy. You can quantify this with some running RMS value (perhaps after low-pass filtering, given that consonants rarely have much low-frequency energy). If that's not accurate or reliable enough, then things are likely to get a lot trickier. You could look for voicing in a running autocorrelation as an additional cue, given that all vowels are voiced but only some consonants are.
>>> How many of these do you have to process? If the number isn't huge, it may be quicker to find the boundaries "by ear" than to develop a piece of computer code. The best way forward really depends enormously on the nature of your original stimulus set.
>>>
>>> Best wishes,
>>>
>>> Jan
>>>
>>> ---------------------------------------
>>> Prof Jan Schnupp
>>> Gerald Choa Neuroscience Institute
>>> The Chinese University of Hong Kong
>>> Sha Tin
>>> Hong Kong
>>>
>>> https://auditoryneuroscience.com
>>> http://jan.schnupp.net
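Below is a minimal sketch of the envelope-based approach Jan outlines, assuming a mono WAV file of a continuous CV stream. The file name, the 1 kHz low-pass cutoff, the window and hop sizes, and the prominence and voicing thresholds are all illustrative guesses that would need tuning; the autocorrelation voicing check is deliberately crude.

```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfiltfilt, find_peaks

wav, fs = sf.read("syllable_stream.wav")   # hypothetical input file
if wav.ndim > 1:
    wav = wav.mean(axis=1)                 # mix down to mono

# Low-pass around 1 kHz so that vowel energy dominates the envelope.
sos = butter(4, 1000, btype="low", fs=fs, output="sos")
low = sosfiltfilt(sos, wav)

# Running RMS in ~25 ms windows, hopped every 5 ms.
win, hop = int(0.025 * fs), int(0.005 * fs)
frames = np.lib.stride_tricks.sliding_window_view(low, win)[::hop]
rms = np.sqrt((frames ** 2).mean(axis=1))
t = (np.arange(len(rms)) * hop + win / 2) / fs

# Candidate syllable boundaries: marked dips in the energy envelope
# (found as peaks of the negated RMS), at least 80 ms (16 hops) apart.
dips, _ = find_peaks(-rms, prominence=0.3 * rms.std(), distance=16)

def voiced(frame, fs, threshold=0.3):
    """Crude voicing check: normalized autocorrelation peak
    in the 60-400 Hz lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return False
    ac = ac / ac[0]
    lo, hi = int(fs / 400), int(fs / 60)
    return ac[lo:hi].max() > threshold

for d in dips:
    print(f"candidate boundary at {t[d]:.3f} s, "
          f"voiced: {voiced(frames[d], fs)}")
```

Taking dips rather than peaks follows Jan's point that the clearest landmark is the drop in energy at the end of the vowel; picking peaks of the same envelope would give approximate syllable nuclei instead.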
>>> On Thu, 19 Sept 2024 at 12:19, Rémy MASSON <remy.masson@xxxxxxxx> wrote:
>>> Hello AUDITORY list,
>>> We are attempting to do automatic syllable segmentation on a collection of sound files that we use in an experiment. Our stimuli are a rapid sequence of syllables (all beginning with a consonant and ending with a vowel) with no underlying semantic meaning and with no pauses. We would like to automatically extract the syllable/speech rate and obtain the timestamps for each syllable onset.
>>> We are a bit lost on which tool to use. We tried PRAAT with the Syllable Nuclei v3 script, the software VoiceLab and the website WebMaus. Unfortunately, for each of them the estimated total number of syllables did not consistently match what we were able to count manually, despite adjusting the parameters.
>>> Do you have any advice on how to go further? Do you have any experience in syllable onset extraction?
>>> Thank you for your understanding,
>>> Rémy MASSON
>>> Research Engineer
>>> Laboratory "Neural coding and neuroengineering of human speech functions" (NeuroSpeech)
>>> Institut de l'Audition – Institut Pasteur (Paris)
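For batch-checking stimuli like Rémy's, here is a rough sketch of counting intensity peaks as candidate syllable nuclei with Parselmouth (a Python interface to Praat). The file name and the prominence and spacing thresholds are placeholder guesses to be tuned against manual counts, and the peaks are nuclei rather than onsets.

```python
import numpy as np
import parselmouth
from scipy.signal import find_peaks

snd = parselmouth.Sound("stimulus.wav")  # placeholder file name
intensity = snd.to_intensity()           # Praat intensity contour, default settings
times = intensity.xs()                   # frame times (s)
db = intensity.values[0]                 # intensity in dB per frame

# Peaks at least 2 dB above their surroundings and >= 100 ms apart,
# taken as candidate syllable nuclei.
dt = times[1] - times[0]
peaks, _ = find_peaks(db, prominence=2, distance=max(1, int(0.1 / dt)))

print(f"{len(peaks)} candidate syllables")
print("nucleus times (s):", np.round(times[peaks], 3))
```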
<a class=3D"moz-= txt-link-freetext" href=3D"https://doi.org/10.1016/j.wocn.2003.10.001" ta= rget=3D"_BLANK"> https://doi.org/10.1016/j.wocn.2003.10.001</a> </font></font><b= r> <br> Perception<br> Of course, whether these measurable acoustic correlates of phonemes are perceptually salient depends on all the above factors, as well as other very powerful ones.<br> &nbsp;&nbsp;&nbsp; E.g.<br> &nbsp;1) whether the word is recognizable without the phoneme in ques= tion (essentially, predictability from context, broadly defined) <br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nb= sp;&nbsp; Richard Warren's phoneme restoration work - 'legislature'&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; <br> &nbsp;2) listener competence and expectations<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font fa= ce=3D"Times New Roman, sans-serif"><font size=3D"2" style=3D"font-size: 1= 1pt">Heinrich, A., Flory, Y., &amp; Hawkins, S. (2010). Influence of English r-resonances on intelligibility of speech in noise for native English and German listeners. </font></font><font face=3D"Times New Roman= , sans-serif"><font size=3D"2" style=3D"font-size: 11pt"><i>Speech Communication</i></font></font><font face=3D"Times New Roman, s= ans-serif"><font size=3D"2" style=3D"font-size: 11pt">, </font></font><font face=3D"Times New Roman, sans-serif"><font size= =3D"2" style=3D"font-size: 11pt"><i>52</i></font></font><font face=3D"Tim= es New Roman, sans-serif"><font size=3D"2" style=3D"font-size: 11pt">, 1038-1055. <a href=3D"https://doi.org/10.1016/j.specom.2010.09.00= 9" target=3D"_BLANK" class=3D"moz-txt-link-freetext">https://doi.org/10.1= 016/j.specom.2010.09.009</a></font></font><font face=3D"Times New Roman, = sans-serif"><font size=3D"3" style=3D"font-size:12pt"> </font></font><br> &nbsp; 3)&nbsp;&nbsp; the particular type of speech and/or its positi= on in the syllable and word (whose composition also matter. Many examples exist.&nbsp; Laura Dilley and colleagues' work is neat (though would = only work for the types of sounds she uses - unlikely for /s/ in those types of context, and may be what you wanted to exclude in speech rate context cues')<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp; M= orrill, T. H., Dilley, L. C., McAuley, J. D., &amp; Pitt, M. A. (2014). Distal rhythm influences whether or not listeners hear a word in continuous speech: Support for a perceptual grouping hypothesis. Cognition, 131, 69-74. <a class=3D"moz-txt-link-freetext" href=3D"https://doi.org/10.1016/j.= cognition.2013.12.006" target=3D"_BLANK"> https://doi.org/10.1016/j.cognition.2013.12.006</a> <br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nb= sp;&nbsp; Morrill, T. H., Heffner, C. C., &amp; Dilley, L. C. (2015). Interactions between distal speech rate, linguistic knowledge, and speech environment&nbsp;&nbsp;&nbsp; Psychonomic Bulle= tin &amp; Review, 22, 1451=E2=80=931457. <a class=3D"moz-txt-link-freetext" href=3D"https://doi.org/10.3758/s1= 3423-015-0820-9" target=3D"_BLANK"> https://doi.org/10.3758/s13423-015-0820-9</a> <br> &nbsp; &nbsp;&nbsp; <br> <br> The above references all use meaningful speech. Alain has mentioned misperceptions when listening to speech in a language you don't know so well (see also Heinrich paper above). There are many studies of this type of thing. Fun ones include so-called mondegreen perceptions of song lyrics, common in both native and cross-linguistic contexts, e.g. 
English 'all my feelings grow' heard as German 'oma fiel ins klo' &nbsp;&nbsp; &nbsp; You can find serious= research on what influences these effects.<br> <br> Finally, a comment on Alain's email: yes, segmentation involves some weird assumptions, but can still be valid for many questions, especially if what you are doing is on relatively simple speech composition (obstruent-vowel sequences in clear speech being the best example), or comparing like with like in different contexts. For this, there is usually more than one justifiable segmentation criterion. My own observations show that the particular criterion for segmentation rarely affects conclusions, as long as the chosen criterion is used reliably. <br> &nbsp;An exception to this rule is measurements of English vowel dura= tion following phonologically voiced vs voiceless syllable-initial stops (e.g. b vs p). For 50 or more years from the 1960s or 70s, we standardly measured the end of the stop/beginning of the vowel from the stop burst (as Pierre wrote in this thread). Nowadays many people measure from the onset of phonation, on the mistaken assumption that all vowels are always voiced/phonated, and that any aspiration belongs to the consonant. (It belongs to both.) The consequence is that vowels following heavily aspirated voiceless stops (English syllable-initial p t k) measure as much shorter than those following the (unaspirated) voiced stops, b d g.<br> &nbsp;&nbsp;&nbsp; There is no 'right' segmentation criterion for thi= s situation: what you do depends on why you are measuring, and what else you are measuring. The stop burst is better if your focus is articulation or you want to pool durations regardless of initial stop voicing (reduces the variance). From onset of periodicity/voicing may be better if your interest is on rhythmic influences on perception.<br> This also illustrates Alain's and my wider points that segmentation criteria need to reflect the type of speech and the purpose of the segmentation.<br> I hope also that these types of comment illustrate the theoretical constraints that follow from using averages summed over wildly different phonological units, and then generalised to natural connected speech, such as that syllable duration is about 250 ms.<br> <br> all the best<br> Sarah<br> <br> <br> <br> <div class=3D"moz-cite-prefix">On 2024-09-23 11:43, L=C3=A9o Varnet w= rote:<br> </div> <blockquote type=3D"cite"> <p>Dear all,</p> <p>I'd like to take Alain's response as a starting point for another sub-thread of discussion.&nbsp;</p> <p>Alain, I assume that you are referring to the research on automatic phoneme classification based on temporal patterns, which typically use a [-500ms; +500ms] window. I'm curious about the maximum distance a phonetic cue can be from the nucleus of the corresponding phoneme.&nbsp;Does anybody in the List have insights on this? In my own experiments I have observed that in some cases cues as far as 800 ms before the target sound can influence phoneme categorization -- but these were speech rate context cues, not &quot;directly informative&quot; cues.&nbsp;</p= > <p>Best</p> <p>L=C3=A9o<br> </p> <p><br> </p> <div class=3D"moz-signature"> <p style=3D"color:gray">L=C3=A9o Varnet - Chercheur CNRS<br> Laboratoire des Syst=C3=A8mes Perceptifs, UMR 8248<br> =C3=89cole Normale Sup=C3=A9rieure<br> 29, rue d'Ulm - 75005 Paris Cedex 05<br> T=C3=A9l. 
: (+33)6 33 93 29 34<br> <a href=3D"https://lsp.dec.ens.fr/en/member/1066/leo-varnet" cl= ass=3D"moz-txt-link-freetext" target=3D"_BLANK">https://lsp.dec.ens.fr/en= /member/1066/leo-varnet</a><br> <a href=3D"https://dbao.leo-varnet.fr/" class=3D"moz-txt-link-f= reetext" target=3D"_BLANK">https://dbao.leo-varnet.fr/</a><br> </p> </div> <div class=3D"moz-cite-prefix">Le 21/09/2024 =C3=A0 11:51, Alain de Cheveigne a =C3=A9crit&nbsp;:<br> </div> <blockquote type=3D"cite"> <pre class=3D"moz-quote-pre">Curiously, no one yet pointed out th= at &quot;segmentation&quot; itself is ill-defined. Syllables, like phone= mes, are defined at the phonological level which is abstract and distinct= from the acoustics. =20 Acoustic cues to a phoneme (or syllable) may come from an extended interv= al of sound that overlaps with the cues that signal the next phoneme. I s= eem to recall papers by Hynek Hermansky that found a support duration on = the order of 1s. If so, the &quot;segmentation boundary&quot; between pho= nemes/syllables is necessarily ill-defined.=20 In practice, it may be possible to define a segmentation that &quot;conce= ntrates&quot; information pertinent to each phoneme within a segment and = minimizes &quot;spillover&quot; to adjacent phonemes, but we should not b= e surprised if it works less well for some boundaries, or if different me= thods give different results. When listening to Japanese practice tapes, I remember noticing that the w= ord &quot;futatsu&quot; (two) sounded rather like &quot;uftatsu&quot;, su= ggesting that the acoustic-to-phonological mapping (based on my native ph= onological system) could be loose enough to allow for a swap.=20 Alain=20 to be categorized </pre> <blockquote type=3D"cite"> <pre class=3D"moz-quote-pre">On 20 Sep 2024, at 11:29, Jan Schn= upp <a class=3D"moz-txt-link-rfc2396E" href=3D"mailto:000000e042a1ec30-dm= arc-request@xxxxxxxx">&lt;000000e042a1ec30-dmarc-request@xxxxxxxx= ILL.CA&gt;</a> wrote: Dear Remy, it might be useful for us to know where your meaningless CV syllable stim= uli come from.=20 But in any event, if you are any good at coding you are likely better off= working directly computing parameters of the recording waveforms and app= ly criteria to those. CV syllables have an &quot;energy arc&quot; such th= at the V is invariably louder than the C. In speech there are rarely sile= nt gaps between syllables, so you may be looking at a CVCVCVCV... stream = where the only &quot;easy&quot; handle on the syllable boundary is likely= to be the end of end of the vowel, which should be recognizable by a mar= ked decline in acoustic energy, which you can quantify by some running RM= S value (perhaps after low-pass filtering given that consonants rarely ha= ve much low frequency energy). If that's not accurate or reliable enough = then things are likely to get a lot trickier. You could look for voicing = in a running autocorrelation as an additional cue given that all vowels a= re voiced but only some consonants are.=20 How many of these do you have to process? If the number isn't huge, it ma= y be quicker to find the boundaries &quot;by ear&quot; than trying to dev= elop a piece of computer code. The best way forward really depends enormo= usly on the nature of your original stimulus set. 
o Best wishes, Jan --------------------------------------- Prof Jan Schnupp Gerald Choa Neuroscience Institute The Chinese University of Hong Kong Sha Tin Hong Kong <a class=3D"moz-txt-link-freetext" href=3D"https://auditoryneuroscience.c= om" target=3D"_BLANK">https://auditoryneuroscience.com</a> <a class=3D"moz-txt-link-freetext" href=3D"http://jan.schnupp.net" target= =3D"_BLANK">http://jan.schnupp.net</a> On Thu, 19 Sept 2024 at 12:19, R=C3=A9my MASSON <a class=3D"moz-txt-link-= rfc2396E" href=3D"mailto:remy.masson@xxxxxxxx">&lt;remy.masson@xxxxxxxx= fr&gt;</a> wrote: Hello AUDITORY list, We are attempting to do automatic syllable segmentation on a collection = of sound files that we use in an experiment. Our stimuli are a rapid sequ= ence of syllables (all beginning with a consonant and ending with a vowel= ) with no underlying semantic meaning and with no pauses. We would like t= o automatically extract the syllable/speech rate and obtain the timestamp= s for each syllable onset. We are a bit lost on which tool to use. We tried PRAAT with the Syllable= Nuclei v3 script, the software VoiceLab and the website WebMaus. Unfortu= nately, for each of them their estimation of the total number of syllable= s did not consistently match what we were able to count manually, despite= toggling with the parameters. =20 Do you have any advice on how to go further? Do you have any experience = in syllable onset extraction? Thank you for your understanding, R=C3=A9my MASSON Research Engineer Laboratory &quot;Neural coding and neuroengineering of human speech funct= ions&quot; (NeuroSpeech) Institut de l=E2=80=99Audition =E2=80=93 Institut Pasteur (Paris)&lt;imag= e001.jpg&gt;=20 </pre> </blockquote> </blockquote> </blockquote> <br> </body> </html> --------------EMZZ9OcUlEEUH20AGJjUWpag--


This message came from the mail archive
postings/2024/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University