[AUDITORY] Fwd: Tool for automatic syllable segmentation (Oded Ghitza)


Subject: [AUDITORY] Fwd: Tool for automatic syllable segmentation
From:    Oded Ghitza  <0000033c6d1eb24a-dmarc-request@xxxxxxxx>
Date:    Sun, 22 Sep 2024 10:10:56 -0400

Humbly, here is my take on Alain's cautionary note:
https://doi.org/10.3389/fpsyg.2013.00138
--
Oded.

On Sun, Sep 22, 2024 at 12:08 AM Alain de Cheveigne
<alain.de.cheveigne@xxxxxxxx> wrote:

> Curiously, no one yet pointed out that "segmentation" itself is
> ill-defined. Syllables, like phonemes, are defined at the phonological
> level, which is abstract and distinct from the acoustics.
>
> Acoustic cues to a phoneme (or syllable) may come from an extended
> interval of sound that overlaps with the cues that signal the next
> phoneme. I seem to recall papers by Hynek Hermansky that found a support
> duration on the order of 1 s. If so, the "segmentation boundary" between
> phonemes/syllables is necessarily ill-defined.
>
> In practice, it may be possible to define a segmentation that
> "concentrates" the information pertinent to each phoneme within a
> segment and minimizes "spillover" to adjacent phonemes, but we should
> not be surprised if it works less well for some boundaries, or if
> different methods give different results.
>
> When listening to Japanese practice tapes, I remember noticing that the
> word "futatsu" (two) sounded rather like "uftatsu", suggesting that the
> acoustic-to-phonological mapping (based on my native phonological
> system) could be loose enough to allow for a swap.
>
> Alain
>
>
> > On 20 Sep 2024, at 11:29, Jan Schnupp
> > <000000e042a1ec30-dmarc-request@xxxxxxxx> wrote:
> >
> > Dear Remy,
> >
> > it might be useful for us to know where your meaningless CV syllable
> > stimuli come from.
> > But in any event, if you are any good at coding, you are likely better
> > off computing parameters directly from the recorded waveforms and
> > applying criteria to those. CV syllables have an "energy arc" such
> > that the V is invariably louder than the C. In speech there are rarely
> > silent gaps between syllables, so you may be looking at a
> > CVCVCVCV... stream where the only "easy" handle on the syllable
> > boundary is likely to be the end of the vowel, which should be
> > recognizable by a marked decline in acoustic energy. You can quantify
> > that with a running RMS value (perhaps after low-pass filtering, given
> > that consonants rarely have much low-frequency energy). If that's not
> > accurate or reliable enough, things are likely to get a lot trickier.
> > You could look for voicing in a running autocorrelation as an
> > additional cue, given that all vowels are voiced but only some
> > consonants are. (A sketch of both cues appears below, after the
> > thread.)
> > How many of these do you have to process? If the number isn't huge, it
> > may be quicker to find the boundaries "by ear" than to develop a piece
> > of computer code. The best way forward really depends enormously on
> > the nature of your original stimulus set.
> >
> > Best wishes,
> >
> > Jan
> >
> > ---------------------------------------
> > Prof Jan Schnupp
> > Gerald Choa Neuroscience Institute
> > The Chinese University of Hong Kong
> > Sha Tin
> > Hong Kong
> >
> > https://auditoryneuroscience.com
> > http://jan.schnupp.net
> >
> >
> > On Thu, 19 Sept 2024 at 12:19, Rémy MASSON <remy.masson@xxxxxxxx>
> > wrote:
> > Hello AUDITORY list,
> > We are attempting to do automatic syllable segmentation on a
> > collection of sound files that we use in an experiment. Our stimuli
> > are a rapid sequence of syllables (all beginning with a consonant and
> > ending with a vowel), with no underlying semantic meaning and no
> > pauses. We would like to automatically extract the syllable/speech
> > rate and obtain the timestamps for each syllable onset.
> > We are a bit lost as to which tool to use. We tried Praat with the
> > Syllable Nuclei v3 script, the software VoiceLab, and the website
> > WebMaus. Unfortunately, none of their estimates of the total number of
> > syllables consistently matched what we were able to count manually,
> > despite tweaking the parameters. (A rough Parselmouth take on the
> > Syllable Nuclei idea appears at the end of this thread.)
> > Do you have any advice on how to go further? Do you have any
> > experience in syllable onset extraction?
> > Thank you for your understanding,
> > Rémy MASSON
> > Research Engineer
> > Laboratory "Neural coding and neuroengineering of human speech
> > functions" (NeuroSpeech)
> > Institut de l'Audition – Institut Pasteur (Paris)
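
For anyone wanting to try Jan's do-it-yourself route, here is a minimal
Python sketch (NumPy/SciPy) of the two cues he describes: a running RMS
envelope on a low-pass-filtered signal to catch the energy decline at each
vowel's end, gated by a running-autocorrelation voicing check. The file
name stimulus.wav, the 1 kHz cutoff, the window lengths, and both
thresholds are illustrative assumptions to be tuned against a few manually
labelled files, not validated settings.

# Minimal sketch of the RMS + voicing idea; all constants are placeholder
# assumptions to tune, and "stimulus.wav" is a stand-in file name.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def running_rms(x, win):
    """Root-mean-square of x in a sliding window of `win` samples."""
    return np.sqrt(np.convolve(x ** 2, np.ones(win) / win, mode="same"))

def voicing_strength(x, fs, t, win=0.030, fmin=75.0, fmax=400.0):
    """Peak of the normalized autocorrelation in the pitch-lag range
    around time t (s): near 1 when voiced, lower when unvoiced."""
    n0 = max(0, int((t - win / 2) * fs))
    seg = x[n0:n0 + int(win * fs)]
    seg = seg - seg.mean()
    if seg.size == 0 or np.max(np.abs(seg)) < 1e-6:
        return 0.0
    ac = np.correlate(seg, seg, mode="full")[seg.size - 1:]
    ac = ac / ac[0]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return float(ac[lo:hi].max())

fs, x = wavfile.read("stimulus.wav")              # assumed mono file
x = x.astype(np.float64) / np.max(np.abs(x))

# Low-pass first: vowels carry most of their energy below ~1 kHz and
# consonants rarely do, so filtering sharpens the vowel "energy arc".
sos = butter(4, 1000.0, btype="low", fs=fs, output="sos")
env = running_rms(sosfiltfilt(sos, x), int(0.020 * fs))   # 20 ms window

# A vowel's end shows up as the envelope falling through a threshold;
# half the peak envelope is a crude starting point, not a recipe.
thresh = 0.5 * env.max()
falling = np.flatnonzero((env[:-1] >= thresh) & (env[1:] < thresh))

# Keep falling edges preceded by voicing (all vowels are voiced; only
# some consonants are), then read off boundary times in seconds.
boundaries = [n / fs for n in falling
              if voicing_strength(x, fs, n / fs - 0.020) > 0.4]
print("candidate syllable boundaries (s):", np.round(boundaries, 3))

On clean CVCV streams, the onset of each following syllable can then be
taken as the next envelope rise after a detected vowel end; checking the
detected count against a manual count on a few files, as Rémy did, is the
quickest sanity test.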
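On the tool route, the Praat Syllable Nuclei script (de Jong & Wempe,
2009) essentially counts intensity peaks that rise some minimum "dip"
above their surroundings and coincide with voicing. Below is a rough,
hedged approximation of that idea using the Parselmouth library (a Python
interface to Praat); min_dip_db and min_pitch are illustrative values to
tune against manual counts, not the script's own defaults, and
stimulus.wav is again a stand-in file name.

# Rough approximation of the Syllable Nuclei idea via Parselmouth:
# count intensity peaks separated by a minimum dip and voiced at the peak.
import numpy as np
import parselmouth
from parselmouth.praat import call

def nucleus_times(path, min_dip_db=2.0, min_pitch=100.0):
    snd = parselmouth.Sound(path)
    intensity = snd.to_intensity(minimum_pitch=min_pitch)
    t = intensity.xs()                # frame times (s)
    db = intensity.values[0]          # intensity contour (dB)

    # Local maxima of the intensity contour.
    peaks = np.flatnonzero((db[1:-1] > db[:-2]) & (db[1:-1] >= db[2:])) + 1

    # Accept a peak only if the contour dips at least min_dip_db since the
    # previously accepted peak, so one vowel yields one nucleus.
    pitch = snd.to_pitch(pitch_floor=min_pitch)
    accepted = []
    for p in peaks:
        if accepted:
            dip = db[accepted[-1]:p].min()
            if db[p] - dip < min_dip_db:
                continue
        f0 = call(pitch, "Get value at time", t[p], "Hertz", "linear")
        if not np.isnan(f0):          # voiced frame -> likely a vowel
            accepted.append(p)
    return [t[p] for p in accepted]

times = nucleus_times("stimulus.wav")
dur = parselmouth.Sound("stimulus.wav").get_total_duration()
print(f"{len(times)} nuclei, rate = {len(times) / dur:.2f} syll/s")

Note that nucleus peaks are timestamps of vowel centres rather than
syllable onsets, so for onset times this is best combined with the
envelope-rise logic sketched above.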

