Re: [AUDITORY] Tool for automatic syllable segmentation (Nathan Barlow)


Subject: Re: [AUDITORY] Tool for automatic syllable segmentation
From:    Nathan Barlow  <nb.audiology@xxxxxxxx>
Date:    Thu, 26 Sep 2024 07:32:03 +0100

Leo - clinically we use a 100 ms gap between phonemes, with success; it's a gold standard (see the work of the likes of S. C. Purdy, Anu Sharma, or Nina Kraus). The phonemes must be recorded with co-articulation effects (i.e. as part of a whole word) and *then* cut into syllables or phonemes; otherwise the signal's on-ramp or off-ramp is pulled away from natural speech. (Compare recording the phoneme in isolation: see Daniel Ling on articulation manner in the live-voice Ling test, the original six-phoneme test, or Jane Madell, circa 2022, on the updated Ling test with nine phonemes.)

best
Nathan

On Tue, 24 Sept 2024 at 05:14, Léo Varnet <leo.varnet@xxxxxxxx> wrote:
> Dear all,
>
> I'd like to take Alain's response as a starting point for another sub-thread of discussion.
>
> Alain, I assume that you are referring to the research on automatic phoneme classification based on temporal patterns, which typically uses a [-500 ms; +500 ms] window. I'm curious about the maximum distance a phonetic cue can be from the nucleus of the corresponding phoneme. Does anybody on the List have insights on this? In my own experiments I have observed that in some cases cues as far as 800 ms before the target sound can influence phoneme categorization -- but these were speech-rate context cues, not "directly informative" cues.
>
> Best
>
> Léo
>
> Léo Varnet - CNRS Researcher
> Laboratoire des Systèmes Perceptifs, UMR 8248
> École Normale Supérieure
> 29, rue d'Ulm - 75005 Paris Cedex 05
> Tel.: (+33)6 33 93 29 34
> https://lsp.dec.ens.fr/en/member/1066/leo-varnet
> https://dbao.leo-varnet.fr/
>
> On 21/09/2024 at 11:51, Alain de Cheveigné wrote:
> > Curiously, no one has yet pointed out that "segmentation" itself is ill-defined. Syllables, like phonemes, are defined at the phonological level, which is abstract and distinct from the acoustics.
> >
> > Acoustic cues to a phoneme (or syllable) may come from an extended interval of sound that overlaps with the cues that signal the next phoneme. I seem to recall papers by Hynek Hermansky that found a support duration on the order of 1 s. If so, the "segmentation boundary" between phonemes/syllables is necessarily ill-defined.
> >
> > In practice, it may be possible to define a segmentation that "concentrates" the information pertinent to each phoneme within a segment and minimizes "spillover" onto adjacent phonemes, but we should not be surprised if it works less well for some boundaries, or if different methods give different results.
> >
> > When listening to Japanese practice tapes, I remember noticing that the word "futatsu" (two) sounded rather like "uftatsu", suggesting that the acoustic-to-phonological mapping (based on my native phonological system) could be loose enough to allow for a swap.
> >
> > Alain
> >
> > > On 20 Sep 2024, at 11:29, Jan Schnupp <000000e042a1ec30-dmarc-request@xxxxxxxx> wrote:
> > >
> > > Dear Remy,
> > >
> > > it might be useful for us to know where your meaningless CV syllable stimuli come from.
> > > But in any event, if you are any good at coding, you are likely better off directly computing parameters of the recorded waveforms and applying criteria to those. CV syllables have an "energy arc" such that the V is invariably louder than the C.
> > > In speech there are rarely silent gaps between syllables, so you may be looking at a CVCVCVCV... stream where the only "easy" handle on the syllable boundary is likely to be the end of the vowel, which should be recognizable by a marked decline in acoustic energy. You can quantify that with some running RMS value (perhaps after low-pass filtering, given that consonants rarely have much low-frequency energy). If that's not accurate or reliable enough, then things are likely to get a lot trickier. You could look for voicing in a running autocorrelation as an additional cue, given that all vowels are voiced but only some consonants are.
> > > How many of these do you have to process? If the number isn't huge, it may be quicker to find the boundaries "by ear" than to develop a piece of computer code. The best way forward really depends enormously on the nature of your original stimulus set.
> > >
> > > Best wishes,
> > >
> > > Jan
> > >
> > > ---------------------------------------
> > > Prof Jan Schnupp
> > > Gerald Choa Neuroscience Institute
> > > The Chinese University of Hong Kong
> > > Sha Tin
> > > Hong Kong
> > > https://auditoryneuroscience.com
> > > http://jan.schnupp.net
> > >
> > > On Thu, 19 Sept 2024 at 12:19, Rémy MASSON <remy.masson@xxxxxxxx> wrote:
> > > > Hello AUDITORY list,
> > > > We are attempting to do automatic syllable segmentation on a collection of sound files that we use in an experiment. Our stimuli are rapid sequences of syllables (all beginning with a consonant and ending with a vowel), with no underlying semantic meaning and no pauses. We would like to automatically extract the syllable/speech rate and obtain the timestamps of each syllable onset.
> > > > We are a bit lost as to which tool to use. We tried Praat with the Syllable Nuclei v3 script, the software VoiceLab, and the website WebMAUS. Unfortunately, for each of them the estimated total number of syllables did not consistently match what we were able to count manually, despite adjusting the parameters.
> > > > Do you have any advice on how to go further? Do you have any experience with syllable onset extraction?
> > > > Thank you for your understanding,
> > > > Rémy MASSON
> > > > Research Engineer
> > > > Laboratory "Neural coding and neuroengineering of human speech functions" (NeuroSpeech)
> > > > Institut de l'Audition – Institut Pasteur (Paris)

-- 
Nathan Barlow
BSc, PGDip, MSc(SpchSci)(Hons), CoP, MSc(Clinical Audiology)(Soton)
www.eresope.wordpress.com
@xxxxxxxx
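A minimal sketch of the waveform-based heuristic Jan describes above, in Python with numpy/scipy. It is only an illustration, not any poster's actual method: the file name, low-pass cutoff, window length, and threshold values are assumptions that would need tuning on the actual stimuli.

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

def syllable_onsets(path, lp_cutoff=1000.0, win_s=0.02, rel_thresh=0.3):
    """Candidate syllable onsets (s) from rises in low-frequency running RMS."""
    fs, x = wavfile.read(path)          # 'path' is a hypothetical input file
    x = x.astype(np.float64)
    if x.ndim > 1:                      # mix down to mono if needed
        x = x.mean(axis=1)
    x /= np.max(np.abs(x)) + 1e-12

    # Low-pass filter: vowels carry most of the low-frequency energy,
    # consonants comparatively little.
    b, a = butter(4, lp_cutoff / (fs / 2), btype="low")
    y = filtfilt(b, a, x)

    # Running RMS in short (e.g. 20 ms) non-overlapping windows.
    win = int(win_s * fs)
    n_frames = len(y) // win
    frames = y[:n_frames * win].reshape(n_frames, win)
    rms = np.sqrt((frames ** 2).mean(axis=1))

    # Frames above a fraction of the peak RMS are treated as vowel-like;
    # each rising edge of that mask is a candidate syllable onset.
    vowel_like = rms > rel_thresh * rms.max()
    onsets = np.flatnonzero(np.diff(vowel_like.astype(int)) == 1) + 1
    return onsets * win / fs

def voicing_strength(frame, fs, f0_lo=75.0, f0_hi=400.0):
    """Crude voicing cue: normalized autocorrelation peak in the pitch range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0.0:
        return 0.0
    lo, hi = int(fs / f0_hi), min(int(fs / f0_lo), len(ac) - 1)
    return float(ac[lo:hi + 1].max() / ac[0])

# Example use (file name is made up):
# print(syllable_onsets("cvcv_stream.wav"))

For comparison, the Praat Syllable Nuclei script that Rémy mentions works from an intensity contour in a broadly similar spirit, with its silence-threshold and minimum-dip settings playing roughly the role of the thresholds above.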


This message came from the mail archive
postings/2024/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University