Re: [AUDITORY] Tool for automatic syllable segmentation (Léo Varnet)


Subject: Re: [AUDITORY] Tool for automatic syllable segmentation
From:    Léo Varnet <leo.varnet@xxxxxxxx>
Date:    Mon, 23 Sep 2024 12:43:08 +0200

Dear all,

I'd like to take Alain's response as a starting point for another sub-thread of discussion.

Alain, I assume that you are referring to the research on automatic phoneme classification based on temporal patterns, which typically uses a [-500 ms; +500 ms] window. I'm curious about the maximum distance a phonetic cue can be from the nucleus of the corresponding phoneme. Does anybody on the List have insights on this? In my own experiments I have observed that in some cases cues as far as 800 ms before the target sound can influence phoneme categorization -- but these were speech-rate context cues, not "directly informative" cues.

Best

Léo

Léo Varnet - CNRS Researcher
Laboratoire des Systèmes Perceptifs, UMR 8248
École Normale Supérieure
29, rue d'Ulm - 75005 Paris Cedex 05
Tel.: (+33)6 33 93 29 34
https://lsp.dec.ens.fr/en/member/1066/leo-varnet
https://dbao.leo-varnet.fr/

On 21/09/2024 at 11:51, Alain de Cheveigne wrote:
> Curiously, no one yet pointed out that "segmentation" itself is ill-defined. Syllables, like phonemes, are defined at the phonological level, which is abstract and distinct from the acoustics.
>
> Acoustic cues to a phoneme (or syllable) may come from an extended interval of sound that overlaps with the cues that signal the next phoneme. I seem to recall papers by Hynek Hermansky that found a support duration on the order of 1 s. If so, the "segmentation boundary" between phonemes/syllables is necessarily ill-defined.
>
> In practice, it may be possible to define a segmentation that "concentrates" information pertinent to each phoneme within a segment and minimizes "spillover" to adjacent phonemes, but we should not be surprised if it works less well for some boundaries, or if different methods give different results.
>
> When listening to Japanese practice tapes, I remember noticing that the word "futatsu" (two) sounded rather like "uftatsu", suggesting that the acoustic-to-phonological mapping (based on my native phonological system) could be loose enough to allow for a swap.
>
> Alain
>
>
>> On 20 Sep 2024, at 11:29, Jan Schnupp <000000e042a1ec30-dmarc-request@xxxxxxxx> wrote:
>>
>> Dear Remy,
>>
>> it might be useful for us to know where your meaningless CV syllable stimuli come from.
>> But in any event, if you are any good at coding you are likely better off computing parameters of the recorded waveforms directly and applying criteria to those. CV syllables have an "energy arc" such that the V is invariably louder than the C. In speech there are rarely silent gaps between syllables, so you may be looking at a CVCVCVCV... stream where the only "easy" handle on the syllable boundary is likely to be the end of the vowel, which should be recognizable by a marked decline in acoustic energy. You can quantify this with some running RMS value (perhaps after low-pass filtering, given that consonants rarely have much low-frequency energy). If that's not accurate or reliable enough, then things are likely to get a lot trickier. You could look for voicing in a running autocorrelation as an additional cue, given that all vowels are voiced but only some consonants are.
>> How many of these do you have to process? If the number isn't huge, it may be quicker to find the boundaries "by ear" than to develop a piece of computer code. The best way forward really depends enormously on the nature of your original stimulus set.
>>
>> Best wishes,
>>
>> Jan
>>
>> ---------------------------------------
>> Prof Jan Schnupp
>> Gerald Choa Neuroscience Institute
>> The Chinese University of Hong Kong
>> Sha Tin
>> Hong Kong
>>
>> https://auditoryneuroscience.com
>> http://jan.schnupp.net
>>
>>
>> On Thu, 19 Sept 2024 at 12:19, Rémy MASSON <remy.masson@xxxxxxxx> wrote:
>> Hello AUDITORY list,
>> We are attempting to do automatic syllable segmentation on a collection of sound files that we use in an experiment. Our stimuli are a rapid sequence of syllables (all beginning with a consonant and ending with a vowel) with no underlying semantic meaning and with no pauses. We would like to automatically extract the syllable/speech rate and obtain the timestamps for each syllable onset.
>> We are a bit lost on which tool to use. We tried PRAAT with the Syllable Nuclei v3 script, the software VoiceLab and the website WebMaus. Unfortunately, for each of them, the estimated total number of syllables did not consistently match what we were able to count manually, despite tweaking the parameters.
>> Do you have any advice on how to go further? Do you have any experience in syllable onset extraction?
>> Thank you for your understanding,
>> Rémy MASSON
>> Research Engineer
>> Laboratory "Neural coding and neuroengineering of human speech functions" (NeuroSpeech)
>> Institut de l'Audition – Institut Pasteur (Paris)
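
For readers who want to try the energy-based recipe Jan outlines above, here is a minimal Python sketch using numpy/scipy. The file name, filter cutoff, frame sizes and thresholds are illustrative assumptions only (not values from the thread) and will need tuning to the stimulus set.

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt, find_peaks

fs, x = wavfile.read("syllable_stream.wav")   # hypothetical input file
x = x.astype(float)
if x.ndim > 1:                                # average channels if stereo
    x = x.mean(axis=1)
x = x / np.max(np.abs(x))

# Low-pass filter: consonants carry little low-frequency energy,
# so the low-passed envelope emphasises the vowels.
sos = butter(4, 1000, btype="low", fs=fs, output="sos")
x_lp = sosfiltfilt(sos, x)

# Running RMS in 25 ms frames, hopped every 5 ms.
frame, hop = int(0.025 * fs), int(0.005 * fs)
n_frames = 1 + (len(x_lp) - frame) // hop
rms = np.array([np.sqrt(np.mean(x_lp[i * hop:i * hop + frame] ** 2))
                for i in range(n_frames)])
t = (np.arange(n_frames) * hop + frame / 2) / fs   # frame-centre times (s)

# Candidate syllable boundaries: dips (local minima) in the RMS contour.
dips, _ = find_peaks(-rms, prominence=0.05 * rms.max())
print("candidate boundaries (s):", np.round(t[dips], 3))

# Additional voicing cue: normalised autocorrelation peak in the pitch range.
def voicing_strength(seg, fs, f_lo=60.0, f_hi=400.0):
    seg = seg - seg.mean()
    ac = np.correlate(seg, seg, mode="full")[len(seg) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]
    return float(ac[int(fs / f_hi):int(fs / f_lo)].max())

voiced = np.array([voicing_strength(x[i * hop:i * hop + frame], fs) > 0.4
                   for i in range(n_frames)])

The low-pass filtering before the RMS computation follows Jan's point that consonants have little low-frequency energy, so dips in the low-passed envelope should mark vowel offsets more clearly; the autocorrelation peak over 60-400 Hz lags is a simple (assumed) voicing indicator, not a full pitch tracker.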
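
Regarding Rémy's original question, a rough intensity-peak approach in the spirit of the Syllable Nuclei script mentioned in the thread can also be prototyped with the parselmouth Praat bindings for Python. Again a sketch under assumptions: the file name, the 2 dB threshold above the median and the 100 ms minimum inter-peak distance are placeholders to calibrate against manual counts.

import numpy as np
import parselmouth                      # Praat bindings: pip install praat-parselmouth
from scipy.signal import find_peaks

snd = parselmouth.Sound("syllable_stream.wav")        # hypothetical input file
intensity = snd.to_intensity(minimum_pitch=100.0, time_step=0.01)
t = intensity.xs()                                     # frame times (s)
db = intensity.values[0]                               # intensity contour (dB)

# Peaks at least 2 dB above the median intensity, separated by >= 100 ms,
# are taken as candidate syllable nuclei (thresholds to be tuned).
min_dist = max(1, int(0.1 / (t[1] - t[0])))
peaks, _ = find_peaks(db, height=np.median(db) + 2.0, distance=min_dist)

duration = snd.get_total_duration()
print(f"{len(peaks)} nuclei, ~{len(peaks) / duration:.2f} syllables/s")
print("nucleus times (s):", np.round(t[peaks], 3))

Checking the printed nucleus count against a manual count on a few files is a quick way to set the threshold and minimum distance before batch processing.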


This message came from the mail archive
postings/2024/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University