[AUDITORY] Fwd: Tool for automatic syllable segmentation (Jonathan James Digby)


Subject: [AUDITORY] Fwd: Tool for automatic syllable segmentation
From:    Jonathan James Digby  <digbyphonic@xxxxxxxx>
Date:    Tue, 24 Sep 2024 08:59:40 +0100

Thank you very much for sharing your article, Oded.

May be off-topic, but could be of interest to the discussion:

Bloom, Jon, 'A Standard for Morse Timing Using the Farnsworth Technique' (1990) <https://www.arrl.org/files/file/Technology/x9004008.pdf> [accessed 24 September 2024]

Have a good day, all.
_________________________________
Jonathan J Digby


Sent: 22 September 2024 15:11

Humbly, here is my take on Alain's cautionary note: https://doi.org/10.3389/fpsyg.2013.00138
--
Oded.


On Sun, Sep 22, 2024 at 12:08 AM Alain de Cheveigne <alain.de.cheveigne@xxxxxxxx> wrote:

Curiously, no one has yet pointed out that "segmentation" itself is ill-defined. Syllables, like phonemes, are defined at the phonological level, which is abstract and distinct from the acoustics.

Acoustic cues to a phoneme (or syllable) may come from an extended interval of sound that overlaps with the cues signalling the next phoneme. I seem to recall papers by Hynek Hermansky that found a support duration on the order of 1 s. If so, the "segmentation boundary" between phonemes/syllables is necessarily ill-defined.

In practice, it may be possible to define a segmentation that "concentrates" information pertinent to each phoneme within a segment and minimizes "spillover" to adjacent phonemes, but we should not be surprised if it works less well for some boundaries, or if different methods give different results.

When listening to Japanese practice tapes, I remember noticing that the word "futatsu" (two) sounded rather like "uftatsu", suggesting that the acoustic-to-phonological mapping (based on my native phonological system) could be loose enough to allow for a swap.

Alain

> On 20 Sep 2024, at 11:29, Jan Schnupp <000000e042a1ec30-dmarc-request@xxxxxxxx> wrote:
>
> Dear Remy,
>
> It might be useful for us to know where your meaningless CV syllable stimuli come from.
> But in any event, if you are any good at coding, you are likely better off computing parameters of the recording waveforms directly and applying criteria to those. CV syllables have an "energy arc" such that the V is invariably louder than the C. In speech there are rarely silent gaps between syllables, so you may be looking at a CVCVCVCV... stream where the only "easy" handle on the syllable boundary is likely to be the end of the vowel, which should be recognizable by a marked decline in acoustic energy; this you can quantify with a running RMS value (perhaps after low-pass filtering, given that consonants rarely have much low-frequency energy). If that's not accurate or reliable enough, things are likely to get a lot trickier. You could look for voicing in a running autocorrelation as an additional cue, given that all vowels are voiced but only some consonants are.
> How many of these do you have to process? If the number isn't huge, it may be quicker to find the boundaries "by ear" than to develop a piece of computer code. The best way forward really depends enormously on the nature of your original stimulus set.
>
> Best wishes,
>
> Jan
>
> ---------------------------------------
> Prof Jan Schnupp
> Gerald Choa Neuroscience Institute
> The Chinese University of Hong Kong
> Sha Tin
> Hong Kong
>
> https://auditoryneuroscience.com
> http://jan.schnupp.net
>
>
> On Thu, 19 Sept 2024 at 12:19, Rémy MASSON <remy.masson@xxxxxxxx> wrote:
> Hello AUDITORY list,
> We are attempting to do automatic syllable segmentation on a collection of sound files that we use in an experiment. Our stimuli are a rapid sequence of syllables (all beginning with a consonant and ending with a vowel) with no underlying semantic meaning and with no pauses. We would like to automatically extract the syllable/speech rate and obtain the timestamps for each syllable onset.
> We are a bit lost on which tool to use. We tried PRAAT with the Syllable Nuclei v3 script, the software VoiceLab and the website WebMaus. Unfortunately, for each of them the estimate of the total number of syllables did not consistently match what we were able to count manually, despite tweaking the parameters.
> Do you have any advice on how to go further? Do you have any experience in syllable onset extraction?
> Thank you for your understanding,
> Rémy MASSON
> Research Engineer
> Laboratory "Neural coding and neuroengineering of human speech functions" (NeuroSpeech)
> Institut de l'Audition – Institut Pasteur (Paris)
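A minimal sketch of the running-RMS approach Jan outlines above, in Python with numpy/scipy: low-pass filter the waveform, track a short-window RMS envelope, and treat prominent dips in that envelope as candidate syllable boundaries, with a crude autocorrelation check for voicing as the additional cue he mentions. The cutoff frequency, window lengths, and dip-picking thresholds are illustrative guesses that would need tuning on the actual stimulus set, and the file name in the usage comment is hypothetical.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfiltfilt, find_peaks

    def syllable_boundaries(path, lowpass_hz=1000.0, win_s=0.025, hop_s=0.005):
        """Estimate syllable boundary times (s) and syllable rate (syll/s)."""
        sr, x = wavfile.read(path)
        x = x.astype(np.float64)
        if x.ndim > 1:                       # mix down to mono if needed
            x = x.mean(axis=1)
        x /= np.max(np.abs(x)) + 1e-12       # normalise amplitude

        # Low-pass filter: vowels carry most of the low-frequency energy,
        # so the envelope dips more clearly at consonants. Zero-phase
        # filtering keeps the dip timing aligned with the waveform.
        sos = butter(4, lowpass_hz, btype="low", fs=sr, output="sos")
        y = sosfiltfilt(sos, x)

        # Running RMS over short, overlapping windows.
        win, hop = int(win_s * sr), int(hop_s * sr)
        frames = np.lib.stride_tricks.sliding_window_view(y, win)[::hop]
        rms = np.sqrt((frames ** 2).mean(axis=1))
        t = (np.arange(len(rms)) * hop + win / 2) / sr   # frame-centre times

        # Energy dips between vowels ~ syllable boundaries: pick local
        # minima by peak-picking the negated RMS curve (placeholder
        # prominence and minimum spacing of 80 ms).
        dips, _ = find_peaks(-rms, prominence=0.05,
                             distance=max(1, int(0.08 / hop_s)))
        boundaries = t[dips]

        n_syllables = len(boundaries) + 1
        rate = n_syllables / (len(x) / sr)
        return boundaries, rate

    def is_voiced(frame, sr, f0_min=75.0, f0_max=400.0, thresh=0.3):
        """Crude voicing check on a short frame (>= ~30 ms): a strong
        normalised autocorrelation peak in the pitch-lag range."""
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        ac /= ac[0] + 1e-12
        lo, hi = int(sr / f0_max), int(sr / f0_min)
        return ac[lo:hi].max() > thresh

    # Example use (hypothetical file name):
    # onsets, rate = syllable_boundaries("stimulus_01.wav")
    # print(len(onsets) + 1, "syllables,", round(rate, 2), "syll/s")

As Jan notes, whether something this simple is accurate enough depends entirely on the stimulus set; the voicing check can be applied to frames around each candidate dip to reject spurious boundaries inside vowels.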

