Re: [AUDITORY] Tool for automatic syllable segmentation (Alain de Cheveigne)


Subject: Re: [AUDITORY] Tool for automatic syllable segmentation
From:    Alain de Cheveigne  <alain.de.cheveigne@xxxxxxxx>
Date:    Sat, 21 Sep 2024 11:51:16 +0200

Curiously, no one has yet pointed out that "segmentation" itself is ill-defined. Syllables, like phonemes, are defined at the phonological level, which is abstract and distinct from the acoustics.

Acoustic cues to a phoneme (or syllable) may come from an extended interval of sound that overlaps with the cues that signal the next phoneme. I seem to recall papers by Hynek Hermansky that found a support duration on the order of 1 s. If so, the "segmentation boundary" between phonemes/syllables is necessarily ill-defined.

In practice, it may be possible to define a segmentation that "concentrates" information pertinent to each phoneme within a segment and minimizes "spillover" to adjacent phonemes, but we should not be surprised if it works less well for some boundaries, or if different methods give different results.

When listening to Japanese practice tapes, I remember noticing that the word "futatsu" (two) sounded rather like "uftatsu", suggesting that the acoustic-to-phonological mapping (based on my native phonological system) could be loose enough to allow for a swap.

Alain

> On 20 Sep 2024, at 11:29, Jan Schnupp <000000e042a1ec30-dmarc-request@xxxxxxxx> wrote:
>
> Dear Rémy,
>
> it might be useful for us to know where your meaningless CV syllable stimuli come from.
> But in any event, if you are any good at coding you are likely better off directly computing parameters of the recorded waveforms and applying criteria to those. CV syllables have an "energy arc" such that the V is invariably louder than the C. In speech there are rarely silent gaps between syllables, so you may be looking at a CVCVCVCV... stream where the only "easy" handle on the syllable boundary is likely to be the end of the vowel, which should be recognizable by a marked decline in acoustic energy, which you can quantify by some running RMS value (perhaps after low-pass filtering, given that consonants rarely have much low-frequency energy). If that's not accurate or reliable enough, then things are likely to get a lot trickier. You could look for voicing in a running autocorrelation as an additional cue, given that all vowels are voiced but only some consonants are.
> How many of these do you have to process? If the number isn't huge, it may be quicker to find the boundaries "by ear" than to develop a piece of computer code. The best way forward really depends enormously on the nature of your original stimulus set.
>
> Best wishes,
>
> Jan
>
> ---------------------------------------
> Prof Jan Schnupp
> Gerald Choa Neuroscience Institute
> The Chinese University of Hong Kong
> Sha Tin
> Hong Kong
>
> https://auditoryneuroscience.com
> http://jan.schnupp.net
>
>
> On Thu, 19 Sept 2024 at 12:19, Rémy MASSON <remy.masson@xxxxxxxx> wrote:
> Hello AUDITORY list,
> We are attempting to do automatic syllable segmentation on a collection of sound files that we use in an experiment. Our stimuli are a rapid sequence of syllables (all beginning with a consonant and ending with a vowel) with no underlying semantic meaning and with no pauses. We would like to automatically extract the syllable/speech rate and obtain the timestamps for each syllable onset.
> We are a bit lost on which tool to use. We tried PRAAT with the Syllable Nuclei v3 script, the software VoiceLab and the website WebMaus.
> Unfortunately, for each of them the estimated total number of syllables did not consistently match what we were able to count manually, despite adjusting the parameters.
> Do you have any advice on how to go further? Do you have any experience in syllable onset extraction?
> Thank you for your understanding,
> Rémy MASSON
> Research Engineer
> Laboratory "Neural coding and neuroengineering of human speech functions" (NeuroSpeech)
> Institut de l'Audition – Institut Pasteur (Paris)
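
For illustration, a minimal sketch of the running-RMS idea Jan describes, written here in Python with NumPy/SciPy; the input file name, low-pass cutoff, frame sizes and threshold are placeholders (not from the original messages) and would need tuning on the actual stimuli:

    # Sketch: low-pass the waveform so vowel energy dominates, track
    # short-term RMS, and mark upward threshold crossings as candidate
    # syllable (vowel) onsets.  Assumes a mono WAV file.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfiltfilt

    rate, x = wavfile.read("stimulus.wav")      # hypothetical input file
    x = x.astype(np.float64)
    x /= np.max(np.abs(x)) + 1e-12              # normalise

    # Low-pass around 1 kHz; consonants rarely have much low-frequency energy.
    sos = butter(4, 1000, btype="low", fs=rate, output="sos")
    x_lp = sosfiltfilt(sos, x)

    # Running RMS in ~25 ms frames, hopped every ~10 ms.
    frame = int(0.025 * rate)
    hop = int(0.010 * rate)
    rms = np.array([
        np.sqrt(np.mean(x_lp[i:i + frame] ** 2))
        for i in range(0, len(x_lp) - frame, hop)
    ])

    # Candidate onsets: frames where RMS rises above a fraction of its peak.
    threshold = 0.3 * rms.max()                 # arbitrary starting point
    above = rms > threshold
    onsets = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    print(onsets * hop / rate)                  # rough syllable onset times (s)

A voicing check based on a running autocorrelation, as Jan also suggests, could then be used to reject spurious onsets caused by loud unvoiced consonants.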


This message came from the mail archive
postings/2024/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University