
Re: [AUDITORY] Tool for automatic syllable segmentation



I can't advise on current software, but I can suggest things worth considering in corpus design; if you can provide the community with this information, it may help them give you better answers.
The distinctions I think matter fall into two types of information: the phonetic composition of the syllables, and the type of speech.
The language/dialect also matters; where it does, I've assumed French.
I've also assumed little knowledge of phonetics and acoustic-phonetics; forgive me if I'm wrong.

Phonetic composition:  What do you mean by 'consonant'?
                Consonants cover four main classes of articulation type (plus several other types):
   stops (e.g. p t k;  b d g)
   fricatives (e.g. f s ʃ x χ;  v z ʒ ɣ ʁ)
   nasals (e.g. m n)
   approximants (e.g. w j l, and some types of r)

Many people tend to think only of stops and maybe fricatives, and neglect the others, but functionally they are all consonants, and segmenting them involves very different issues. So it's helpful to give a bit more information.
 If you ONLY mean stops and fricatives, then a single word to use is 'obstruent'.
 Obstruents are usually the easiest to segment.

Next, each member of each class may be so-called voiced or voiceless.
In the above lists of obstruents, the first items in the parentheses are classed as voiceless and the second set as voiced.
The nasals and approximants I've listed above are all classed as voiced, though voiceless ones do exist.

Importantly, how voiced and voiceless consonants are pronounced in any given language can differ greatly, and involves a great deal more than whether or not the larynx is vibrating. That is, used as above, the terms voiced and voiceless are phonological: general descriptive labels with no necessary basis in acoustic reality for a particular language or speaking style.
Example: French voiced obstruents typically contain significant vocal fold vibration, whereas British English voiced obstruents can include it, but often contain very little or none at all.

Phonation is the term indicating existence of vocal fold vibration.
     So French voiced obstruents are typically phonated; English ones often are not, or are not phonated throughout their duration.
Vowels are normally phonated (voiced) but not always; whisper is the common word for speech whose vowels are all unphonated.

Syllables that alternate voicelessness and voicing (non-phonation and phonation) tend to be easiest to segment: you seek silence or aperiodic noise, vs. periodicity (or an f0). Examples: /ki, sa/ (as in French qui, comme ça).
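To make that concrete, here is a minimal sketch (Python with NumPy; the frame sizes, f0 search range, and thresholds are illustrative assumptions, not tuned values) of how one might label frames as phonated vs. unphonated with a crude autocorrelation periodicity test, then treat voiceless-to-voiced transitions as candidate CV syllable onsets:

```python
import numpy as np

def voiced_frames(signal, sr, frame_ms=25, hop_ms=10,
                  energy_thresh=0.01, periodicity_thresh=0.4):
    """Label each analysis frame voiced (True) or voiceless (False),
    using frame energy plus the peak of the normalised autocorrelation
    in a plausible f0 lag range -- a crude periodicity test."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    labels = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame].astype(float)
        x = x - x.mean()
        energy = np.sqrt(np.mean(x ** 2))
        if energy < energy_thresh:
            labels.append(False)          # silence: treat as voiceless
            continue
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        ac = ac / (ac[0] + 1e-12)         # normalise so lag 0 == 1
        lo, hi = int(sr / 400), int(sr / 75)  # search f0 lags ~75-400 Hz
        labels.append(ac[lo:hi].max() > periodicity_thresh)
    return labels

def syllable_onsets(labels, hop_ms=10):
    """A voiceless-to-voiced transition approximates a CV syllable onset
    (valid only when the C is a voiceless obstruent)."""
    return [i * hop_ms / 1000.0
            for i in range(1, len(labels))
            if labels[i] and not labels[i - 1]]
```

This kind of rule works precisely because, in /ki, sa/-type syllables, the consonant contributes silence or aperiodic noise and the vowel contributes periodicity; once the consonant is itself phonated, the transition disappears and the rule fails.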

Type of speech:  How natural is your speech data, and how much variation is there in it?
         Is it synthetic or spoken by a human or humans?    If human, were the speakers trained professional speakers or untrained?
         Were their instructions clear, or were they just asked to speak fast? (E.g. were care or clarity mentioned in the instructions?)
         How was the speed achieved? E.g. by speeding up careful speech algorithmically, or by making it fast in the first place?

For your CV syllables, slow, clear speech containing only voiceless obstruents and (phonated) vowels, with little or no variation, should be easy to segment both automatically and by minimally-trained humans.
The same material containing voiced obstruents (in French), or any nasals and approximants, will likely be segmented less accurately than the human ear can manage.

But when spoken fast by untrained speakers with no instruction to maintain clarity, even voiceless obstruents are likely to be hard to segment automatically, for two main reasons. First, the clear alternation between phonation and silence/frication (aperiodic noise) may be lost. Second, clear phonetic distinctions may be lost because pronunciation changes: stops become more like fricatives, and both stops and fricatives may become more like approximants, which are basically just rather short vowels.

Finally, though I have limited experience of segmenting French, if your algorithms include relative amplitude as a criterion, then syllables like /di, vi, gu, vu/ (French dit, vi(sa), goût, vous) may be mis-segmented, because the amplitude of the periodicity may be greater in the consonant than in the vowel. I have seen this in some French, but I don't know if it's the norm. I'd be delighted if my French phonetician colleagues could tell us!

In summary: you're likely to be best off if you can use syllables that are easy to segment. But if your spoken syllables are inherently hard to segment, for whatever reason, then you will need a more sophisticated algorithm that might use, for example, spectral shape as well as relative amplitude and/or existence of an f0. Or an algorithm that you can train yourself.
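For the more sophisticated case, here is a sketch (again illustrative, Python with NumPy) of the kind of per-frame features such an algorithm might combine: relative amplitude (RMS), zero-crossing rate as a cheap stand-in for frication, and spectral centroid as a crude summary of spectral shape. A classifier trained on hand-segmented examples could then learn boundaries from these, rather than relying on fixed thresholds.

```python
import numpy as np

def frame_features(signal, sr, frame_ms=25, hop_ms=10):
    """Per-frame features a trainable segmenter could combine:
    RMS amplitude, zero-crossing rate (high for frication/noise),
    and spectral centroid (a crude 'spectral shape' summary in Hz)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    freqs = np.fft.rfftfreq(frame, 1 / sr)
    window = np.hanning(frame)
    feats = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame] * window
        rms = np.sqrt(np.mean(x ** 2))
        # each sign change across adjacent samples counts as one crossing
        zcr = np.mean(np.abs(np.diff(np.sign(x)))) / 2
        mag = np.abs(np.fft.rfft(x))
        centroid = (freqs * mag).sum() / (mag.sum() + 1e-12)
        feats.append((rms, zcr, centroid))
    return np.array(feats)
```

On such features, a periodic vowel shows low zero-crossing rate and low centroid, while frication shows the opposite, which is exactly the extra information relative amplitude alone cannot give you.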

Hope this helps, and apologies if it doesn't!
Sarah Hawkins

On 2024-09-18 16:52, Rémy MASSON wrote:

Hello AUDITORY list,

 

We are attempting to do automatic syllable segmentation on a collection of sound files that we use in an experiment. Our stimuli are a rapid sequence of syllables (all beginning with a consonant and ending with a vowel) with no underlying semantic meaning and with no pauses. We would like to automatically extract the syllable/speech rate and obtain the timestamps for each syllable onset.

 

We are a bit lost on which tool to use. We tried PRAAT with the Syllable Nuclei v3 script, the software VoiceLab, and the website WebMaus. Unfortunately, for each of them the estimated total number of syllables did not consistently match what we were able to count manually, despite adjusting the parameters.

 

Do you have any advice on how to go further? Do you have any experience in syllable onset extraction?

 

Thank you for your understanding,

 

Rémy MASSON

Research Engineer

Laboratory "Neural coding and neuroengineering of human speech functions" (NeuroSpeech)

Institut de l’Audition – Institut Pasteur (Paris)

Accueil | Institut de l'audition