Re: [AUDITORY] Tool for automatic syllable segmentation



Dear all,

I'd like to take Alain's response as a starting point for another sub-thread of discussion. 

Alain, I assume that you are referring to the research on automatic phoneme classification based on temporal patterns, which typically uses a [-500 ms; +500 ms] window. I'm curious about the maximum distance a phonetic cue can be from the nucleus of the corresponding phoneme. Does anybody on the List have insights on this? In my own experiments I have observed that in some cases cues as far as 800 ms before the target sound can influence phoneme categorization -- but these were speech rate context cues, not "directly informative" cues.

Best

Léo


Léo Varnet - CNRS Researcher
Laboratoire des Systèmes Perceptifs, UMR 8248
École Normale Supérieure
29, rue d'Ulm - 75005 Paris Cedex 05
Tél. : (+33)6 33 93 29 34
https://lsp.dec.ens.fr/en/member/1066/leo-varnet
https://dbao.leo-varnet.fr/

On 21/09/2024 at 11:51, Alain de Cheveigne wrote:
Curiously, no one has yet pointed out that "segmentation" itself is ill-defined. Syllables, like phonemes, are defined at the phonological level, which is abstract and distinct from the acoustics.

Acoustic cues to a phoneme (or syllable) may come from an extended interval of sound that overlaps with the cues that signal the next phoneme. I seem to recall papers by Hynek Hermansky that found a support duration on the order of 1s. If so, the "segmentation boundary" between phonemes/syllables is necessarily ill-defined. 

In practice, it may be possible to define a segmentation that "concentrates" information pertinent to each phoneme within a segment and minimizes "spillover" to adjacent phonemes, but we should not be surprised if it works less well for some boundaries, or if different methods give different results.

When listening to Japanese practice tapes, I remember noticing that the word "futatsu" (two) sounded rather like "uftatsu", suggesting that the acoustic-to-phonological mapping (based on my native phonological system) could be loose enough to allow for a swap. 

Alain 


On 20 Sep 2024, at 11:29, Jan Schnupp <000000e042a1ec30-dmarc-request@xxxxxxxxxxxxxxx> wrote:

Dear Remy,

It might be useful for us to know where your meaningless CV syllable stimuli come from.
But in any event, if you are any good at coding, you are likely better off computing parameters of the recording waveforms directly and applying criteria to those. CV syllables have an "energy arc" such that the V is invariably louder than the C. In speech there are rarely silent gaps between syllables, so you may be looking at a CVCVCVCV... stream where the only "easy" handle on the syllable boundary is likely to be the end of the vowel, which should be recognizable by a marked decline in acoustic energy. You can quantify that with a running RMS value (perhaps after low-pass filtering, given that consonants rarely have much low-frequency energy). If that's not accurate or reliable enough, then things are likely to get a lot trickier. You could look for voicing in a running autocorrelation as an additional cue, given that all vowels are voiced but only some consonants are.
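A minimal sketch of this energy-based approach in Python is below (scipy and numpy assumed available; the file name "cvcv.wav", the 1 kHz low-pass cutoff, the 20 ms / 5 ms analysis windows and the 0.3 relative threshold are illustrative assumptions, not values from the message above):

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def syllable_onsets(path, lp_cutoff=1000.0, win_s=0.02, hop_s=0.005, rel_thresh=0.3):
    # Read and normalise the waveform (mix down to mono if needed).
    fs, x = wavfile.read(path)
    x = x.astype(np.float64)
    if x.ndim > 1:
        x = x.mean(axis=1)
    x /= np.max(np.abs(x)) + 1e-12

    # Low-pass filter: vowels carry most of the low-frequency energy,
    # consonants comparatively little.
    sos = butter(4, lp_cutoff, btype="low", fs=fs, output="sos")
    y = sosfiltfilt(sos, x)

    # Running RMS over short windows.
    win, hop = int(win_s * fs), int(hop_s * fs)
    frames = np.lib.stride_tricks.sliding_window_view(y, win)[::hop]
    rms = np.sqrt((frames ** 2).mean(axis=1))
    t = (np.arange(len(rms)) * hop + win / 2) / fs

    # Treat each upward crossing of a relative energy threshold (the rise
    # out of the low-energy consonant into the vowel) as a syllable onset.
    above = rms > rel_thresh * rms.max()
    return t[1:][above[1:] & ~above[:-1]]

onsets = syllable_onsets("cvcv.wav")   # hypothetical file name
print(len(onsets), "onsets at", np.round(onsets, 3), "s")

The voicing cue mentioned above could be layered on top by computing a short-time autocorrelation around each candidate onset and accepting it only if a clear periodicity peak falls within the typical voice pitch range.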
How many of these do you have to process? If the number isn't huge, it may be quicker to find the boundaries "by ear" than to develop a piece of computer code. The best way forward really depends enormously on the nature of your original stimulus set.

Best wishes,

Jan

---------------------------------------
Prof Jan Schnupp
Gerald Choa Neuroscience Institute
The Chinese University of Hong Kong
Sha Tin
Hong Kong

https://auditoryneuroscience.com
http://jan.schnupp.net


On Thu, 19 Sept 2024 at 12:19, Rémy MASSON <remy.masson@xxxxxxxxxx> wrote:
Hello AUDITORY list,
We are attempting to do automatic syllable segmentation on a collection of sound files that we use in an experiment. Our stimuli are rapid sequences of syllables (all beginning with a consonant and ending with a vowel) with no underlying semantic meaning and with no pauses. We would like to automatically extract the syllable/speech rate and obtain the timestamps for each syllable onset.
We are a bit lost on which tool to use. We tried Praat with the Syllable Nuclei v3 script, the software VoiceLab, and the website WebMaus. Unfortunately, for each of them the estimated total number of syllables did not consistently match what we were able to count manually, despite adjusting the parameters.
 Do you have any advice on how to go further? Do you have any experience in syllable onset extraction?
 Thank you for your understanding,
 Rémy MASSON
Research Engineer
Laboratory "Neural coding and neuroengineering of human speech functions" (NeuroSpeech)
Institut de l’Audition – Institut Pasteur (Paris)