voiced/unvoiced detection
Dear List,
Pierre Divenyi's posting that cited de Boer's ideas on voice pitch
mentioned some key ideas like: "de Boer did not base his model on
autocorrelation" and, "stochastic process in which various alternative
(instantaneous) pitches may coexist." According to my non-conformist
interpretation, these ideas fit in with a "granular" approach to acoustic
processing. Granularity allows a better statistical interpretation for
addressing the voiced/unvoiced problem. The problem is to find the best
method for gathering those statistics, and classic autocorrelation is not it.
In a 1995 publication ("A model of auditory perception" in Control and
Dynamic Systems, Vol 69, Multidimensional Systems: Signal Processing and
Modeling Techniques, Ed. by C.T. Leondes, Academic Press) I described a
periodicity detector that was able to extract pitch and make
voiced/unvoiced decisions. It operates by _assuming_ that all sounds are
composed of multiple stochastic, instantaneous elementary periodicities. An
elementary periodicity consists of three similar events spaced by two equal
intervals. The range of assumed periodicities covers seven octaves. Each
periodicity is recognized as an independent event and collected in a
running histogram. If there is a continuous sequence of similar
periodicities, such as a voice or a musical tone, it can be called a
"pitch." If a sound is purely random the histogram of periodicities will
show a flat distribution over the histogram spectrum. Band-limited
randomness in whispered speech forms clusters denoting formant
periodicities. Thus, with or without a periodic glottal vibration, the
formant periodicities may be identified. Examples of voiced and whispered
speech are shown in the aforementioned chapter.
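To make the idea concrete, here is a minimal sketch in Python of how such a
granular periodicity histogram might be collected. The choice of "events"
(positive waveform peaks), the tolerance on interval equality, the
seven-octave frequency endpoints, and the 12-bins-per-octave resolution are
all my assumptions for illustration, not details taken from the chapter.

    import numpy as np

    FS = 16000                        # sample rate (Hz); assumed
    F_MIN, F_MAX = 40.0, 40.0 * 2**7  # seven-octave periodicity range; assumed endpoints
    N_BINS = 7 * 12                   # log-spaced period bins, 12 per octave; assumed

    def detect_events(x):
        """Rough stand-in for the 'events': positive local maxima of the waveform."""
        idx = np.where((x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:]) & (x[1:-1] > 0))[0] + 1
        return idx

    def elementary_periodicities(event_idx, tol=0.1):
        """Periods (in samples) from triples of consecutive events whose two
        spacings are nearly equal -- the 'three similar events spaced by two
        equal intervals' of the post."""
        periods = []
        for a, b, c in zip(event_idx[:-2], event_idx[1:-1], event_idx[2:]):
            d1, d2 = b - a, c - b
            if d1 > 0 and abs(d1 - d2) <= tol * max(d1, d2):
                periods.append(0.5 * (d1 + d2))
        return np.array(periods, dtype=float)

    def update_histogram(hist, periods, decay=0.99):
        """Running (leaky) histogram over log-spaced period bins covering seven octaves."""
        hist *= decay                              # gradually forget old evidence
        freqs = FS / periods
        valid = (freqs >= F_MIN) & (freqs <= F_MAX)
        bins = np.floor(N_BINS * np.log2(freqs[valid] / F_MIN) / 7.0).astype(int)
        np.add.at(hist, np.clip(bins, 0, N_BINS - 1), 1.0)
        return hist

A frame of samples would then be processed with something like
hist = update_histogram(np.zeros(N_BINS), elementary_periodicities(detect_events(frame))),
with clusters in hist indicating pitch or formant periodicities.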
This scheme gets voiced/unvoiced decisions along with the phonetic
segmentation that I also described. These phonetic segments provide a
meaningful, non-arbitrary interval for collecting statistics of the
periodic events. I first tried getting the V/UV decision by analyzing
periodicity clusters in the histogram distribution. The histogram thus
becomes the statistical equivalent of the correlator: the
variance of the periodicity distribution can indicate the degree of
randomness.
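A sketch of that variance criterion, continuing the assumptions above (the
threshold is an arbitrary placeholder; as noted next, the post reports that
this first criterion proved unreliable in practice):

    def histogram_variance_vuv(hist, var_threshold=100.0):
        """Decide voiced/unvoiced from the spread of the periodicity histogram:
        a narrow cluster (small variance) is taken as voiced, a broad, flat
        distribution as unvoiced. var_threshold (in bins^2) is a placeholder."""
        total = hist.sum()
        if total <= 0:
            return "unvoiced"
        p = hist / total                      # normalize to a probability distribution
        bins = np.arange(hist.size)
        mean = (p * bins).sum()
        var = (p * (bins - mean) ** 2).sum()  # spread of periodicities across the bins
        return "voiced" if var < var_threshold else "unvoiced"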
However, I found experimentally that this method was not reliable.
Instead, it is better to use the history of periodic sequences to predict
future periodic events. The method I chose compares the number of
successful predictions against the average number of hits. For speech,
predictions of voicing are limited to the normal pitch range. Experimental
results have been consistent over a variety of utterances. Fricatives,
plosives, and whispers are labeled as unvoiced, except for the whispered
vowel /oo/, which has a formant in the pitch range.
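For illustration, here is one way the prediction test described above might
look, again in Python. The pitch-range endpoints, analysis window, prediction
tolerance, and margin over the chance hit rate are all illustrative values of
mine, not figures from the chapter.

    PITCH_LO, PITCH_HI = 70.0, 400.0   # "normal pitch range" in Hz; assumed endpoints

    def prediction_vuv(event_times, window=0.04, tol=0.001, margin=2.0):
        """Use recent periodic intervals to predict where upcoming events should
        fall, count the predictions confirmed by actual events, and compare that
        count with the number of hits expected by chance."""
        events = np.asarray(event_times, dtype=float)
        if events.size < 3:
            return "unvoiced"
        recent = events[events >= events[-1] - window]   # events in the analysis window
        intervals = np.diff(recent)
        # keep only intervals consistent with the normal pitch range
        periodic = intervals[(intervals >= 1.0 / PITCH_HI) & (intervals <= 1.0 / PITCH_LO)]
        if periodic.size == 0:
            return "unvoiced"
        period = np.median(periodic)
        # predict one period ahead of each event and check whether an event is there
        hits = sum(np.any(np.abs(events - (t + period)) < tol) for t in recent[:-1])
        # chance level: expected confirmations if events were scattered at random
        density = recent.size / window
        chance = density * 2.0 * tol * (recent.size - 1)
        return "voiced" if hits > margin * max(chance, 1e-6) else "unvoiced"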
In discussions on this granular approach, the main concern is its
apparent incompatibility with current models of cochlear function.
However, in view of the List's current quandary on unvoiced speech, the
results I have obtained suggest that a granular approach to auditory
modeling might be in order. In any case, this method may be useful in
speech recognition.
John Bates
Time/Space Systems
79 Sarles Lane, Pleasantville, NY 10570
(914)-747-3143, jkbates@ieee.org