
Re: voiced/unvoiced detection



Alain de Cheveigne' wrote:
>
> For whispered speech, one should probably distinguish the issues of
> transmitting segmental information ("phoneme" identity, etc.), and
> intonation.  To the extent that segmental information is carried by
> spectral shape,

This is clearly NOT the case. If it were, how would you ever hear out
one speaker from a second, or male from female? For more on this, see:

   author = {Allen, J. B.},
   title = {How do humans process and recognize speech?},
   journal = {IEEE Trans. on Speech and Audio Proc.},
   volume = {2},
   number = {4},
   pages = {567-577},
   month = oct,
   year = 1994

as well as Summerfield's speech AI (articulation index) work. (Does
somebody have the exact reference, please?)

> it is coded equally well if the excitation is noise-like.

The spectrum will not be the same for voiced and whispered speech unless
the source point is exactly at the same point, and the source impedance
is the same. I doubt that either condition is true. In fact, I expect
we don't really know much about this. Does anybody know of any measurements
of the spectrum of whispered speech relative to voiced speech?
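
The quoted claim rests on an idealised source-filter picture in which only
the excitation changes between voicing and whispering. A minimal sketch of
that idealisation, under stated assumptions: a single two-pole resonator
stands in for one formant, a 100 Hz pulse train stands in for voicing, and
white noise stands in for whisper. All names and numbers here (FS, F0,
FORMANT_HZ, BANDWIDTH_HZ, the band edges) are illustrative choices, not
measurements, and the sketch deliberately leaves out the differences in
source location and source impedance raised above.

```python
import math
import random

FS = 8000            # sample rate in Hz (assumed)
N = 4000             # 0.5 s of signal
F0 = 100             # pulse rate of the "voiced" source in Hz (assumed)
FORMANT_HZ = 500.0   # centre frequency of the single toy formant (assumed)
BANDWIDTH_HZ = 80.0  # formant bandwidth in Hz (assumed)

def resonator(x, freq, bandwidth, fs=FS):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        y0 = s + a1 * y1 + a2 * y2
        y.append(y0)
        y2, y1 = y1, y0
    return y

def band_power(x, f_lo, f_hi, fs=FS, step=25.0):
    """Mean squared DFT magnitude on a coarse frequency grid in [f_lo, f_hi]."""
    total, count = 0.0, 0
    f = f_lo
    while f <= f_hi:
        w = 2.0 * math.pi * f / fs
        re = sum(s * math.cos(w * n) for n, s in enumerate(x))
        im = sum(s * math.sin(w * n) for n, s in enumerate(x))
        total += re * re + im * im
        count += 1
        f += step
    return total / count

def envelope_ratio(source):
    """Power near the formant divided by power far above it, after filtering."""
    y = resonator(source, FORMANT_HZ, BANDWIDTH_HZ)
    return band_power(y, 400.0, 600.0) / band_power(y, 2000.0, 2200.0)

# "Voiced" source: one unit pulse every FS // F0 samples.
voiced_src = [1.0 if n % (FS // F0) == 0 else 0.0 for n in range(N)]

# "Whispered" source: white Gaussian noise, seeded for repeatability.
rng = random.Random(0)
whisper_src = [rng.gauss(0.0, 1.0) for _ in range(N)]

if __name__ == "__main__":
    print("voiced  near/far power ratio:", envelope_ratio(voiced_src))
    print("whisper near/far power ratio:", envelope_ratio(whisper_src))
```

In this toy model both excitations show the same broad peak at the
formant, because the filter is held fixed by construction. Whether real
whispered speech behaves that way is exactly the open measurement question.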

> A speech recognizer trained on voiced speech should work on whispered
> speech.

I strongly suspect that modern hidden Markov
(http://www-history.mcs.st-and.ac.uk:/history//Mathematicians/Markov.html)
model (HMM) automatic speech recognition (ASR) software would !massively!
fail with whispered speech as an input.  Has anybody ever tried it?

> In principle.  In practice there are issues such as the different
> spectral slopes of voiced and whispered excitation, and the fact that
> speakers might not articulate the same when they whisper as when they use
> voice.
>
> Intonation is another problem, as it is usually thought of as being coded
> by F0 which is absent in whispered speech.  I think it has been suggested
> that F1 might be used in the place of F0 (how to reconcile this role with
> that of coding segmental information is another mystery).  Other parameters
> are timing and intensity.  Introspection tells me that whispered
> articulation is more marked than voiced articulation, something akin to a
> sort of "Lombard effect".  It may be a mistake to equate "whispered speech"
> with "voiced speech minus the F0".

Based on the results of Quentin Summerfield (and colleagues), you can only
separate two simultaneous speakers (get a good AI score) if their f0's differ.
How do you reconcile this observation with whispered speech, where f0 is
absent?
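
The premise that f0 is absent in whispered speech is what a simple
voiced/unvoiced detector checks for. A minimal sketch, assuming a
normalised-autocorrelation test on one 40 ms frame: the frame is called
voiced if some lag in the 60-400 Hz pitch range gives a strong peak. The
function names, the search range, and the 0.5 threshold are illustrative
assumptions, and this is not the method used in the Summerfield studies.

```python
import math
import random

FS = 8000     # sample rate in Hz (assumed)
FRAME = 320   # 40 ms analysis frame

def periodicity(x, f_min=60.0, f_max=400.0, fs=FS):
    """Return (peak normalised autocorrelation, implied f0 in Hz)."""
    mean = sum(x) / len(x)
    x = [s - mean for s in x]
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return 0.0, 0.0
    best_r, best_lag = 0.0, 0
    # Search only lags that correspond to plausible pitch periods.
    for lag in range(int(fs / f_max), int(fs / f_min) + 1):
        r = sum(x[n] * x[n + lag] for n in range(len(x) - lag)) / energy
        if r > best_r:
            best_r, best_lag = r, lag
    return best_r, (fs / best_lag if best_lag else 0.0)

def is_voiced(x, threshold=0.5):
    """Hypothetical decision rule: voiced if the autocorrelation peak is strong."""
    return periodicity(x)[0] > threshold

# Stand-in for a voiced frame: five harmonics of 100 Hz with 1/k amplitudes.
voiced = [
    sum(math.sin(2.0 * math.pi * k * 100.0 * n / FS) / k for k in range(1, 6))
    for n in range(FRAME)
]

# Stand-in for a whispered frame: white Gaussian noise, seeded for repeatability.
rng = random.Random(1)
whispered = [rng.gauss(0.0, 1.0) for _ in range(FRAME)]

if __name__ == "__main__":
    print("voiced frame:   ", periodicity(voiced), is_voiced(voiced))
    print("whispered frame:", periodicity(whispered), is_voiced(whispered))
```

A detector of this kind finds nothing to lock onto in the noise frame,
which is the situation the question above is about: any cue used to
separate whispered talkers would have to come from somewhere other than f0.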

>
> Alain
>
> Email to AUDITORY should now be sent to AUDITORY@lists.mcgill.ca
> LISTSERV commands should be sent to listserv@lists.mcgill.ca
> Information is available on the WEB at http://www.mcgill.ca/cc/listserv

Jont Allen

--
Jont B. Allen (Technology Leader)
AT&T Labs-Research, Shannon Laboratory
180 Park Ave., Room E161
Florham Park NJ 07932-0971
973/360-8545 voice, x7111 fax, http://www.research.att.com/info/jba
To send a fax that I get by email: 973/360-8545 (Experimental)
