Re: Temporal Envelope based pitch perception (Imran Dhamani)


Subject: Re: Temporal Envelope based pitch perception
From:    Imran Dhamani  <imrandhamani@xxxxxxxx>
Date:    Wed, 3 Feb 2010 06:48:22 +0530
List-Archive:<http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Dear Matthias,

Thanks for your suggestion. Does that mean there could be an error in
the objective pitch estimation of the vocoded stimuli across all three
methods I have used? When I listen to the stimuli I can also hear a
pitch mismatch subjectively, so the vocoded versions do seem to be an
improper match to the original. For the pitch shift I used Adobe
Audition with the following setting:

Pitch Shift (Preserves Tempo)
Lets you raise and lower the pitch without changing the tempo. Lower
percentages raise the pitch and higher percentages lower it. Use this
setting to make a voice sound deeper or higher without affecting the
original playback speed. Or, use differing initial and final
percentages to raise and lower the pitch without affecting the tempo.

--- On Tue, 2/2/10, Matthias Milczynski <Matthias.Milczynski@xxxxxxxxe> wrote:

From: Matthias Milczynski <Matthias.Milczynski@xxxxxxxx>
Subject: Re: Temporal Envelope based pitch perception
To: "Imran Dhamani" <imrandhamani@xxxxxxxx>
Cc: "AUDITORY@xxxxxxxx" <AUDITORY@xxxxxxxx>
Date: Tuesday, 2 February, 2010, 10:29 PM

Dear Imran,

my first guess for the mismatch in objective F0 estimation between
vocoded and original speech samples would be the following:

I consider an autocorrelation-function (ACF) based F0-estimation
approach to be a frequency-domain approach, i.e. the ACF is the real
part of the inverse Fourier transform of the power spectrum. This means
that the resulting period will correspond to a cosine that "matches" a
possible harmonic structure (in case of e.g.
a vowel) of the power spectrum best.

1) Sinusoidal carriers: If you consider a frame of your vocoded signal
(a few ms long, assumed steady-state), the spectrum will consist of
components at the center frequencies plus sidebands resulting from the
amplitude modulation (AM) by the envelope, which in the ideal case is a
dc-shifted sinusoid with a particular frequency and modulation depth.
Even when you modulate all carriers with the same modulator, the
harmonic structure is already distorted by the spacing of the center
frequencies of the carriers. However, since the sidebands are more
apparent (depending on the frequency resolution) than with noise
carriers, your ACF approach still has a "chance" of estimating the
correct period of modulation.

2) Noise carriers: In case of noise carriers the situation is obviously
worse, since the sidebands are less apparent (usually they are
partially masked by the carrier spectrum); how much depends on the
bandwidth of your channels.

Maybe you could elaborate more on your pitch-shifting approach? Are you
using Praat for that?

Hope my explanation makes at least some sense :-).

Matthias

On Tue, 2010-02-02 at 16:51 +0100, Imran Dhamani wrote:
> Hi everyone.
>
> I recently had a question pertaining to envelope based pitch
> perception. I would be grateful for an answer. Thanks in advance.
>
> From the research I have read so far on the importance of temporal
> envelope cues in speech perception, I understand that the
> pitch/fundamental frequency can be reliably represented via the
> temporal envelope cues alone in normal-hearing as well as
> hearing-impaired and cochlear-implanted listeners (at least within a
> certain range/limit of F0).
> In a simple laboratory experiment I also found that my subjective
> judgement of the pitch of speech sounds (word/sentence), as a trained
> listener, was within about 50-60 Hz of the objective estimate of the
> pitch/F0 obtained with LPC/autocorrelation or cepstral analysis in
> Matlab and Praat. In another series of experiments I found that when I
> channel-vocoded speech sounds (500 Hz sine-wave and broadband-noise
> carriers, used alternately) with envelope cut-off frequencies ranging
> from 50-500 Hz and 8-24 bands (based on the Greenwood function/map),
> there was a drastic mismatch in the objective estimate of fundamental
> frequency/pitch between the original and the vocoded stimuli across
> all conditions (for example, if the pitch of the original stimulus was
> 120 Hz, the objectively estimated pitch of the vocoded stimulus was
> around 70-80 Hz). Moreover, I noticed a relatively smaller mismatch
> between original and vocoded stimuli with the sine-wave carrier and
> with increasing envelope cut-off frequency. In the next set of trials
> I also generated various pitch-shifted versions (relatively preserving
> the temporal information) of the same set of speech stimuli and then
> vocoded them using the same variables, and surprisingly found no
> significant/drastic change (just a 10-20 Hz change) in the objectively
> estimated pitch even when I shifted the original stimulus pitch by a
> ratio of 70 (F0 = 220-250 Hz). Later I tried simulating the speech
> stimuli using a cochlear implant simulation with carrier rates from
> 400-10000 and 10-22 channels, and found almost similar (within 5-10
> Hz) objectively estimated pitch values between the original and
> simulated speech stimuli.
> The doubts that I had are as follows:
>
> 1) Are these findings due to a technical error (probably in the
>    objective pitch estimation of the vocoded stimuli) or some other
>    mistake? (Or can subjective findings mask objective data?)
>
> 2) Is pitch representation solely dependent on temporal envelope
>    cues, or are there other contributors such as carrier frequency
>    (other than the Nyquist-Shannon theorem), envelope cut-off,
>    envelope extraction method, temporal analysis/sample length etc.
>    which may also play a major role?
>
> 3) How is the pitch information encoded and extracted in such a
>    complex temporal envelope of speech sounds (is it completely
>    different from the periodicity-based or spectral-based pitch
>    extraction mode)?
>
> 4) Is it that if I band-pass filter the envelope information (based
>    on auditory filters), the filter/channel containing the pitch/F0
>    information will have a different envelope (probably more periodic)
>    than the other parts, and maybe the pitch information is extracted
>    by the auditory system from the complex envelope in this way?
>
> Best regards,
>
> Imran Dhamani
> PhD student
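
Editorial illustration: the ACF description in Matthias's reply (the
autocorrelation as the inverse Fourier transform of the power spectrum)
can be sketched in a few lines of Python/NumPy. This is only a minimal
sketch, not the method used by either poster. The function name acf_f0,
the 80-400 Hz search range, the 16 kHz sampling rate, the carrier
frequencies and both test signals are assumptions chosen for
illustration.

```python
import numpy as np

def acf_f0(x, fs, fmin=80.0, fmax=400.0):
    """Estimate F0 (Hz) of one frame from its autocorrelation peak."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Zero-pad to 2n so the circular ACF equals the linear (biased) ACF.
    power = np.abs(np.fft.rfft(x, 2 * n)) ** 2
    acf = np.fft.irfft(power)[:n]   # inverse FT of the power spectrum
    lo = int(np.ceil(fs / fmax))    # shortest lag searched
    hi = min(int(np.floor(fs / fmin)), n - 1)  # longest lag searched
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return fs / lag

fs = 16000                          # assumed sampling rate (Hz)
t = np.arange(int(0.1 * fs)) / fs   # one 100 ms frame

# Stand-in for a steady vowel: ten equal-amplitude harmonics of 120 Hz.
harmonic = sum(np.cos(2 * np.pi * 120 * k * t) for k in range(1, 11))

# Stand-in for a sine-carrier vocoder: inharmonic carriers (hypothetical
# center frequencies), all amplitude-modulated at 120 Hz.
envelope = 1.0 + 0.8 * np.cos(2 * np.pi * 120 * t)
vocoded = sum(envelope * np.sin(2 * np.pi * fc * t)
              for fc in (500.0, 830.0, 1370.0, 2260.0))

f0_harmonic = acf_f0(harmonic, fs)
f0_vocoded = acf_f0(vocoded, fs)
print(f0_harmonic, f0_vocoded)
```

For the harmonic signal the estimate lands close to 120 Hz (limited by
the integer lag resolution). For the vocoder-like signal the ACF peak
depends on how the carrier frequencies line up, so the estimate need
not equal the modulation rate.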
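
Editorial illustration: the sideband picture in point 1) of Matthias's
reply (a sinusoidal carrier amplitude-modulated by a dc-shifted
sinusoid) can be checked numerically. The sketch below is a hedged
example only; the sampling rate, carrier frequency, modulation
frequency and modulation depth are assumed values, not parameters from
the thread.

```python
import numpy as np

fs = 16000                    # assumed sampling rate (Hz)
fc, fm, m = 1000, 120, 0.8    # carrier (Hz), modulator (Hz), depth; illustrative
t = np.arange(fs) / fs        # exactly 1 s, so FFT bin k corresponds to k Hz

# Dc-shifted sinusoidal envelope applied to a sinusoidal carrier.
envelope = 1.0 + m * np.cos(2 * np.pi * fm * t)
x = envelope * np.sin(2 * np.pi * fc * t)

# Single-sided amplitude spectrum.
amp = 2.0 * np.abs(np.fft.rfft(x)) / len(x)
peaks = np.flatnonzero(amp > 0.01)

print(peaks.tolist())         # [880, 1000, 1120]
print(np.round(amp[peaks], 3).tolist())
```

The spectrum contains the carrier at fc with amplitude 1 and two
sidebands at fc - fm and fc + fm, each with amplitude m/2. With several
carriers at inharmonic center frequencies, each carrier brings its own
sideband pair spaced by fm, but the carriers themselves do not fall on
a common harmonic series.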


This message came from the mail archive
/home/empire6/dpwe/public_html/postings/2010/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University