Re: [AUDITORY] On 3D audio rendering for signals with the low sampling frequency ("Richard F. Lyon")


Subject: Re: [AUDITORY] On 3D audio rendering for signals with the low sampling frequency
From:    "Richard F. Lyon"  <dicklyon@xxxxxxxx>
Date:    Sat, 13 Aug 2022 20:03:59 -0700

Yes, good idea to look for solutions to the difficulty.

Reviewing my book's Figure 22.7, there's a pretty good spectral notch cue
to elevation in the 5.5-8 kHz region (and higher); 8 kHz might be enough
for elevation up to about 45 degrees (find free book PDF via
machinehearing.org -- search that blog for "free").

For resolving front/back confusion, that's hard unless you add the effects
of lateralization change with head turning. Using a head tracker or gyro
to change the lateral angle to the sound, relative to the head, is very
effective for letting the user disambiguate, if they have time to move a
little. So it depends on what you're trying to do.

If it were impossible to localize sounds with a 16 kHz sample rate, it
would be equally impossible to localize sounds with no energy above 8 kHz.
I don't think that's the case. I can't hear anything above 8 kHz (unless
it's quite intense), and I don't sense that I have any difficulty
localizing sounds around me. Probably if we measured, though, we'd find
I'm not as accurate as a person with better hearing.

And I don't think the sample interval of 1/16000 sec (62.5 microseconds)
provides a strong inherent limit on ITD accuracy. The bandwidth of 8 kHz
is about half of what's "normal", so theoretical TDOA resolution should be
expected to be no worse than double normal, say 20–40 microseconds (about
half of a sample interval) instead of 10–20 microseconds. I wouldn't be
surprised if the ITD resolution threshold was even closer to normal
(around 1/4 sample interval), since our ITD-computing structure is
dominated by lower-frequency input.

Dick

On Fri, Aug 12, 2022 at 9:20 PM Junfeng Li <junfeng.li.1979@xxxxxxxx>
wrote:

> Dear Frederick,
>
> Thank you so much for the references that you mentioned.
>
> "[...] up–down cues are located mainly in the 6–12-kHz band, and
> front–back cues in the 8–16-kHz band."
> According to this statement, it seems impossible to solve the problems of
> elevation perception and front-back confusion when the output signal is
> sampled at 16kHz.
> Though I know it is difficult, I always try to find some solutions.
>
> Thanks again.
>
> Best regards,
> Junfeng
>
> On Sat, Aug 13, 2022 at 12:50 AM Frederick Gallun <fgallun@xxxxxxxx>
> wrote:
>
>> The literature on the HRTF over the past 60 years has made it very clear
>> that "[...] up–down cues are located mainly in the 6–12-kHz band, and
>> front–back cues in the 8–16-kHz band." (Langendijk and Bronkhorst, 2002)
>>
>> Here are a few places to start:
>>
>> Langendijk, E. H. A., & Bronkhorst, A. W. (2002). Contribution of
>> spectral cues to human sound localization. *The Journal of the
>> Acoustical Society of America*, *112*(4), 1583–1596.
>> https://doi.org/10.1121/1.1501901
>>
>> Mehrgardt, S., & Mellert, V. (1977). Transformation characteristics of
>> the external human ear. *The Journal of the Acoustical Society of
>> America*, *61*(6), 1567–1576. https://doi.org/10.1121/1.381470
>>
>> Shaw, E. A. G., & Teranishi, R. (1968). Sound Pressure Generated in an
>> External‐Ear Replica and Real Human Ears by a Nearby Point Source. *The
>> Journal of the Acoustical Society of America*, *44*(1), 240–249.
>> https://doi.org/10.1121/1.1911059
>>
>> ---------------------------------------------
>>
>> Frederick (Erick) Gallun, PhD, FASA, FASHA | he/him/his
>> Professor, Oregon Hearing Research Center, Oregon Health & Science
>> University
>> "Diversity is like being invited to a party, Inclusion is being asked to
>> dance, and Belonging is dancing like no one’s watching" - Gregory Lewis
>>
>>
>> On Thu, Aug 11, 2022 at 11:59 PM Junfeng Li <junfeng.li.1979@xxxxxxxx>
>> wrote:
>>
>>> Dear Leslie,
>>>
>>> When downsampling to 8/16kHz, we really found the localization accuracy
>>> decreases, even for horizon
>>> Do you have any good ideas to solve it?
>>>
>>> Thanks a lot.
>>>
>>> Best regards,
>>> Junfeng
>>>
>>>
>>> On Thu, Aug 11, 2022 at 4:04 PM Prof Leslie Smith <
>>> l.s.smith@xxxxxxxx> wrote:
>>>
>>>> I'd also wonder about the time resolution: 16KHz = 1/16000 sec between
>>>> samples = 62 microseconds.
>>>> That's relatively long for ITD (TDOA) estimation, which would suggest
>>>> that localisation of lower frequency signals would be impeded.
>>>>
>>>> (I don't have evidence for this: it's just a suggestion).
>>>>
>>>> --Leslie Smith
>>>>
>>>> Junfeng Li wrote:
>>>> > Dear all,
>>>> >
>>>> > We are working on 3D audio rendering for signals with low sampling
>>>> > frequency.
>>>> > As you may know, the HRTFs are normally measured at the high sampling
>>>> > frequency, e.g., 48kHz or 44.1kHz. However, the sampling frequency of
>>>> > sound signals in our application is restricted to 16 kHz. Therefore,
>>>> > to render this low-frequency (≤8kHz) signal, one straight way is to
>>>> > first downsample the HRTFs from 48kHz/44.1kHz to 16kHz and then
>>>> > convolve with sound signals.
>>>> > However, the sound localization performance of the signal rendered
>>>> > with this approach is greatly decreased, especially elevation
>>>> > perception. To improve the sound localization performance, I am now
>>>> > wondering whether there is a certain good method to solve or mitigate
>>>> > this problem in this scenario.
>>>> >
>>>> > Any discussion is welcome.
>>>> >
>>>> > Thanks a lot again.
>>>> >
>>>> > Best regards,
>>>> > Junfeng
>>>> >
>>>>
>>>>
>>>> --
>>>> Prof Leslie Smith (Emeritus)
>>>> Computing Science & Mathematics,
>>>> University of Stirling, Stirling FK9 4LA
>>>> Scotland, UK
>>>> Tel +44 1786 467435
>>>> Web: http://www.cs.stir.ac.uk/~lss
>>>> Blog: http://lestheprof.com
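
A rough numerical check of the sample-interval point made in the reply above (this sketch is not part of the original message). It delays a noise signal sampled at 16 kHz by a fraction of a sample and then recovers that delay from the cross-correlation of the two channels. The function names `fractional_delay` and `estimate_delay`, the 16x interpolation factor, the odd signal length, and the white-noise test signal are all assumptions chosen for illustration. It says nothing about perceptual thresholds; it only suggests that waveforms sampled every 62.5 microseconds can still carry much smaller time differences.

```python
import numpy as np

FS = 16_000  # sample rate in Hz, as discussed in the thread


def fractional_delay(x, delay_s, fs=FS):
    """Circularly delay x by delay_s seconds via a linear phase shift.

    Assumes len(x) is odd, so the real FFT has no Nyquist bin to special-case.
    """
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum = np.fft.rfft(x) * np.exp(-2j * np.pi * freqs * delay_s)
    return np.fft.irfft(spectrum, n=n)


def estimate_delay(left, right, fs=FS, upsample=16):
    """Estimate the delay of right relative to left, in seconds.

    The circular cross-correlation is interpolated by zero-padding the
    cross-spectrum, then the peak is refined with a three-point parabola.
    """
    n = len(left)
    m = n * upsample
    cross = np.conj(np.fft.rfft(left)) * np.fft.rfft(right)
    xcorr = np.fft.irfft(cross, n=m)  # band-limited interpolation
    k = int(np.argmax(xcorr))
    y0, y1, y2 = xcorr[k - 1], xcorr[k], xcorr[(k + 1) % m]
    offset = 0.5 * (y0 - y2) / (y0 - 2.0 * y1 + y2)
    lag = k + offset
    if lag > m / 2:  # map wrapped-around lags to negative delays
        lag -= m
    return lag / (fs * upsample)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.standard_normal(8001)  # about 0.5 s of white noise
    for true_itd in (20e-6, -30e-6):  # both well under one sample interval
        right = fractional_delay(left, true_itd)
        est = estimate_delay(left, right)
        print(f"true {true_itd * 1e6:+.1f} us, estimated {est * 1e6:+.2f} us")
```

The test signal here is an idealized, noise-free, exactly delayed copy, so the recovered values are a best case; with real HRTF-filtered signals the estimate would be less clean.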


This message came from the mail archive
src/postings/2022/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University