Subject: Re: [AUDITORY] On 3D audio rendering for signals with the low sampling frequency
From: Junfeng Li <junfeng.li.1979@xxxxxxxx>
Date: Thu, 11 Aug 2022 14:52:12 +0800

Dear Neeraj,

Thank you very much for your comments.

> Thanks for sharing this observation here. I do not have a solution now but
> curious to know more.
> I can relate the loss in elevation to poor capture of the spectral notches
> present in the HRTF. But I did not assume that the notches beyond 8kHz are
> this crucial. Are the HRTF personalized?

Yes. In theory, individualized HRTFs should give high localization
accuracy. However, we found that even with individualized HRTFs and head
tracking, sound localization over headphones is still not as good as we
expected. In particular, the errors in elevation perception remain large,
although we know that localization accuracy in elevation is much lower than
in the horizontal plane even for normal-hearing listeners.

> Also, I am now wondering, is it always the case that elevation information
> is poor for 16 kHz audio signals. Is there literature on this?
> Just a quick shot, I will also try downsampling (without low pass
> filtering) the HRTF to 16 kHz and see if the aliased HRTF spectrum
> significantly corrupts the 3-D perception. I will bet - not much. But will
> keep fingers crossed.

I am sorry, but I do not know of any literature that addresses this issue
directly. You are welcome to try it and see the results. As an example,
telephone speech is normally sampled at 8 kHz, and when we 3D-render such
8 kHz speech, it is quite difficult to perceive elevation. This is the main
problem that we have to consider.

I look forward to your results and further discussion.

Best regards,
Junfeng

> On Thu, Aug 11, 2022 at 11:04 AM Junfeng Li <junfeng.li.1979@xxxxxxxx>
> wrote:
>
>> Dear Dick,
>>
>> Thanks a lot for your information.
>>
>> Yeah, the main problem for us is the limitation of the 16kHz sampling
>> frequency at the output side. Therefore, even if we do bandwidth extension
>> for the input signal, we have to downsample to 16kHz after the 3D rendering
>> processing. I am wondering whether there is any possible method based on
>> some psychoacoustic principle, or something like that?
>>
>> Thanks again.
>>
>> Best regards
>> Junfeng
>>
>> On Thu, Aug 11, 2022 at 12:29 PM Richard F. Lyon <dicklyon@xxxxxxxx>
>> wrote:
>>
>>> You could do "bandwidth extension" on the signals you want to
>>> spatialize, e.g. with some of the methods at
>>> https://gfx.cs.princeton.edu/pubs/Su_2021_BEI/ICASSP2021_Su_Wang_BWE.pdf
>>> and then apply the high-sample-rate HRTFs.
>>> Of course, if your system has a 16 ksps limitation on the output side,
>>> that will be of no use.
>>>
>>> Dick
>>>
>>>
>>> On Wed, Aug 10, 2022 at 9:22 PM Junfeng Li <junfeng.li.1979@xxxxxxxx>
>>> wrote:
>>>
>>>> Dear all,
>>>>
>>>> We are working on 3D audio rendering for signals with low sampling
>>>> frequency.
>>>> As you may know, HRTFs are normally measured at a high sampling
>>>> frequency, e.g., 48kHz or 44.1kHz. However, the sampling frequency of sound
>>>> signals in our application is restricted to 16 kHz. Therefore, to render
>>>> this low-frequency (≤8kHz) signal, one straightforward way is to first
>>>> downsample the HRTFs from 48kHz/44.1kHz to 16kHz and then convolve them
>>>> with the sound signals.
>>>> However, the sound localization performance of the signal rendered with
>>>> this approach is greatly decreased, especially elevation perception. To
>>>> improve the sound localization performance, I am now wondering whether
>>>> there is a good method to solve or mitigate this problem in this
>>>> scenario.
>>>>
>>>> Any discussion is welcome.
>>>>
>>>> Thanks a lot again.
>>>>
>>>> Best regards,
>>>> Junfeng
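
The two HRTF downsampling variants discussed in this thread (with and without low-pass filtering before going from 48 kHz to 16 kHz) can be compared with a minimal sketch like the one below. Everything in it is an assumption made for illustration: the function names, the 63-tap Hamming-windowed sinc filter, the 7.2 kHz cutoff, and the synthetic 12 kHz tone burst that stands in for a measured HRIR. It is written in plain Python so that it runs without extra packages; in practice a library routine such as `scipy.signal.resample_poly` would probably be the more usual choice.

```python
import math

def lowpass_fir(cutoff_hz, fs_hz, num_taps=63):
    """Hamming-windowed sinc low-pass FIR, normalized to unity gain at DC."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]

def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def downsample_hrir(hrir, factor=3, fs_hz=48000, anti_alias=True):
    """Reduce an impulse response from fs_hz to fs_hz / factor."""
    if anti_alias:
        # Assumed cutoff: slightly below the new Nyquist frequency (8 kHz).
        cutoff_hz = 0.45 * fs_hz / factor
        hrir = convolve(hrir, lowpass_fir(cutoff_hz, fs_hz))
    # Scale by the factor so the filter gain is preserved at the lower rate.
    return [factor * h for h in hrir[::factor]]

def energy(x):
    return sum(v * v for v in x)

# Hypothetical stand-in for an HRIR: a 12 kHz tone burst sampled at 48 kHz.
tone_12k = [math.cos(math.pi * n / 2) for n in range(240)]

naive = downsample_hrir(tone_12k, anti_alias=False)    # 12 kHz folds down to 4 kHz
filtered = downsample_hrir(tone_12k, anti_alias=True)  # 12 kHz is mostly removed

print("energy without low-pass:", energy(naive))
print("energy with low-pass:   ", energy(filtered))

# Rendering at 16 kHz is then a convolution of the signal with the HRIR.
signal_16k = [1.0, 0.5, -0.25]  # placeholder input samples
rendered = convolve(signal_16k, filtered)
```

Without the low-pass filter, content between 8 and 24 kHz does not disappear; it folds back into the 0 to 8 kHz band (here 12 kHz lands at 4 kHz), so the aliased HRTF differs from the original inside the audible band as well. Whether that changes elevation perception is the open question raised above, and this sketch does not answer it.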