Re: [AUDITORY] On 3D audio rendering for signals with the low sampling frequency (Vani Rajendran)

Subject: Re: [AUDITORY] On 3D audio rendering for signals with the low sampling frequency
From: Vani Rajendran <vani.g.rajendran@xxxxxxxx>
Date: Tue, 16 Aug 2022 14:25:54 -0500

Hi Junfeng,

We found that "blurring" nearby locations of a (generic) HRTF could improve
localisation accuracy in the vertical plane; this might be worth trying.
Simple filtering could also bias perception upwards or downwards even within
your sampling constraints.

Rajendran, V.G. and Gamper, H., 2019. Spectral manipulation improves
elevation perception with non-individualized head-related transfer
functions. *The Journal of the Acoustical Society of America*, *145*(3),
pp.EL222-EL228.

Cheers!
Vani

On Tue, Aug 16, 2022 at 1:23 AM Junfeng Li <junfeng.li.1979@xxxxxxxx> wrote:

> Dear Piotr,
>
> Our experimental results also showed an increase of sound-localization
> errors in vertical planes when the sampling frequency is 16 kHz or 8 kHz,
> which is consistent with the findings reported in the two papers that you
> mentioned.
>
> Though a system with a high sampling frequency is preferred, the
> sampling frequency of our playback system is limited to 8 or 16 kHz. We
> cannot change it at present. Therefore, I am now looking for an alternative
> approach to solve/mitigate this problem.
>
> Best regards,
> Junfeng
>
>
> On Tue, Aug 16, 2022 at 12:17 PM Piotr Majdak <piotr@xxxxxxxx> wrote:
>
>> Dear all,
>>
>> With respect to the sound localization in vertical planes: We also could
>> look at what happens when the spectral content above 8 kHz is removed:
>> Best et al. (2005, “The role of high frequencies in speech localization,”
>> JASA 118, 353–63): Results for the sound localization with low-pass
>> filtered (8 kHz) speech (their Exp I) show a drastic increase of
>> sound-localization errors in vertical planes (their Fig.
>> 5), no effect in the lateral plane.
>>
>> Majdak et al. (2013, “Effect of long-term training on sound localization
>> performance with spectrally warped and band-limited head-related transfer
>> functions,” JASA 134, 2148–2159): Results for the sound localization with
>> low-pass filtered (8 kHz) white noises show a large increase of
>> localization errors (Fig. 6, red circles at "Pre") in vertical planes
>> (front/back, top/down), no changes in the lateral dimensions (left/right).
>>
>> It's not encouraging for systems going up to 8 kHz only, though :-(. And
>> of course, consideration of head movements may help...
>>
>> Best regards,
>> Piotr
>>
>> On 15.08.2022 at 00:52, Adam Weisser wrote:
>>
>> Dear Junfeng, Alain, and all,
>>
>> I think that some solutions to the undersampling / aliasing problem that
>> you described should exist, but they likely depend on where the
>> sampling-rate bottleneck lies: at the input, in processing, at the output
>> stage, or in all of them. Also, it depends on the computational
>> capabilities of the system and whether it has to work in real time, and if
>> so what the permissible delay is.
>>
>> I'm aware of two general approaches to circumvent the Nyquist criterion:
>> 1. Compressed sensing - This heavily researched signal-processing method
>> uses signal sparsity to faithfully reconstruct undersampled signals [1].
>>
>> 2. Trading off aliasing and noise - This is a classical result that
>> employs nonuniform sampling at lower rates than Nyquist, whereby the
>> aliasing that would otherwise arise is replaced by noise [2]. It is thought
>> that this is what happens in the retina, where the optical image is densely
>> sampled in the fovea by the photoreceptors, but becomes gradually
>> undersampled away from the fovea [3].
>> Had the photoreceptor density been
>> uniform and regular over the retina, the resolution of the central vision
>> would greatly suffer and the image would also be severely aliased. However,
>> this trick works only if the sampling is truly stochastic. If the
>> "localization noise" level (maybe manifest as audio noise) can be
>> sacrificed, then this approach may work, combined with dither.
>>
>> Regardless of the specific system architecture at hand, none of these
>> methods appears straightforward to implement.
>>
>> Finally, regarding Alain's comment about auditory sampling - the neat
>> trick that is found in spatial processing of vision may be analogous to
>> what goes on in temporal processing of stimuli at the transduction stage of
>> the auditory nerve. Neural adaptation can be thought of as dense sampling
>> of the signal around its onset / transient portion, which becomes more
>> sparsely sampled quickly after the onset. Because of adaptation, this
>> effect is very elusive, but I believe that it is measurable
>> notwithstanding. I tried to demonstrate it psychoacoustically in Appendix E
>> of [4]. While I don't know how it relates to binaural processing directly,
>> there may be instantaneous effects that may be detectable there too, given
>> that the input to both processing types is the same.
>>
>> All the best,
>> Adam.
>>
>> [1] Candes, E. J., Romberg, J. K., & Tao, T. (2006). Stable signal
>> recovery from incomplete and inaccurate measurements. Communications on
>> Pure and Applied Mathematics: A Journal Issued by the Courant Institute of
>> Mathematical Sciences, 59(8), 1207-1223.
>>
>> [2] Shapiro, Harold S and Silverman, Richard A. Alias-free sampling of
>> random noise. Journal of the Society for Industrial and Applied
>> Mathematics, 8(2):225–248, 1960.
>>
>> [3] Yellott, John I. Spectral consequences of photoreceptor sampling in
>> the rhesus retina. Science, 221(4608):382–385, 1983.
>>
>> [4] Weisser, A. (2021).
>> Treatise on Hearing: The Temporal Auditory
>> Imaging Theory Inspired by Optics and Communication.
>> arXiv preprint arXiv:2111.04338. <https://arxiv.org/abs/2111.04338>
>>
>>
>>
>> On Sun, Aug 14, 2022, at 4:47 AM, Alain de Cheveigne wrote:
>>
>> Hi Dick, all,
>>
>> A couple of thoughts. I'm no expert of spatial hearing, so they may be
>> off the mark.
>>
>> > And I don't think the sample interval of 1/16000 sec provides a strong
>> > inherent limit on ITD accuracy. The bandwidth of 8 kHz is about half of
>> > what's "normal", so theoretical TDOA resolution should be expected to be
>> > no worse than double normal, say 20–40 microseconds (about half of a
>> > sample interval) instead of 10–20 microseconds. I wouldn't be surprised
>> > if the ITD resolution threshold was even closer to normal (around 1/4
>> > sample interval), since our ITD-computing structure is dominated by
>> > lower-frequency input.
>>
>>
>> Indeed, the sample interval does not limit ITD estimation resolution.
>> You can get arbitrary resolution by interpolating the cross-correlation
>> function near its peak (for example by fitting a parabola to three samples
>> closest to the peak). A similar argument applies to fundamental frequency
>> estimation (--> pitch) from the autocorrelation function as in the YIN
>> method.
>>
>> This assumes that the CCF or ACF is smooth enough for the interpolation
>> to be accurate, and for that the audio signals must be smooth, i.e.
>> band-limited. The purpose of a low-pass antialiasing filter associated
>> with sampling or resampling is to *insure* that this is the case for
>> typical signals, but that "insurance" is unnecessary if the signals contain
>> no high-frequency power to start with.
>>
>> Thus, the choice of low-pass filter is a bit of a free parameter under
>> the control of the engineer or experimenter.
>> A wide filter (or none) is OK
>> if the signals are known to contain little or no high-frequency power, a
>> sharp filter is needed if the signals are strongly high-pass. Engineers
>> typically err on the side of precaution by designing filters with strong
>> attenuation beyond Nyquist, usually with the additional goal of keeping the
>> pass-band flat. This requires a filter with a long impulse response.
>> There's leeway in the exact choice. EEs love the topic.
>>
>> This brings me to my second point. Are there perceptual correlates of
>> antialiasing filtering? There are two reasons to suspect an effect on
>> spatial hearing. First, a long IR might widen the CCF peak and blur the
>> "crisp" peak in the short-term CCF associated with a transient. Second, the
>> frequency-domain features of the filter transfer function might interact
>> with spectral notches characteristic of elevation or front-vs-back position
>> of sources, particularly if those features are estimated by neural circuits
>> also sensitive to time.
>>
>> Again, this is pure speculation. Unfortunately, antialiasing filters are
>> rarely specified in detail (in systems or studies), and I'm not aware of
>> any study aiming to characterize their perceptual effects or demonstrate
>> that there are none. Anecdotally, I remember being annoyed when listening
>> to music on an early CD player, by what I attributed to high-frequency
>> ringing of antialiasing or reconstruction filters with poles just below
>> Nyquist. That was when I could still hear in that region...
>>
>> Alain
>>
>>
>> > On 14 Aug 2022, at 05:03, Richard F. Lyon <DickLyon@xxxxxxxx> wrote:
>> >
>> > Yes, good idea to find some solutions to the difficult.
>> >
>> > Reviewing my book's Figure 22.7, there's a pretty good spectral notch
>> > cue to elevation in the 5.5-8 kHz region (and higher); 8 kHz might be
>> > enough for elevation up to about 45 degrees (find free book PDF via
>> > machinehearing.org -- search that blog for "free".)
>> >
>> > For resolving front/back confusion, that's hard unless you add the
>> > effects of lateralization change with head turning. Using a head tracker
>> > or gyro to change the lateral angle to the sound, relative to the head,
>> > is very effective for letting the user disambiguate, if they have time
>> > to move a little. So it depends on what you're trying to do.
>> >
>> > If it was impossible to localize sounds with a 16 kHz sample rate, it
>> > would be equally impossible to localize sounds with no energy above
>> > 8 kHz. I don't think that's the case. I can't hear anything above 8 kHz
>> > (unless it's quite intense), and I don't sense that I have any
>> > difficulty localizing sounds around me. Probably if we measured though
>> > we'd find I'm not as accurate as a person with better hearing.
>> >
>> > And I don't think the sample interval of 1/16000 sec provides a strong
>> > inherent limit on ITD accuracy. The bandwidth of 8 kHz is about half of
>> > what's "normal", so theoretical TDOA resolution should be expected to be
>> > no worse than double normal, say 20–40 microseconds (about half of a
>> > sample interval) instead of 10–20 microseconds. I wouldn't be surprised
>> > if the ITD resolution threshold was even closer to normal (around 1/4
>> > sample interval), since our ITD-computing structure is dominated by
>> > lower-frequency input.
>> >
>> > Dick
>> >
>> >
>> > On Fri, Aug 12, 2022 at 9:20 PM Junfeng Li <junfeng.li.1979@xxxxxxxx>
>> > wrote:
>> > Dear Frederick,
>> >
>> > Thank you so much for the references that you mentioned.
>> >
>> > "[...]
>> > up–down cues are located mainly in the 6–12-kHz band, and
>> > front–back cues in the 8–16-kHz band."
>> > According to this statement, it seems impossible to solve the problems
>> > of elevation perception and front-back confusion when the output signal
>> > is sampled at 16 kHz.
>> > Though I know it is difficult, I always try to find some solutions.
>> >
>> > Thanks again.
>> >
>> > Best regards,
>> > Junfeng
>> >
>> > On Sat, Aug 13, 2022 at 12:50 AM Frederick Gallun <fgallun@xxxxxxxx>
>> > wrote:
>> > The literature on the HRTF over the past 60 years has made it very
>> > clear that "[...] up–down cues are located mainly in the 6–12-kHz band,
>> > and front–back cues in the 8–16-kHz band." (Langendijk and Bronkhorst,
>> > 2002)
>> >
>> > Here are a few places to start:
>> >
>> > Langendijk, E. H. A., & Bronkhorst, A. W. (2002). Contribution of
>> > spectral cues to human sound localization. The Journal of the Acoustical
>> > Society of America, 112(4), 1583–1596. https://doi.org/10.1121/1.1501901
>> >
>> > Mehrgardt, S., & Mellert, V. (1977). Transformation characteristics of
>> > the external human ear. The Journal of the Acoustical Society of America,
>> > 61(6), 1567–1576. https://doi.org/10.1121/1.381470
>> >
>> > Shaw, E. A. G., & Teranishi, R. (1968). Sound Pressure Generated in an
>> > External-Ear Replica and Real Human Ears by a Nearby Point Source. The
>> > Journal of the Acoustical Society of America, 44(1), 240–249.
>> > https://doi.org/10.1121/1.1911059
>> >
>> > ---------------------------------------------
>> >
>> > Frederick (Erick) Gallun, PhD, FASA, FASHA | he/him/his
>> > Professor, Oregon Hearing Research Center, Oregon Health & Science
>> > University
>> > "Diversity is like being invited to a party, Inclusion is being asked
>> > to dance, and Belonging is dancing like no one’s watching" - Gregory
>> > Lewis
>> >
>> >
>> > On Thu, Aug 11, 2022 at 11:59 PM Junfeng Li <junfeng.li.1979@xxxxxxxx>
>> > wrote:
>> > Dear Leslie,
>> >
>> > When downsampling to 8/16kHz, we really found the localization accuracy
>> > decreases, even for horizon
>> > Do you have any good ideas to solve it?
>> >
>> > Thanks a lot.
>> >
>> > Best regards,
>> > Junfeng
>> >
>> >
>> > On Thu, Aug 11, 2022 at 4:04 PM Prof Leslie Smith <l.s.smith@xxxxxxxx>
>> > wrote:
>> > I'd also wonder about the time resolution: 16KHz = 1/16000 sec between
>> > samples = 62 microseconds.
>> > That's relatively long for ITD (TDOA) estimation, which would suggest
>> > that localisation of lower frequency signals would be impeded.
>> >
>> > (I don't have evidence for this: it's just a suggestion).
>> >
>> > --Leslie Smith
>> >
>> > Junfeng Li wrote:
>> > > Dear all,
>> > >
>> > > We are working on 3D audio rendering for signals with low sampling
>> > > frequency.
>> > > As you may know, the HRTFs are normally measured at the high sampling
>> > > frequency, e.g., 48kHz or 44.1kHz. However, the sampling frequency of
>> > > sound signals in our application is restricted to 16 kHz. Therefore,
>> > > to render this low-frequency (≤8kHz) signal, one straight way is to
>> > > first downsample the HRTFs from 48kHz/44.1kHz to 16kHz and then
>> > > convolve with sound signals.
>> > > However, the sound localization performance of the signal rendered
>> > > with this approach is greatly decreased, especially elevation
>> > > perception.
>> > > To improve the sound localization performance, I am now wondering
>> > > whether there is a certain good method to solve or mitigate this
>> > > problem in this scenario.
>> > >
>> > > Any discussion is welcome.
>> > >
>> > > Thanks a lot again.
>> > >
>> > > Best regards,
>> > > Junfeng
>> > >
>> >
>> >
>> > --
>> > Prof Leslie Smith (Emeritus)
>> > Computing Science & Mathematics,
>> > University of Stirling, Stirling FK9 4LA
>> > Scotland, UK
>> > Tel +44 1786 467435
>> > Web: http://www.cs.stir.ac.uk/~lss
>> > Blog: http://lestheprof.com
>> >
>>
>>
>> --
>> Piotr Majdak
>> Fachbereich Hören
>> <https://www.oeaw.ac.at/isf/forschung/fachbereiche-teams/hoeren>
>> Institut für Schallforschung <https://www.oeaw.ac.at/isf>
>> Österreichische Akademie der Wissenschaften <http://www.oeaw.ac.at/>
>> Wohllebengasse 12-14, 1040 Wien
>> Tel.: +43 1 51581-2511
>>
Spectral manipulation improves elevation perception with non-individu= alized head-related transfer functions.=C2=A0</span><i style=3D"font-family= :Arial,sans-serif;font-size:13px">The Journal of the Acoustical Society of = America</i><span style=3D"font-family:Arial,sans-serif;font-size:13px">,=C2= =A0</span><i style=3D"font-family:Arial,sans-serif;font-size:13px">145</i><= span style=3D"font-family:Arial,sans-serif;font-size:13px">(3), pp.EL222-EL= 228.</span><br></div><div><span style=3D"font-family:Arial,sans-serif;font-= size:13px"><br></span></div><div><font face=3D"Arial, sans-serif">Cheers!</= font></div><div><font face=3D"Arial, sans-serif">Vani</font></div><div><fon= t face=3D"Arial, sans-serif"><br></font></div></div><br><div class=3D"gmail= _quote"><div dir=3D"ltr" class=3D"gmail_attr">On Tue, Aug 16, 2022 at 1:23 = AM Junfeng Li <<a href=3D"mailto:junfeng.li.1979@xxxxxxxx">junfeng.li.1= 979@xxxxxxxx</a>> wrote:<br></div><blockquote class=3D"gmail_quote" sty= le=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);paddi= ng-left:1ex"><div dir=3D"ltr"><div dir=3D"ltr">Dear Piotr,=C2=A0</div><div = dir=3D"ltr"><br></div><div>Our experimental results also showed the increas= e of sound-localization errors in vertical planes when the sampling frequen= cy is 16kHz or 8kHz, which is consistent with the findings reported in the = two papers that you mentioned.=C2=A0</div><div><br></div><div>Though the sy= stem with high sampling frequency is preferred, however, the sampling frequ= ency of our playback system is limited to 8 or 16kHz. We cannot change it a= t present. 
Therefore, I am now looking for an alternative approach to solve= /mitigate this problem.</div><div><br></div><div>Best regards,</div><div>Ju= nfeng=C2=A0</div><div><br></div><br><div class=3D"gmail_quote"><div dir=3D"= ltr" class=3D"gmail_attr">On Tue, Aug 16, 2022 at 12:17 PM Piotr Majdak &lt= ;<a href=3D"mailto:piotr@xxxxxxxx" target=3D"_blank">piotr@xxxxxxxx</a>= > wrote:<br></div><blockquote class=3D"gmail_quote" style=3D"margin:0px = 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> =20 =20 =20 <div bgcolor=3D"#FFFFFF"> <p>Dear all, <br> </p> <p>With respect to the sound localization in vertical planes: We also could look what happens when the spectral content above 8 kHz is removed: <br> </p> <div style=3D"line-height:1.35;margin-left:2em"> <div>Best et al. (2005, =E2=80=9CThe role of high frequencies in speech localization,=E2=80=9D JASA 118, 353=E2=80=93= 63): Results for the sound localization with low-pass filtered (8 kHz) speech (their Exp I) show a drastic increase of sound-localization errors in vertical planes (their Fig. 5), no effect in the lateral plane.=C2=A0 <br> </div> <br> </div> <div style=3D"line-height:1.35;margin-left:2em">Majdak et al. (2013, = =E2=80=9CEffect of long-term training on sound localization performance with spectrally warped and band-limited head-related transfer functions,=E2=80=9D JASA 134, 2148=E2=80=932159): Results for the sound localization with low-pass filtered (8 kHz) white noises show a large increase of localization errors (Fig. 6, red circles at "Pre") in verti= cal planes (front/back, top/down), no changes in the lateral dimensions (left/right).<br> <br> </div> <div style=3D"line-height:1.35;margin-left:2em">It's not encouragin= g for systems going up to 8 kHz only, though :-(. 
And of course, consideration of head movements may help...<br> </div> <div style=3D"line-height:1.35;margin-left:2em"><br> </div> <div style=3D"line-height:1.35;margin-left:2em">Best regards, <br> </div> <div style=3D"line-height:1.35;margin-left:2em">Piotr<br> </div> <div style=3D"line-height:1.35;margin-left:2em"><br> </div> <div>Am 15.08.2022 um 00:52 schrieb Adam Weisser:<br> </div> <blockquote type=3D"cite"> =20 =20 =20 <div style=3D"font-family:Arial">Dear Junfeng, Alain, and all,<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">I think that some solutions to the undersampling / aliasing problem that you described should exist, but they likely depend on where the sampling-rate bottleneck lies: at the input, in processing, at the output stage, or in all of them. Also, it depends on the computational capabilities of the system and whether it has to work in real time, and if so what the permissible delay is. <br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">I'm aware of two general approaches to circumvent the Nyquist criterion: <br> </div> <div style=3D"font-family:Arial">1. Compressed sensing - This heavily researched signal-processing method uses signal sparsity to faithfully reconstruct undersampled signals [1].<br> </div> <div style=3D"font-family:Arial"><br> </div> <div><span style=3D"font-family:arial,sans-serif,sans-serif">2. Tradi= ng off aliasing and noise - This is a classical result that employs nonuniform sampling at lower rates than Nyquist, whereby the aliasing that would otherwise arise is replaced by noise [2]. It is thought that this is what happens in the retina, where the optical image is densely sampled in the fovea by the photoreceptors, but becomes gradually undersampled away from the fovea [3]. 
Had the photoreceptor density been uniform and regular over the retina, the resolution of the central vision would great suffer and the image would also be severely aliased. However, this trick works only if the sampling is truly stochastic. If the "localization noise" level (maybe manifest as audio= noise) can be sacrificed, then this approach may work, combined with dither.</span><span style=3D"font-family:arial,sans-serif,sans-se= rif"></span><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">Regardless of the specific system architecture at hand, none of these methods appears straightforward to implement.<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">Finally, regarding Alain's comme= nt about auditory sampling - the neat trick that is found in spatial processing of vision may be analogous to what goes on in temporal processing of stimuli at the transduction stage of the auditory nerve. Neural adaptation can be thought of as dense sampling of the signal around its onset / transient portion, which becomes more sparsely sampled quickly after the onset. Because of adaptation, this effect is very illusive, but I believe that it is measurable notwithstanding. I tried to demonstrate it psychoacoustically in Appendix E of [4]. While I don't know how it relates to binaural processing directly, ther= e may be instantaneous effects that may be detectable there too, given that the input to both processing types is the same. <br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"> <div style=3D"font-family:Arial">All the best,<br> </div> <div style=3D"font-family:Arial">Adam.<br> </div> <div><br> </div> </div> <div style=3D"font-family:Arial">[1] Candes, E. J., Romberg, J. K., & Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. 
Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8), 1207-1223.<br> </div> <div> <div style=3D"font-family:Arial"><br> </div> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"> <div style=3D"font-family:Arial">[2] Shapiro, Harold S and Silverman, Richard A. Alias-free sampling of random noise. Journal of the<br> </div> <div style=3D"font-family:Arial">Society for Industrial and Applied Mathematics, 8(2):225?248, 1960.<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">[3] Yellott, John I. Spectral consequences of photoreceptor sampling in the rhesus retina. Science, 221<br> </div> </div> <div style=3D"font-family:Arial">(4608):382?385, 1983.<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">[4] <a href=3D"https://arxiv.org/abs= /2111.04338" target=3D"_blank">Weisser, A. (2021). Treatise on Hearing: The Temporal Auditory Imaging Theory Inspired by Optics and Communication. 
</a><i><a href=3D"ht= tps://arxiv.org/abs/2111.04338" target=3D"_blank">arXiv preprint arXiv:2111= .04338</a></i><a href=3D"https://arxiv.org/abs/2111.04338" target=3D"_blank= ">.</a><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"><br> </div> <div>On Sun, Aug 14, 2022, at 4:47 AM, Alain de Cheveigne wrote:<br> </div> <blockquote type=3D"cite" id=3D"gmail-m_657013924909301866gmail-m_580= 3798299368709206qt"> <div style=3D"font-family:Arial">Hi Dick, all,=C2=A0<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">A couple of thoughts.=C2=A0 I'= m no expert of spatial hearing, so they may be off the mark.=C2=A0<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">> And I don't think the sample interval of 1/16000 sec provides a strong inherent limit on ITD accuracy.=C2=A0 The bandwidth of 8 kHz is about half of what's "normal", so theoretical TDOA resolution = should be expected to be no worse than double normal, say 20=E2=80=9340 microseconds (about half of a sample interval) instead of 10=E2=80=9320 microseconds.=C2=A0 I wouldn't be surprised if = the ITD resolution threshold was even closer to normal (around 1/4 sample interval), since our ITD-computing structure is dominated by lower-frequency input.<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">Indeed, the sample interval does not limit ITD estimation resolution.=C2=A0 You can get arbitrary resolution by interpolating the cross-correlation function near its peak (for example by fitting a parabola to three samples closest to the peak).=C2=A0 A similar argument applies to fundamental frequency estimation (--> pitch) from the autocorrelation function as in the YIN method.=C2=A0<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">This 
assumes that the CCF or ACF is smooth enough for the interpolation to be accurate, and for that the audio signals must be smooth, i.e. band-limited.=C2=A0 T= he purpose of a low-pass antialiasing filter associated with sampling or resampling is to *insure* that this is the case for typical signals, but that "insurance" is unnecessar= y if the signals contain no high-frequency power to start with.=C2=A0= =C2=A0<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">Thus, the choice of low-pass filter is a bit of a free parameter under the control of the engineer or experimenter. A wide filter (or none) is OK if the signals are known to contain little or no high-frequency power, a sharp filter is needed if the signals are strongly high-pass. Engineers typically err on the side of precaution by designing filters with strong attenuation beyond Nyquist, usually with the additional goal of keeping the pass-band flat. This requires a filter with a long impulse response. There's lee-way in the exact choice. EEs love the topic.<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">This brings me to my second point. Are there perceptual correlates of antialiasing filtering?=C2=A0 There are two reasons to suspect an effect on spatial hearing. First, a long IR might widen the CCF peak and blur the "crisp" peak in the short-term CCF associated = with a transient. Second, the frequency-domain features of the filter transfer function might interact with spectral notches characteristic of elevation or front-vs-back position of sources, particularly if those features are estimated by neural circuits also sensitive to time.<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">Again, this is pure speculation. 
Unfortunately, antialiasing filters are rarely specified in detail (in systems or studies), and I'm not aware of any stud= y aiming to characterize their perceptual effects or demonstrate that there are none.=C2=A0 Anecdotally, I remember being annoyed when listening to music on an early CD player, by what I attributed to high-frequency ringing of antialiasing or reconstruction filters with poles just below Nyquist. That was when I could still hear in that region...<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">Alain<br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial"><br> </div> <div style=3D"font-family:Arial">> On 14 Aug 2022, at 05:03, Richard F. Lyon <<a href=3D"mailto:DickLyon@xxxxxxxx" target=3D= "_blank">DickLyon@xxxxxxxx</a>> wrote:<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">> Yes, good idea to find some solutions to the difficult.<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">> Reviewing my book's Figur= e 22.7, there's a pretty good spectral notch cue to elevation i= n the 5.5-8 kHz region (and higher); 8 kHz might be enough for elevation up to about 45 degrees (find free book PDF via <a href=3D"http://machinehearing.org" target=3D"_blank">machinehe= aring.org</a> -- search that blog for "free".)<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">> For resolving front/back confusion, that's hard unless you add the effects of lateralization change with head turning.=C2=A0 Using a head track= er or gyro to change the lateral angle to the sound, relative to the head, is very effective for letting the user 
disambiguate, if they have time to move a little.=C2=A0 So it depends on what you're trying to do.<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">> If it was impossible to localize sounds with a 16 kHz sample rate, it would be equally impossible to localize sounds with no energy about 8 kHz.=C2=A0 I don't think that's the case.=C2=A0 I can't hear anyth= ing about 8 kHz (unless it's quite intense), and I don't sense that I= have any difficulty localizing sounds around me.=C2=A0 Probably if we measured though we'd find I'm not as accurate as a person= with better hearing.<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">> And I don't think the sample interval of 1/16000 sec provides a strong inherent limit on ITD accuracy.=C2=A0 The bandwidth of 8 kHz is about half of what's "normal", so theoretical TDOA resolution = should be expected to be no worse than double normal, say 20=E2=80=9340 microseconds (about half of a sample interval) instead of 10=E2=80=9320 microseconds.=C2=A0 I wouldn't be surprised if = the ITD resolution threshold was even closer to normal (around 1/4 sample interval), since our ITD-computing structure is dominated by lower-frequency input.<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">> Dick<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">> On Fri, Aug 12, 2022 at 9:20 PM Junfeng Li <<a href=3D"mailto:junfeng.li.1979@xxxxxxxx= m" target=3D"_blank">junfeng.li.1979@xxxxxxxx</a>> wrote:<br> </div> <div style=3D"font-family:Arial">> Dear Frederick,<br> </div> <div style=3D"font-family:Arial">>=C2=A0<br> </div> <div style=3D"font-family:Arial">> Thank you so much for the references that you 
mentioned.<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> "[...] up–down cues are located mainly in the 6–12-kHz band, and front–back cues in the 8–16-kHz band."<br> </div> <div style="font-family:Arial">> According to this statement, it seems impossible to solve the problems of elevation perception and front-back confusion when the output signal is sampled at 16 kHz.<br> </div> <div style="font-family:Arial">> Though I know it is difficult, I always try to find some solutions.<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Thanks again.<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Best regards,<br> </div> <div style="font-family:Arial">> Junfeng<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> On Sat, Aug 13, 2022 at 12:50 AM Frederick Gallun <<a href="mailto:fgallun@xxxxxxxx" target="_blank">fgallun@xxxxxxxx</a>> wrote:<br> </div> <div style="font-family:Arial">> The literature on the HRTF over the past 60 years has made it very clear that "[...] up–down cues are located mainly in the 6–12-kHz band, and front–back cues in the 8–16-kHz band." (Langendijk and Bronkhorst, 2002)<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Here are a few places to start:<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Langendijk, E. H. A., & Bronkhorst, A. W. (2002). Contribution of spectral cues to human sound localization. 
The Journal of the Acoustical Society of America, 112(4), 1583–1596. <a href="https://doi.org/10.1121/1.1501901" target="_blank">https://doi.org/10.1121/1.1501901</a><br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Mehrgardt, S., & Mellert, V. (1977). Transformation characteristics of the external human ear. The Journal of the Acoustical Society of America, 61(6), 1567–1576. <a href="https://doi.org/10.1121/1.381470" target="_blank">https://doi.org/10.1121/1.381470</a><br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Shaw, E. A. G., & Teranishi, R. (1968). Sound Pressure Generated in an External-Ear Replica and Real Human Ears by a Nearby Point Source. The Journal of the Acoustical Society of America, 44(1), 240–249. <a href="https://doi.org/10.1121/1.1911059" target="_blank">https://doi.org/10.1121/1.1911059</a><br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> ---------------------------------------------<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Frederick (Erick) Gallun, PhD, FASA, FASHA | he/him/his<br> </div> <div style="font-family:Arial">> Professor, Oregon Hearing Research Center, Oregon Health & Science University<br> </div> <div style="font-family:Arial">> "Diversity is like being invited to a party, Inclusion is being asked to dance, and Belonging is dancing like no one’s watching" - Gregory Lewis<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> On Thu, Aug 11, 2022 at 11:59 PM Junfeng Li <<a href="mailto:junfeng.li.1979@xxxxxxxx" target="_blank">junfeng.li.1979@xxxxxxxx</a>> wrote:<br> </div> <div style="font-family:Arial">> Dear Leslie,<br> </div> <div 
style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> When downsampling to 8/16 kHz, we did find that the localization accuracy decreases, even for horizon<br> </div> <div style="font-family:Arial">> Do you have any good ideas to solve it?<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Thanks a lot.<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Best regards,<br> </div> <div style="font-family:Arial">> Junfeng<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> On Thu, Aug 11, 2022 at 4:04 PM Prof Leslie Smith <<a href="mailto:l.s.smith@xxxxxxxx" target="_blank">l.s.smith@xxxxxxxx</a>> wrote:<br> </div> <div style="font-family:Arial">> I'd also wonder about the time resolution: 16 kHz = 1/16000 sec between<br> </div> <div style="font-family:Arial">> samples = 62.5 microseconds.<br> </div> <div style="font-family:Arial">> That's relatively long for ITD (TDOA) estimation, which would suggest that<br> </div> <div style="font-family:Arial">> localisation of lower frequency signals would be impeded.<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> (I don't have evidence for this: it's just a suggestion).<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> --Leslie Smith<br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> Junfeng Li wrote:<br> </div> <div style="font-family:Arial">> > Dear all,<br> </div> <div style="font-family:Arial">> ><br> </div> <div style="font-family:Arial">> > We are working on 3D audio rendering for signals with low sampling<br> </div> <div style="font-family:Arial">> > frequency.<br> </div> 
<div style="font-family:Arial">> > As you may know, HRTFs are normally measured at a high sampling<br> </div> <div style="font-family:Arial">> > frequency, e.g., 48 kHz or 44.1 kHz. However, the sampling frequency of<br> </div> <div style="font-family:Arial">> > sound<br> </div> <div style="font-family:Arial">> > signals in our application is restricted to 16 kHz. Therefore, to render<br> </div> <div style="font-family:Arial">> > this low-frequency (≤8 kHz) signal, one straightforward way is to first<br> </div> <div style="font-family:Arial">> > downsample<br> </div> <div style="font-family:Arial">> > the HRTFs from 48 kHz/44.1 kHz to 16 kHz and then convolve with sound<br> </div> <div style="font-family:Arial">> > signals.<br> </div> <div style="font-family:Arial">> > However, the sound localization performance of the signal rendered with<br> </div> <div style="font-family:Arial">> > this approach is greatly decreased, especially for elevation perception. To<br> </div> <div style="font-family:Arial">> > improve the sound localization performance, I am now wondering whether<br> </div> <div style="font-family:Arial">> > there is a good method to solve or mitigate this problem in this<br> </div> <div style="font-family:Arial">> > scenario.<br> </div> <div style="font-family:Arial">> ><br> </div> <div style="font-family:Arial">> > Any discussion is welcome.<br> </div> <div style="font-family:Arial">> ><br> </div> <div style="font-family:Arial">> > Thanks a lot again.<br> </div> <div style="font-family:Arial">> ><br> </div> <div style="font-family:Arial">> > Best regards,<br> </div> <div style="font-family:Arial">> > Junfeng<br> </div> <div style="font-family:Arial">> ><br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial">> -- <br> </div> <div style="font-family:Arial">> Prof Leslie Smith (Emeritus)<br> </div> <div 
style="font-family:Arial">> Computing Science & Mathematics,<br> </div> <div style="font-family:Arial">> University of Stirling, Stirling FK9 4LA<br> </div> <div style="font-family:Arial">> Scotland, UK<br> </div> <div style="font-family:Arial">> Tel +44 1786 467435<br> </div> <div style="font-family:Arial">> Web: <a href="http://www.cs.stir.ac.uk/~lss" target="_blank">http://www.cs.stir.ac.uk/~lss</a><br> </div> <div style="font-family:Arial">> Blog: <a href="http://lestheprof.com" target="_blank">http://lestheprof.com</a><br> </div> <div style="font-family:Arial">> <br> </div> <div style="font-family:Arial"><br> </div> </blockquote> <div style="font-family:Arial"><br> </div> </blockquote> <div>-- <br> Piotr Majdak<br> <a href="https://www.oeaw.ac.at/isf/forschung/fachbereiche-teams/hoeren" target="_blank">Fachbereich Hören</a><br> <a href="https://www.oeaw.ac.at/isf" target="_blank">Institut für Schallforschung</a><br> <a href="http://www.oeaw.ac.at/" target="_blank">Österreichische Akademie der Wissenschaften</a><br> Wohllebengasse 12-14, 1040 Wien<br> Tel.: +43 1 51581-2511<br> </div> </div> </blockquote></div></div> </blockquote></div></div> --00000000000038cd6605e660b863--

This message came from the mail archive
src/postings/2022/
maintained by:

DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University