Re: [AUDITORY] On 3D audio rendering for signals with the low sampling frequency (Alain de Cheveigne)


Subject: Re: [AUDITORY] On 3D audio rendering for signals with the low sampling frequency
From:    Alain de Cheveigne  <alain.de.cheveigne@xxxxxxxx>
Date:    Sun, 14 Aug 2022 09:47:33 +0200

Hi Dick, all,

A couple of thoughts. I'm no expert of spatial hearing, so they may be
off the mark.

> And I don't think the sample interval of 1/16000 sec provides a strong
> inherent limit on ITD accuracy. The bandwidth of 8 kHz is about half of
> what's "normal", so theoretical TDOA resolution should be expected to be
> no worse than double normal, say 20–40 microseconds (about half of a
> sample interval) instead of 10–20 microseconds. I wouldn't be surprised
> if the ITD resolution threshold was even closer to normal (around 1/4
> sample interval), since our ITD-computing structure is dominated by
> lower-frequency input.

Indeed, the sample interval does not limit ITD estimation resolution.
You can get arbitrary resolution by interpolating the cross-correlation
function near its peak (for example by fitting a parabola to three
samples closest to the peak). A similar argument applies to fundamental
frequency estimation (--> pitch) from the autocorrelation function as in
the YIN method.

This assumes that the CCF or ACF is smooth enough for the interpolation
to be accurate, and for that the audio signals must be smooth, i.e.
band-limited. The purpose of a low-pass antialiasing filter associated
with sampling or resampling is to *insure* that this is the case for
typical signals, but that "insurance" is unnecessary if the signals
contain no high-frequency power to start with.

Thus, the choice of low-pass filter is a bit of a free parameter under
the control of the engineer or experimenter. A wide filter (or none) is
OK if the signals are known to contain little or no high-frequency
power, a sharp filter is needed if the signals are strongly high-pass.
Engineers typically err on the side of precaution by designing filters
with strong attenuation beyond Nyquist, usually with the additional goal
of keeping the pass-band flat. This requires a filter with a long
impulse response.
There's leeway in the exact choice. EEs love the topic.

This brings me to my second point. Are there perceptual correlates of
antialiasing filtering? There are two reasons to suspect an effect on
spatial hearing. First, a long IR might widen the CCF peak and blur the
"crisp" peak in the short-term CCF associated with a transient. Second,
the frequency-domain features of the filter transfer function might
interact with spectral notches characteristic of elevation or
front-vs-back position of sources, particularly if those features are
estimated by neural circuits also sensitive to time.

Again, this is pure speculation. Unfortunately, antialiasing filters are
rarely specified in detail (in systems or studies), and I'm not aware of
any study aiming to characterize their perceptual effects or demonstrate
that there are none. Anecdotally, I remember being annoyed when
listening to music on an early CD player, by what I attributed to
high-frequency ringing of antialiasing or reconstruction filters with
poles just below Nyquist. That was when I could still hear in that
region...

Alain

> On 14 Aug 2022, at 05:03, Richard F. Lyon <DickLyon@xxxxxxxx> wrote:
>
> Yes, good idea to find some solutions to the difficulty.
>
> Reviewing my book's Figure 22.7, there's a pretty good spectral notch
> cue to elevation in the 5.5-8 kHz region (and higher); 8 kHz might be
> enough for elevation up to about 45 degrees (find free book PDF via
> machinehearing.org -- search that blog for "free".)
>
> For resolving front/back confusion, that's hard unless you add the
> effects of lateralization change with head turning. Using a head
> tracker or gyro to change the lateral angle to the sound, relative to
> the head, is very effective for letting the user disambiguate, if they
> have time to move a little. So it depends on what you're trying to do.
>
> If it was impossible to localize sounds with a 16 kHz sample rate, it
> would be equally impossible to localize sounds with no energy above 8
> kHz. I don't think that's the case. I can't hear anything above 8 kHz
> (unless it's quite intense), and I don't sense that I have any
> difficulty localizing sounds around me. Probably if we measured though
> we'd find I'm not as accurate as a person with better hearing.
>
> And I don't think the sample interval of 1/16000 sec provides a strong
> inherent limit on ITD accuracy. The bandwidth of 8 kHz is about half of
> what's "normal", so theoretical TDOA resolution should be expected to be
> no worse than double normal, say 20–40 microseconds (about half of a
> sample interval) instead of 10–20 microseconds. I wouldn't be surprised
> if the ITD resolution threshold was even closer to normal (around 1/4
> sample interval), since our ITD-computing structure is dominated by
> lower-frequency input.
>
> Dick
>
> On Fri, Aug 12, 2022 at 9:20 PM Junfeng Li <junfeng.li.1979@xxxxxxxx> wrote:
> Dear Frederick,
>
> Thank you so much for the references that you mentioned.
>
> "[...] up–down cues are located mainly in the 6–12-kHz band, and
> front–back cues in the 8–16-kHz band."
> According to this statement, it seems impossible to solve the problems
> of elevation perception and front-back confusion when the output signal
> is sampled at 16kHz.
> Though I know it is difficult, I always try to find some solutions.
>
> Thanks again.
>
> Best regards,
> Junfeng
>
> On Sat, Aug 13, 2022 at 12:50 AM Frederick Gallun <fgallun@xxxxxxxx> wrote:
> The literature on the HRTF over the past 60 years has made it very
> clear that "[...] up–down cues are located mainly in the 6–12-kHz band,
> and front–back cues in the 8–16-kHz band."
> (Langendijk and Bronkhorst, 2002)
>
> Here are a few places to start:
>
> Langendijk, E. H. A., & Bronkhorst, A. W. (2002). Contribution of
> spectral cues to human sound localization. The Journal of the Acoustical
> Society of America, 112(4), 1583–1596.
> https://doi.org/10.1121/1.1501901
>
> Mehrgardt, S., & Mellert, V. (1977). Transformation characteristics of
> the external human ear. The Journal of the Acoustical Society of
> America, 61(6), 1567–1576. https://doi.org/10.1121/1.381470
>
> Shaw, E. A. G., & Teranishi, R. (1968). Sound Pressure Generated in an
> External-Ear Replica and Real Human Ears by a Nearby Point Source. The
> Journal of the Acoustical Society of America, 44(1), 240–249.
> https://doi.org/10.1121/1.1911059
>
> ---------------------------------------------
>
> Frederick (Erick) Gallun, PhD, FASA, FASHA | he/him/his
> Professor, Oregon Hearing Research Center, Oregon Health & Science University
> "Diversity is like being invited to a party, Inclusion is being asked
> to dance, and Belonging is dancing like no one’s watching" - Gregory Lewis
>
> On Thu, Aug 11, 2022 at 11:59 PM Junfeng Li <junfeng.li.1979@xxxxxxxx> wrote:
> Dear Leslie,
>
> When downsampling to 8/16kHz, we really found the localization
> accuracy decreases, even for horizon
> Do you have any good ideas to solve it?
>
> Thanks a lot.
>
> Best regards,
> Junfeng
>
> On Thu, Aug 11, 2022 at 4:04 PM Prof Leslie Smith <l.s.smith@xxxxxxxx> wrote:
> I'd also wonder about the time resolution: 16KHz = 1/16000 sec between
> samples = 62.5 microseconds.
> That's relatively long for ITD (TDOA) estimation, which would suggest that
> localisation of lower frequency signals would be impeded.
>
> (I don't have evidence for this: it's just a suggestion).
>
> --Leslie Smith
>
> Junfeng Li wrote:
> > Dear all,
> >
> > We are working on 3D audio rendering for signals with low sampling
> > frequency.
> > As you may know, the HRTFs are normally measured at the high sampling
> > frequency, e.g., 48kHz or 44.1kHz. However, the sampling frequency of
> > sound signals in our application is restricted to 16 kHz. Therefore, to
> > render this low-frequency (≤8kHz) signal, one straight way is to first
> > downsample the HRTFs from 48kHz/44.1kHz to 16kHz and then convolve with
> > sound signals.
> > However, the sound localization performance of the signal rendered with
> > this approach is greatly decreased, especially elevation perception. To
> > improve the sound localization performance, I am now wondering whether
> > there is a certain good method to solve or mitigate this problem in this
> > scenario.
> >
> > Any discussion is welcome.
> >
> > Thanks a lot again.
> >
> > Best regards,
> > Junfeng
>
> --
> Prof Leslie Smith (Emeritus)
> Computing Science & Mathematics,
> University of Stirling, Stirling FK9 4LA
> Scotland, UK
> Tel +44 1786 467435
> Web: http://www.cs.stir.ac.uk/~lss
> Blog: http://lestheprof.com
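
The point in Alain's reply about interpolating the cross-correlation
function near its peak can be illustrated with a small sketch. This is
not code from the thread: the test signal, its frequencies, the 20
microsecond delay and the names source, ccf and estimate_itd are all
assumptions chosen for illustration. The signal is a sum of sinusoids
well below Nyquist and is periodic over the buffer, so the circular
cross-correlation used here is exact for this particular case.

```python
import math

FS = 16000          # sampling rate in Hz (sample interval 62.5 microseconds)
N = 1600            # 0.1 s buffer; each test frequency fits a whole number of cycles
TRUE_ITD = 20e-6    # assumed interaural delay in seconds (about 1/3 of a sample)
FREQS = [200.0, 500.0, 800.0, 1100.0, 1400.0]  # hypothetical low-frequency content


def source(t):
    """Band-limited test signal: a sum of sinusoids well below Nyquist."""
    return sum(math.sin(2 * math.pi * f * t + i) for i, f in enumerate(FREQS))


# "Left" and "right" channels: the same signal, the right one delayed by TRUE_ITD.
left = [source(k / FS) for k in range(N)]
right = [source(k / FS - TRUE_ITD) for k in range(N)]


def ccf(lag):
    """Cross-correlation at an integer lag (circular; exact for this periodic signal)."""
    return sum(left[k] * right[(k + lag) % N] for k in range(N))


def estimate_itd(max_lag=16):
    """Integer-lag peak search, refined by a parabola through three CCF samples."""
    peak = max(range(-max_lag, max_lag + 1), key=ccf)
    y_prev, y_peak, y_next = ccf(peak - 1), ccf(peak), ccf(peak + 1)
    # Vertex of the parabola through the three points, as an offset in samples.
    offset = 0.5 * (y_prev - y_next) / (y_prev - 2 * y_peak + y_next)
    return (peak + offset) / FS


estimated_itd = estimate_itd()
print(f"true ITD {TRUE_ITD * 1e6:.2f} us, estimated {estimated_itd * 1e6:.2f} us")
```

In this setup the integer-lag peak sits at lag 0, so without interpolation
the estimate would be off by the full 20 microseconds. The parabola fit is
only an approximation to the true CCF shape, so a small residual bias
remains, and it grows as the signal's energy moves closer to Nyquist.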


This message came from the mail archive
src/postings/2022/
maintained by:
Dan Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University