Re: [AUDITORY] On 3D audio rendering for signals with the low sampling frequency

Dear Piotr,

Our experimental results also showed an increase in sound-localization errors in vertical planes when the sampling frequency is 16 kHz or 8 kHz, which is consistent with the findings reported in the two papers that you mentioned.

Although a system with a high sampling frequency would be preferable, the sampling frequency of our playback system is limited to 8 or 16 kHz, and we cannot change it at present. Therefore, I am now looking for an alternative approach to solve or mitigate this problem.

Best regards,

Junfeng

On Tue, Aug 16, 2022 at 12:17 PM Piotr Majdak <piotr@xxxxxxxxxx> wrote:

Dear all,

With respect to sound localization in vertical planes: we could also look at what happens when the spectral content above 8 kHz is removed:

Best et al. (2005, “The role of high frequencies in speech localization,” JASA 118, 353–63): Results for the sound localization with low-pass filtered (8 kHz) speech (their Exp I) show a drastic increase of sound-localization errors in vertical planes (their Fig. 5), no effect in the lateral plane.

Majdak et al. (2013, “Effect of long-term training on sound localization performance with spectrally warped and band-limited head-related transfer functions,” JASA 134, 2148–2159): Results for the sound localization with low-pass filtered (8 kHz) white noises show a large increase of localization errors (Fig. 6, red circles at "Pre") in vertical planes (front/back, top/down), no changes in the lateral dimensions (left/right).

It's not encouraging for systems going up to 8 kHz only, though :-(. And of course, consideration of head movements may help...

Best regards,

Piotr

Am 15.08.2022 um 00:52 schrieb Adam Weisser:

Dear Junfeng, Alain, and all,

I think that some solutions to the undersampling / aliasing problem that you described should exist, but they likely depend on where the sampling-rate bottleneck lies: at the input, in processing, at the output stage, or in all of them. Also, it depends on the computational capabilities of the system and whether it has to work in real time, and if so what the permissible delay is.

I'm aware of two general approaches to circumvent the Nyquist criterion:

1. Compressed sensing - This heavily researched signal-processing method uses signal sparsity to faithfully reconstruct undersampled signals [1].

2. Trading off aliasing and noise - This is a classical result that employs nonuniform sampling at lower rates than Nyquist, whereby the aliasing that would otherwise arise is replaced by noise [2]. It is thought that this is what happens in the retina, where the optical image is densely sampled in the fovea by the photoreceptors, but becomes gradually undersampled away from the fovea [3]. Had the photoreceptor density been uniform and regular over the retina, the resolution of the central vision would greatly suffer and the image would also be severely aliased. However, this trick works only if the sampling is truly stochastic. If the "localization noise" level (perhaps manifesting as audio noise) can be sacrificed, then this approach may work, combined with dither.

Regardless of the specific system architecture at hand, neither of these methods appears straightforward to implement.
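
To make the aliasing-versus-noise trade-off of the second approach concrete, here is a minimal sketch. It is not the construction of [2]: it simply assumes sample instants drawn independently and uniformly at random, and every name and number in it (an 11 kHz tone, an average rate of 16 kHz, 4000 samples) is a hypothetical choice for illustration. It estimates one Fourier component at the true tone frequency and one at the frequency where a uniform 16 kHz sampler would place the alias.

```python
import cmath
import math
import random

FS_AVG = 16000.0            # average sampling rate in Hz (assumed)
F_TONE = 11000.0            # test tone above the 8 kHz Nyquist limit (assumed)
F_ALIAS = FS_AVG - F_TONE   # 5 kHz: where uniform sampling folds the tone
N = 4000                    # number of samples (0.25 s at the average rate)


def tone(t):
    """Unit-amplitude cosine evaluated at time t (seconds)."""
    return math.cos(2.0 * math.pi * F_TONE * t)


def fourier_magnitude(times, values, freq_hz):
    """Magnitude of (1/N) * sum_n x(t_n) * exp(-j 2 pi f t_n)."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq_hz * t)
              for t, v in zip(times, values))
    return abs(acc) / len(times)


# Regular sampling grid versus random instants with the same average rate.
uniform_t = [n / FS_AVG for n in range(N)]
rng = random.Random(0)      # fixed seed so the sketch is repeatable
random_t = [rng.uniform(0.0, N / FS_AVG) for _ in range(N)]

results = {}
for name, times in (("uniform", uniform_t), ("random", random_t)):
    values = [tone(t) for t in times]
    results[name] = (fourier_magnitude(times, values, F_TONE),
                     fourier_magnitude(times, values, F_ALIAS))
    print(f"{name:8s} |X(11 kHz)| = {results[name][0]:.3f}   "
          f"|X(5 kHz)| = {results[name][1]:.3f}")
```

With the regular grid, the two estimates coincide, so the 11 kHz tone cannot be told apart from a 5 kHz one. With random instants, the component at 11 kHz stays near its true value, while the energy at 5 kHz shrinks to a noise-like residue that falls roughly as 1/sqrt(N). That residue is the price paid, spread over all frequencies.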

Finally, regarding Alain's comment about auditory sampling - the neat trick that is found in spatial processing of vision may be analogous to what goes on in temporal processing of stimuli at the transduction stage of the auditory nerve. Neural adaptation can be thought of as dense sampling of the signal around its onset / transient portion, which becomes more sparsely sampled quickly after the onset. Because of adaptation, this effect is very elusive, but I believe that it is measurable notwithstanding. I tried to demonstrate it psychoacoustically in Appendix E of [4]. While I don't know how it relates to binaural processing directly, there may be instantaneous effects that may be detectable there too, given that the input to both processing types is the same.

All the best,

Adam.

[1] Candes, E. J., Romberg, J. K., & Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8), 1207-1223.

[2] Shapiro, Harold S and Silverman, Richard A. Alias-free sampling of random noise. Journal of the Society for Industrial and Applied Mathematics, 8(2):225–248, 1960.

[3] Yellott, John I. Spectral consequences of photoreceptor sampling in the rhesus retina. Science, 221(4608):382–385, 1983.

[4] Weisser, A. (2021). Treatise on Hearing: The Temporal Auditory Imaging Theory Inspired by Optics and Communication. arXiv preprint arXiv:2111.04338.

On Sun, Aug 14, 2022, at 4:47 AM, Alain de Cheveigne wrote:

Hi Dick, all,

A couple of thoughts. I'm no expert of spatial hearing, so they may be off the mark.

> And I don't think the sample interval of 1/16000 sec provides a strong inherent limit on ITD accuracy. The bandwidth of 8 kHz is about half of what's "normal", so theoretical TDOA resolution should be expected to be no worse than double normal, say 20–40 microseconds (about half of a sample interval) instead of 10–20 microseconds. I wouldn't be surprised if the ITD resolution threshold was even closer to normal (around 1/4 sample interval), since our ITD-computing structure is dominated by lower-frequency input.

Indeed, the sample interval does not limit ITD estimation resolution. You can get arbitrary resolution by interpolating the cross-correlation function near its peak (for example by fitting a parabola to three samples closest to the peak). A similar argument applies to fundamental frequency estimation (--> pitch) from the autocorrelation function as in the YIN method.

This assumes that the CCF or ACF is smooth enough for the interpolation to be accurate, and for that the audio signals must be smooth, i.e. band-limited. The purpose of a low-pass antialiasing filter associated with sampling or resampling is to *insure* that this is the case for typical signals, but that "insurance" is unnecessary if the signals contain no high-frequency power to start with.
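
As a minimal sketch of the interpolation idea, the following estimates a 20 microsecond delay (about a third of a sample at 16 kHz) by fitting a parabola to the three cross-correlation samples nearest the peak. The test signal (a Gaussian-windowed 500 Hz tone), the buffer length and all function names are assumptions made for illustration; the signal is deliberately smooth, in line with the band-limiting caveat above.

```python
import math

FS = 16000.0            # sampling rate in Hz
TRUE_DELAY_S = 20e-6    # delay to recover: 0.32 samples at 16 kHz (assumed)


def pulse(t):
    """Smooth test signal: Gaussian-windowed 500 Hz tone centred at 8 ms."""
    envelope = math.exp(-((t - 0.008) / 0.001) ** 2)
    return envelope * math.cos(2.0 * math.pi * 500.0 * t)


def cross_correlation(x, y, max_lag):
    """Return {lag: sum_n x[n] * y[n + lag]} for lags in [-max_lag, max_lag]."""
    n = len(x)
    ccf = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                total += x[i] * y[j]
        ccf[lag] = total
    return ccf


def estimate_delay(x, y, max_lag):
    """Delay of y relative to x, in fractional samples."""
    ccf = cross_correlation(x, y, max_lag)
    # Integer-lag peak, kept away from the ends so both neighbours exist.
    k = max(range(-max_lag + 1, max_lag), key=lambda lag: ccf[lag])
    a, b, c = ccf[k - 1], ccf[k], ccf[k + 1]
    # Vertex of the parabola through (k-1, a), (k, b), (k+1, c).
    denom = a - 2.0 * b + c
    offset = 0.0 if denom == 0.0 else 0.5 * (a - c) / denom
    return k + offset


x = [pulse(n / FS) for n in range(256)]
y = [pulse(n / FS - TRUE_DELAY_S) for n in range(256)]   # delayed copy

delay_samples = estimate_delay(x, y, max_lag=8)
delay_us = delay_samples / FS * 1e6
print(f"estimated delay: {delay_samples:.3f} samples = {delay_us:.1f} us")
```

The integer-lag peak alone would report a delay of 0 samples here; the parabolic fit recovers the sub-sample part. With a less smooth (wider-band or aliased) signal the three-point fit would be expected to become biased.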

Thus, the choice of low-pass filter is a bit of a free parameter under the control of the engineer or experimenter. A wide filter (or none) is OK if the signals are known to contain little or no high-frequency power; a sharp filter is needed if the signals are strongly high-pass. Engineers typically err on the side of precaution by designing filters with strong attenuation beyond Nyquist, usually with the additional goal of keeping the pass-band flat. This requires a filter with a long impulse response. There's leeway in the exact choice. EEs love the topic.
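
To put rough numbers on "long impulse response", here is a small sketch based on Kaiser's empirical estimate for windowed FIR low-pass filters, N ≈ (A − 7.95) / (2.285 Δω), where A is the stop-band attenuation in dB and Δω the transition width in radians per sample. The scenario (an antialiasing filter running at 48 kHz with 80 dB attenuation) and the function name are assumptions for illustration only, not a description of any particular system.

```python
import math


def kaiser_taps(atten_db, transition_hz, fs_hz):
    """Approximate FIR length (taps) from Kaiser's empirical formula.

    atten_db      -- desired stop-band attenuation in dB (formula holds for > 21 dB)
    transition_hz -- width of the transition band in Hz
    fs_hz         -- rate at which the filter runs, in Hz
    """
    delta_omega = 2.0 * math.pi * transition_hz / fs_hz   # rad/sample
    order = (atten_db - 7.95) / (2.285 * delta_omega)
    return math.ceil(order) + 1                            # taps = order + 1


FS = 48000.0    # assumed rate before decimation to 16 kHz
ATTEN = 80.0    # assumed stop-band attenuation in dB

for width_hz in (2000.0, 500.0, 100.0):
    taps = kaiser_taps(ATTEN, width_hz, FS)
    duration_ms = taps / FS * 1e3
    print(f"transition {width_hz:6.0f} Hz -> {taps:5d} taps "
          f"(impulse response about {duration_ms:.1f} ms)")
```

The length grows in inverse proportion to the transition width, so a filter that stays flat almost up to Nyquist and then drops steeply has an impulse response lasting tens of milliseconds, whereas a gentle one lasts only a few.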

This brings me to my second point. Are there perceptual correlates of antialiasing filtering? There are two reasons to suspect an effect on spatial hearing. First, a long IR might widen the CCF peak and blur the "crisp" peak in the short-term CCF associated with a transient. Second, the frequency-domain features of the filter transfer function might interact with spectral notches characteristic of elevation or front-vs-back position of sources, particularly if those features are estimated by neural circuits also sensitive to time.

Again, this is pure speculation. Unfortunately, antialiasing filters are rarely specified in detail (in systems or studies), and I'm not aware of any study aiming to characterize their perceptual effects or demonstrate that there are none. Anecdotally, I remember being annoyed when listening to music on an early CD player, by what I attributed to high-frequency ringing of antialiasing or reconstruction filters with poles just below Nyquist. That was when I could still hear in that region...

Alain

> On 14 Aug 2022, at 05:03, Richard F. Lyon <DickLyon@xxxxxxx> wrote:

>

> Yes, good idea to find some solutions to the difficulty.

>

> Reviewing my book's Figure 22.7, there's a pretty good spectral notch cue to elevation in the 5.5-8 kHz region (and higher); 8 kHz might be enough for elevation up to about 45 degrees (find free book PDF via machinehearing.org -- search that blog for "free".)

>

> For resolving front/back confusion, that's hard unless you add the effects of lateralization change with head turning. Using a head tracker or gyro to change the lateral angle to the sound, relative to the head, is very effective for letting the user disambiguate, if they have time to move a little. So it depends on what you're trying to do.

>

> If it was impossible to localize sounds with a 16 kHz sample rate, it would be equally impossible to localize sounds with no energy above 8 kHz. I don't think that's the case. I can't hear anything above 8 kHz (unless it's quite intense), and I don't sense that I have any difficulty localizing sounds around me. Probably if we measured though we'd find I'm not as accurate as a person with better hearing.

>

> And I don't think the sample interval of 1/16000 sec provides a strong inherent limit on ITD accuracy. The bandwidth of 8 kHz is about half of what's "normal", so theoretical TDOA resolution should be expected to be no worse than double normal, say 20–40 microseconds (about half of a sample interval) instead of 10–20 microseconds. I wouldn't be surprised if the ITD resolution threshold was even closer to normal (around 1/4 sample interval), since our ITD-computing structure is dominated by lower-frequency input.

>

> Dick

>

>

>

>

> On Fri, Aug 12, 2022 at 9:20 PM Junfeng Li <junfeng.li.1979@xxxxxxxxx> wrote:

> Dear Frederick,

>

> Thank you so much for the references that you mentioned.

>

> "[...] up–down cues are located mainly in the 6–12-kHz band, and front–back cues in the 8–16-kHz band."

> According to this statement, it seems impossible to solve the problems of elevation perception and front-back confusion when the output signal is sampled at 16kHz.

> Though I know it is difficult, I always try to find some solutions.

>

> Thanks again.

>

> Best regards,

> Junfeng

>

> On Sat, Aug 13, 2022 at 12:50 AM Frederick Gallun <fgallun@xxxxxxxxx> wrote:

> The literature on the HRTF over the past 60 years has made it very clear that "[...] up–down cues are located mainly in the 6–12-kHz band, and front–back cues in the 8–16-kHz band." (Langendijk and Bronkhorst, 2002)

>

> Here are a few places to start:

>

> Langendijk, E. H. A., & Bronkhorst, A. W. (2002). Contribution of spectral cues to human sound localization. The Journal of the Acoustical Society of America, 112(4), 1583–1596. https://doi.org/10.1121/1.1501901

>

> Mehrgardt, S., & Mellert, V. (1977). Transformation characteristics of the external human ear. The Journal of the Acoustical Society of America, 61(6), 1567–1576. https://doi.org/10.1121/1.381470

>

> Shaw, E. A. G., & Teranishi, R. (1968). Sound Pressure Generated in an External‐Ear Replica and Real Human Ears by a Nearby Point Source. The Journal of the Acoustical Society of America, 44(1), 240–249. https://doi.org/10.1121/1.1911059

>

> ---------------------------------------------

>

> Frederick (Erick) Gallun, PhD, FASA, FASHA | he/him/his

> Professor, Oregon Hearing Research Center, Oregon Health & Science University

> "Diversity is like being invited to a party, Inclusion is being asked to dance, and Belonging is dancing like no one’s watching" - Gregory Lewis

>

>

> On Thu, Aug 11, 2022 at 11:59 PM Junfeng Li <junfeng.li.1979@xxxxxxxxx> wrote:

> Dear Leslie,

>

> When downsampling to 8/16kHz, we really found the localization accuracy decreases, even for horizon

> Do you have any good ideas to solve it?

>

> Thanks a lot.

>

> Best regards,

> Junfeng

>

>

> On Thu, Aug 11, 2022 at 4:04 PM Prof Leslie Smith <l.s.smith@xxxxxxxxxxxxx> wrote:

> I'd also wonder about the time resolution: 16KHz = 1/16000 sec between

> samples = 62.5 microseconds.

> That's relatively long for ITD (TDOA) estimation, which would suggest that

> localisation of lower frequency signals would be impeded.

>

> (I don't have evidence for this: it's just a suggestion).

>

> --Leslie Smith

>

> Junfeng Li wrote:

> > Dear all,

> >

> > We are working on 3D audio rendering for signals with low sampling

> > frequency.

> > As you may know, the HRTFs are normally measured at the high sampling

> > frequency, e.g., 48kHz or 44.1kHz. However, the sampling frequency of

> > sound

> > signals in our application is restricted to 16 kHz. Therefore, to render

> > this low-frequency (≤8kHz) signal, one straightforward way is to first

> > downsample

> > the HRTFs from 48kHz/44.1kHz to 16kHz and then convolve with sound

> > signals.

> > However, the sound localization performance of the signal rendered with

> > this approach is greatly decreased, especially elevation perception. To

> > improve the sound localization performance, I am now wondering whether

> > there is a certain good method to solve or mitigate this problem in this

> > scenario.

> >

> > Any discussion is welcome.

> >

> > Thanks a lot again.

> >

> > Best regards,

> > Junfeng

> >

>

>

> --

> Prof Leslie Smith (Emeritus)

> Computing Science & Mathematics,

> University of Stirling, Stirling FK9 4LA

> Scotland, UK

> Tel +44 1786 467435

> Web: http://www.cs.stir.ac.uk/~lss

> Blog: http://lestheprof.com

>

--
Piotr Majdak
Fachbereich Hören
Institut für Schallforschung
Österreichische Akademie der Wissenschaften
Wohllebengasse 12-14, 1040 Wien
Tel.: +43 1 51581-2511