I think that some solutions to the
undersampling / aliasing problem that you described should
exist, but they likely depend on where the sampling-rate
bottleneck lies: at the input, in processing, at the output
stage, or in all of them. Also, it depends on the computational
capabilities of the system and whether it has to work in real
time, and if so what the permissible delay is.
I'm aware of two general
approaches to circumvent the Nyquist criterion:
1. Compressed sensing - This
heavily researched signal-processing method uses signal sparsity
to faithfully reconstruct undersampled signals [1].
Regardless of the specific system
architecture at hand, none of these methods appears
straightforward to implement.
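
To make the compressed-sensing idea in item 1 a little more concrete, here is a minimal sketch in Python (NumPy assumed). It recovers a sparse vector from fewer linear measurements than unknowns using orthogonal matching pursuit, a simple greedy algorithm chosen here for brevity; [1] analyses recovery by l1 minimisation, not this method. The random Gaussian measurement matrix, the problem sizes and the helper name `omp` are illustrative assumptions, and nothing in the sketch is specific to audio or HRTFs.

```python
import numpy as np

def omp(A, y, k):
    """Greedy sparse recovery: find a k-sparse x such that A @ x is close to y."""
    n = A.shape[1]
    support = []                      # indices of the columns selected so far
    coeffs = np.zeros(0)
    residual = y.copy()
    for _ in range(k):
        # Pick the column most correlated with what is still unexplained.
        scores = np.abs(A.T @ residual)
        if support:
            scores[support] = 0.0     # never pick the same column twice
        support.append(int(np.argmax(scores)))
        # Least-squares fit of y on the selected columns only.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    x = np.zeros(n)
    x[support] = coeffs
    return x

# Toy problem (all sizes are assumptions): 256 unknowns, 100 measurements.
rng = np.random.default_rng(0)
n, m, k = 256, 100, 3
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)        # unit-norm columns
x_true = np.zeros(n)
x_true[[20, 97, 201]] = [1.0, -1.2, 0.8]
y = A @ x_true                        # m < n: fewer measurements than unknowns
x_hat = omp(A, y, k)
print("largest reconstruction error:", np.max(np.abs(x_hat - x_true)))
```

The sketch only works because the unknown vector is sparse in the basis being measured; whether audio or HRTF data are sparse enough in some basis for this to help at a 16 kHz rate is a separate question that the toy problem does not address.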
Finally, regarding Alain's comment
about auditory sampling - the neat trick that is found in
spatial processing of vision may be analogous to what goes on in
temporal processing of stimuli at the transduction stage of the
auditory nerve. Neural adaptation can be thought of as dense
sampling of the signal around its onset / transient portion,
which becomes more sparsely sampled quickly after the onset.
Because of adaptation, this effect is very elusive, but I
believe that it is measurable notwithstanding. I tried to
demonstrate it psychoacoustically in Appendix E of [4]. While I
don't know how it relates to binaural processing directly,
instantaneous effects of this kind may be detectable there too,
given that the input to both processing types is the same.
[1] Candes, E. J., Romberg, J. K.,
& Tao, T. (2006). Stable signal recovery from incomplete and
inaccurate measurements. Communications on Pure and Applied
Mathematics, 59(8), 1207–1223.
(4608):382–385, 1983.
Hi Dick, all,
A couple of thoughts. I'm no
expert on spatial hearing, so they may be off the mark.
> And I don't think the
sample interval of 1/16000 sec provides a strong inherent
limit on ITD accuracy. The bandwidth of 8 kHz is about half
of what's "normal", so theoretical TDOA resolution should be
expected to be no worse than double normal, say 20–40
microseconds (about half of a sample interval) instead of
10–20 microseconds. I wouldn't be surprised if the ITD
resolution threshold was even closer to normal (around 1/4
sample interval), since our ITD-computing structure is
dominated by lower-frequency input.
Indeed, the sample interval does
not limit ITD estimation resolution. You can get arbitrary
resolution by interpolating the cross-correlation function
near its peak (for example by fitting a parabola to three
samples closest to the peak). A similar argument applies to
fundamental frequency estimation (--> pitch) from the
autocorrelation function as in the YIN method.
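
As an illustration of the parabolic interpolation mentioned above, here is a minimal sketch in Python (NumPy assumed). It estimates a delay of 20 microseconds between two signals sampled at 16 kHz, i.e. about a third of the 62.5-microsecond sample interval. The function name `estimate_delay`, the test tones and the Hann taper are assumptions made for this example; real binaural signals with noise and reverberation would behave less cleanly.

```python
import numpy as np

def estimate_delay(x, y, fs):
    """Delay of y relative to x in seconds, refined by a three-point parabola."""
    ccf = np.correlate(y, x, mode="full")      # lags -(len(x)-1) ... len(y)-1
    k = int(np.argmax(ccf))
    offset = 0.0
    if 0 < k < len(ccf) - 1:
        a, b, c = ccf[k - 1], ccf[k], ccf[k + 1]
        curvature = a - 2.0 * b + c
        if curvature != 0.0:
            offset = 0.5 * (a - c) / curvature  # vertex of the fitted parabola
    return (k - (len(x) - 1) + offset) / fs

# Hypothetical test signal: a few low-frequency tones, 20 us true delay.
fs = 16000.0
true_delay = 20e-6                              # about 1/3 of a sample
t = np.arange(4096) / fs
freqs = [300.0, 500.0, 800.0, 1200.0]

def tones(time):
    """Sum of sinusoids with arbitrary fixed phases, evaluated at 'time'."""
    return sum(np.sin(2 * np.pi * f * time + i) for i, f in enumerate(freqs))

window = np.hanning(len(t))                     # taper to limit edge effects
left = window * tones(t)
right = window * tones(t - true_delay)          # delayed copy of the same signal
estimate = estimate_delay(left, right, fs)
print(f"estimated delay: {estimate * 1e6:.2f} microseconds")
```

The integer-lag peak alone would report zero delay here; the refinement recovers the sub-sample part. The parabola is only an approximation of the true peak shape, so a small bias remains, and it grows as the signal bandwidth approaches Nyquist.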
This assumes that the CCF or ACF
is smooth enough for the interpolation to be accurate, and for
that the audio signals must be smooth, i.e. band-limited. The
purpose of a low-pass antialiasing filter associated with
sampling or resampling is to *insure* that this is the case
for typical signals, but that "insurance" is unnecessary if
the signals contain no high-frequency power to start with.
Thus, the choice of low-pass
filter is a bit of a free parameter under the control of the
engineer or experimenter. A wide filter (or none) is OK if the
signals are known to contain little or no high-frequency
power; a sharp filter is needed if the signals are strongly
high-pass. Engineers typically err on the side of caution
by designing filters with strong attenuation beyond Nyquist,
usually with the additional goal of keeping the pass-band
flat. This requires a filter with a long impulse response.
There's leeway in the exact choice. EEs love the topic.
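
The trade-off between sharpness and impulse-response length can be seen in a small sketch (Python, NumPy assumed). It designs Hamming-windowed sinc low-pass filters of two lengths and compares their gain just beyond a target Nyquist frequency of 8 kHz. The windowed-sinc method, the 48 kHz source rate, the 7.5 kHz cutoff and the helper names are assumptions for illustration only; they are not a description of any particular system's antialiasing filter.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs, num_taps):
    """Hamming-windowed sinc low-pass filter, normalised to unit gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    h = np.sinc(2.0 * cutoff_hz / fs * n) * np.hamming(num_taps)
    return h / np.sum(h)

def gain_at(h, freq_hz, fs):
    """Magnitude of the filter's frequency response at a single frequency."""
    n = np.arange(len(h))
    return np.abs(np.sum(h * np.exp(-2j * np.pi * freq_hz * n / fs)))

fs = 48000.0          # assumed original rate, before resampling to 16 kHz
cutoff = 7500.0       # assumed cutoff, just below the new Nyquist (8 kHz)
probe = 8500.0        # a frequency that would alias after downsampling

gains = {}
for taps in (31, 255):
    h = lowpass_fir(cutoff, fs, taps)
    gains[taps] = gain_at(h, probe, fs)
    duration_ms = taps / fs * 1e3
    print(f"{taps:4d} taps ({duration_ms:.2f} ms): "
          f"{20 * np.log10(gains[taps]):.1f} dB at {probe / 1e3:.1f} kHz")
```

The short filter leaves a substantial fraction of the 8.5 kHz component in place, while the long one suppresses it strongly, at the cost of an impulse response several milliseconds long.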
This brings me to my second
point. Are there perceptual correlates of antialiasing
filtering? There are two reasons to suspect an effect on
spatial hearing. First, a long IR might widen the CCF peak and
blur the "crisp" peak in the short-term CCF associated with a
transient. Second, the frequency-domain features of the filter
transfer function might interact with spectral notches
characteristic of elevation or front-vs-back position of
sources, particularly if those features are estimated by
neural circuits also sensitive to time.
Again, this is pure speculation.
Unfortunately, antialiasing filters are rarely specified in
detail (in systems or studies), and I'm not aware of any study
aiming to characterize their perceptual effects or demonstrate
that there are none. Anecdotally, I remember being annoyed
when listening to music on an early CD player, by what I
attributed to high-frequency ringing of antialiasing or
reconstruction filters with poles just below Nyquist. That was
when I could still hear in that region...
Alain
>
> Yes, good idea to find some
solutions to the difficult.
>
> Reviewing my book's Figure
22.7, there's a pretty good spectral notch cue to elevation in
the 5.5-8 kHz region (and higher); 8 kHz might be enough for
elevation up to about 45 degrees (find free book PDF via
machinehearing.org -- search that blog for "free".)
>
> For resolving front/back
confusion, that's hard unless you add the effects of
lateralization change with head turning. Using a head tracker
or gyro to change the lateral angle to the sound, relative to
the head, is very effective for letting the user disambiguate,
if they have time to move a little. So it depends on what
you're trying to do.
>
> If it was impossible to
localize sounds with a 16 kHz sample rate, it would be equally
impossible to localize sounds with no energy above 8 kHz. I
don't think that's the case. I can't hear anything above 8
kHz (unless it's quite intense), and I don't sense that I have
any difficulty localizing sounds around me. Probably if we
measured though we'd find I'm not as accurate as a person with
better hearing.
>
> And I don't think the
sample interval of 1/16000 sec provides a strong inherent
limit on ITD accuracy. The bandwidth of 8 kHz is about half
of what's "normal", so theoretical TDOA resolution should be
expected to be no worse than double normal, say 20–40
microseconds (about half of a sample interval) instead of
10–20 microseconds. I wouldn't be surprised if the ITD
resolution threshold was even closer to normal (around 1/4
sample interval), since our ITD-computing structure is
dominated by lower-frequency input.
>
> Dick
>
>
>
>
> Dear Frederick,
>
> Thank you so much for the
references that you mentioned.
>
> "[...] up–down cues are
located mainly in the 6–12-kHz band, and front–back cues in
the 8–16-kHz band."
> According to this
statement, it seems impossible to solve the problems of
elevation perception and front-back confusion when the output
signal is sampled at 16kHz.
> Though I know it is
difficult, I always try to find some solutions.
>
> Thanks again.
>
> Best regards,
> Junfeng
>
> The literature on the HRTF
over the past 60 years has made it very clear that "[...]
up–down cues are located mainly in the 6–12-kHz band, and
front–back cues in the 8–16-kHz band." (Langendijk and
Bronkhorst, 2002)
>
> Here are a few places to
start:
>
> Langendijk, E. H. A., &
Bronkhorst, A. W. (2002). Contribution of spectral cues to
human sound localization. The Journal of the Acoustical
Society of America, 112(4), 1583–1596.
https://doi.org/10.1121/1.1501901
>
> Mehrgardt, S., &
Mellert, V. (1977). Transformation characteristics of the
external human ear. The Journal of the Acoustical Society of
America, 61(6), 1567–1576.
https://doi.org/10.1121/1.381470
>
> Shaw, E. A. G., &
Teranishi, R. (1968). Sound Pressure Generated in an
External‐Ear Replica and Real Human Ears by a Nearby Point
Source. The Journal of the Acoustical Society of America,
44(1), 240–249.
https://doi.org/10.1121/1.1911059
>
>
---------------------------------------------
>
> Frederick (Erick) Gallun,
PhD, FASA, FASHA | he/him/his
> Professor, Oregon Hearing
Research Center, Oregon Health & Science University
> "Diversity is like being
invited to a party, Inclusion is being asked to dance, and
Belonging is dancing like no one’s watching" - Gregory Lewis
>
>
> Dear Leslie,
>
> When downsampling to
8/16 kHz, we did find that localization accuracy decreases,
even for horizon
> Do you have any good ideas
to solve it?
>
> Thanks a lot.
>
> Best regards,
> Junfeng
>
>
> I'd also wonder about the
time resolution: 16 kHz = 1/16000 sec between
> samples = 62.5 microseconds.
> That's relatively long for
ITD (TDOA) estimation, which would suggest that
> localisation of lower
frequency signals would be impeded.
>
> (I don't have evidence for
this: it's just a suggestion).
>
> --Leslie Smith
>
> Junfeng Li wrote:
> > Dear all,
> >
> > We are working on 3D
audio rendering for signals with low sampling
> > frequency.
> > As you may know, the
HRTFs are normally measured at the high sampling
> > frequency, e.g., 48kHz
or 44.1kHz. However, the sampling frequency of
> > sound
> > signals in our
application is restricted to 16 kHz. Therefore, to render
> > this low-frequency
(≤8 kHz) signal, one straightforward way is to first
> > downsample
> > the HRTFs from
48kHz/44.1kHz to 16kHz and then convolve with sound
> > signals.
> > However, the sound
localization performance of the signal rendered with
> > this approach is
greatly degraded, especially for elevation perception. To
> > improve the sound
localization performance, I am now wondering whether
> > there is a
good method to solve or mitigate this problem in this
> > scenario.
> >
> > Any discussion is
welcome.
> >
> > Thanks a lot again.
> >
> > Best regards,
> > Junfeng
> >
>
>
> --
> Prof Leslie Smith
(Emeritus)
> Computing Science &
Mathematics,
> University of Stirling,
Stirling FK9 4LA
> Scotland, UK
> Tel +44 1786 467435
>