
Re: [AUDITORY] On 3D audio rendering for signals with a low sampling frequency



Dear Neeraj, 

Thank you very much for the discussion.


Thanks for sharing this observation here. I do not have a solution at the moment, but I am curious to know more.
I can relate the loss in elevation to poor capture of the spectral notches present in the HRTF, but I did not expect the notches beyond 8 kHz to be this crucial. Are the HRTFs personalized?
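
A minimal sketch of how one might check this, assuming numpy/soundfile and a measured 48 kHz mono HRIR stored in a hypothetical file "hrir_left_48k.wav"; it simply searches for deep local minima in the magnitude response above 8 kHz:

    import numpy as np
    import soundfile as sf

    hrir, fs = sf.read("hrir_left_48k.wav")   # hypothetical file; e.g. fs = 48000, mono HRIR
    mag_db = 20 * np.log10(np.abs(np.fft.rfft(hrir, n=2048)) + 1e-12)
    freqs = np.fft.rfftfreq(2048, d=1.0 / fs)

    # Crude notch search: local minima at least 10 dB below the median level.
    is_min = (mag_db[1:-1] < mag_db[:-2]) & (mag_db[1:-1] < mag_db[2:])
    is_deep = mag_db[1:-1] < np.median(mag_db) - 10.0
    notch_freqs = freqs[1:-1][is_min & is_deep]
    print("candidate notches above 8 kHz (Hz):", notch_freqs[notch_freqs > 8000])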

Yeah. In theory, we can obtain high localization accuracy when individualized HRTFs are used. However, we found that even when individualized HRTFs and head tracking are used, sound localization with headphone rendering is still not as good as we expected. In particular, the errors in elevation perception are still large, although we know that localization accuracy in elevation is much lower than in the horizontal plane even for normal-hearing listeners.
 

Also, I am now wondering whether it is always the case that elevation information is poor for 16 kHz audio signals. Is there any literature on this?
Just as a quick shot, I will also try downsampling the HRTF to 16 kHz (without low-pass filtering) and see whether the aliased HRTF spectrum significantly corrupts the 3D perception. My bet is: not much. But I will keep my fingers crossed.
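
A minimal sketch of that comparison, assuming numpy/scipy and a random placeholder in place of a measured 48 kHz HRIR:

    import numpy as np
    from scipy.signal import resample_poly

    hrir_48k = np.random.randn(512)   # placeholder; substitute a measured 48 kHz HRIR

    # Proper decimation: anti-aliasing low-pass filter, then downsample by 3.
    hrir_16k_filtered = resample_poly(hrir_48k, up=1, down=3)

    # Naive decimation: keep every 3rd sample with no filtering, so energy
    # above 8 kHz folds back (aliases) into the 0-8 kHz band.
    hrir_16k_aliased = hrir_48k[::3]

    # Compare the two 16 kHz magnitude responses before running a listening test.
    mag_filtered = 20 * np.log10(np.abs(np.fft.rfft(hrir_16k_filtered, n=1024)) + 1e-12)
    mag_aliased = 20 * np.log10(np.abs(np.fft.rfft(hrir_16k_aliased, n=1024)) + 1e-12)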


I am sorry that I have no exact literature on this issue, but you can give it a try and see the results. As an example, the sampling frequency of telephone speech is normally 8 kHz; when we try to 3D-render this 8 kHz speech, it is quite difficult to perceive elevation. This is the main problem that we have to consider.

I look forward to your results and further discussion.

Best regards,
Junfeng 

 

On Thu, Aug 11, 2022 at 11:04 AM Junfeng Li <junfeng.li.1979@xxxxxxxxx> wrote:
Dear Dick, 

Thanks a lot for your information.

Yeah, the main problem for us is the limitation of the 16 kHz sampling frequency at the output side. Therefore, even if we do bandwidth extension on the input signal, we have to downsample to 16 kHz after the 3D rendering processing. I am wondering whether there is any possible method that exploits some psychoacoustic principle, or something along those lines.
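
A minimal sketch of this pipeline, assuming numpy/scipy; the bandwidth_extend helper is hypothetical (here just plain upsampling), and the signals and HRIRs are random placeholders:

    import numpy as np
    from scipy.signal import resample_poly, fftconvolve

    def bandwidth_extend(x_16k):
        # Placeholder: plain 3x upsampling creates no new energy above 8 kHz;
        # a real bandwidth-extension method would regenerate that content.
        return resample_poly(x_16k, up=3, down=1)

    speech_16k = np.random.randn(16000)    # placeholder: 1 s of 16 kHz speech
    hrir_left_48k = np.random.randn(512)   # placeholders; use measured 48 kHz HRIRs
    hrir_right_48k = np.random.randn(512)

    # Render binaurally at 48 kHz with the full-band HRIRs.
    speech_48k = bandwidth_extend(speech_16k)
    left_48k = fftconvolve(speech_48k, hrir_left_48k)
    right_48k = fftconvolve(speech_48k, hrir_right_48k)

    # The 16 ksps output constraint forces a final decimation, which again
    # removes everything above 8 kHz, including any regenerated elevation cues.
    left_16k = resample_poly(left_48k, up=1, down=3)
    right_16k = resample_poly(right_48k, up=1, down=3)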

Thanks again.

Best regards
Junfeng 

On Thu, Aug 11, 2022 at 12:29 PM Richard F. Lyon <dicklyon@xxxxxxx> wrote:
You could do "bandwidth extension" on the signals you want to spatialize, e.g. with some of the methods at
and then apply the high-sample-rate HRTFs. 
Of course, if your system has a 16 ksps limitation on the output side, that will be of no use.

Dick


On Wed, Aug 10, 2022 at 9:22 PM Junfeng Li <junfeng.li.1979@xxxxxxxxx> wrote:
Dear all, 

We are working on 3D audio rendering for signals with a low sampling frequency. 
As you may know, HRTFs are normally measured at a high sampling frequency, e.g., 48 kHz or 44.1 kHz. However, the sampling frequency of the sound signals in our application is restricted to 16 kHz. Therefore, to render this band-limited (≤8 kHz) signal, one straightforward way is to first downsample the HRTFs from 48 kHz/44.1 kHz to 16 kHz and then convolve them with the sound signals. However, the sound localization performance of signals rendered with this approach is greatly degraded, especially elevation perception. To improve sound localization performance, I am wondering whether there is a good method to solve or mitigate this problem in this scenario. 
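
A minimal sketch of this straightforward approach, assuming numpy/scipy and random placeholders in place of the measured HRIRs and the 16 kHz signal:

    import numpy as np
    from scipy.signal import resample_poly, fftconvolve

    hrir_left_48k = np.random.randn(512)   # placeholders; use measured 48 kHz HRIRs
    hrir_right_48k = np.random.randn(512)
    signal_16k = np.random.randn(16000)    # placeholder: 1 s of 16 kHz audio

    # Downsample the HRIRs from 48 kHz to 16 kHz (anti-aliasing included),
    # discarding all spectral detail above 8 kHz.
    # For 44.1 kHz HRIRs, use up=160, down=441 instead.
    hrir_left_16k = resample_poly(hrir_left_48k, up=1, down=3)
    hrir_right_16k = resample_poly(hrir_right_48k, up=1, down=3)

    # Binaural rendering at 16 kHz by convolving with the downsampled HRIRs.
    left_16k = fftconvolve(signal_16k, hrir_left_16k)
    right_16k = fftconvolve(signal_16k, hrir_right_16k)
    binaural_16k = np.stack([left_16k, right_16k], axis=1)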

Any discussion is welcome.

Thanks a lot again.

Best regards,
Junfeng