Subject: [AUDITORY] ISCA SIGML seminar: Audio Spectrogram Transformer for Audio Scene Analysis
From: Hao Tang <haotang@xxxxxxxx>
Date: Thu, 10 Jun 2021 00:47:58 +0100

Dear colleagues,

We are hosting a talk that might be of interest to the people on this list. Next Wednesday (16 Jun) at 5pm (UTC+0), Yuan Gong from MIT will talk about audio scene analysis with transformers. The details of the talk can be found at the end of this email and on the seminar webpage https://homepages.inf.ed.ac.uk/htang2/sigml/seminar/.

If you are interested, the link to the talk will be distributed through our mailing list https://groups.google.com/g/isca-sigml. Please subscribe and stay tuned!

Best,
Hao

---

Title: Audio Spectrogram Transformer for Audio Scene Analysis

Abstract: Audio scene analysis is an active research area with a wide range of applications. Since the release of AudioSet, great progress has been made in advancing model performance, mostly through the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio tagging models, yet they have not received the attention they deserve. In the first part of the talk, I will present PSLA, a collection of training techniques that can noticeably boost model accuracy. Turning to model architecture: over the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block of end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to the corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and whether neural networks based purely on attention are sufficient to obtain good performance in audio classification. In the second part of the talk, I will answer this question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification.

Bio: Yuan Gong is a postdoctoral associate at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He received his Ph.D. degree in Computer Science from the University of Notre Dame and his B.S. degree in Biomedical Engineering from Fudan University. He won the 2017 AVEC depression detection challenge, and one of his papers was nominated for the best student paper award at Interspeech 2019. His current research interests include audio scene analysis, speech-based health systems, and voice anti-spoofing.
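
---

P.S. For list members new to the idea of a convolution-free, purely attention-based audio classifier, the following is a minimal illustrative sketch of that general approach (split the spectrogram into patches, linearly embed them, and feed them to a transformer encoder with a classification token). It is not the speaker's AST implementation; the class name, dimensions, non-overlapping patches, and use of PyTorch's nn.TransformerEncoder are all simplifying assumptions made for illustration.

# Illustrative sketch only; not the speaker's code. Assumes PyTorch.
import torch
import torch.nn as nn

class ToySpectrogramTransformer(nn.Module):
    def __init__(self, n_mels=128, n_frames=1024, patch=16, dim=256,
                 depth=4, heads=4, n_classes=527):
        super().__init__()
        assert n_mels % patch == 0 and n_frames % patch == 0
        self.patch = patch
        n_patches = (n_mels // patch) * (n_frames // patch)
        # Linear projection of flattened (patch x patch) spectrogram patches.
        self.embed = nn.Linear(patch * patch, dim)
        # Learnable [CLS] token and positional embeddings, ViT-style.
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):  # spec: (batch, n_mels, n_frames)
        b = spec.size(0)
        # Cut the spectrogram into non-overlapping patches and flatten each one.
        patches = spec.unfold(1, self.patch, self.patch) \
                      .unfold(2, self.patch, self.patch)
        patches = patches.reshape(b, -1, self.patch * self.patch)
        x = self.embed(patches)
        x = torch.cat([self.cls.expand(b, -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])  # classify from the [CLS] token

# Usage: a batch of 2 log-mel spectrograms, 128 mel bins x 1024 frames.
logits = ToySpectrogramTransformer()(torch.randn(2, 128, 1024))
print(logits.shape)  # torch.Size([2, 527])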