
[AUDITORY] AudioSet: An ontology and human-labeled dataset for audio events



Dear colleagues,


We are excited to announce AudioSet, a comprehensive ontology of over 600 sound classes and a dataset of over 2 million 10-second YouTube clips annotated with sound labels.


The ontology is a manually assembled hierarchy of sound event classes, ranging from “Child speech” to “Ukulele” to “Boing.” It is informed by comparison with other sound research and sound-event sets, and refined in response to what we learned while annotating the videos. It remains a work in progress, and we hope to see community contributions and refinements.
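
As an illustration, here is a minimal Python sketch of walking the hierarchy, assuming the ontology ships as a JSON list of nodes with "id", "name", and "child_ids" fields (as in the released ontology.json); the traversal code itself is only an example, not part of the release:

    import json

    with open("ontology.json") as f:   # path is illustrative
        nodes = json.load(f)

    by_id = {node["id"]: node for node in nodes}

    # Roots are nodes that never appear as another node's child.
    child_ids = {c for node in nodes for c in node.get("child_ids", [])}
    roots = [node["id"] for node in nodes if node["id"] not in child_ids]

    def print_tree(node_id, depth=0):
        """Print a node and its descendants as an indented outline."""
        print("  " * depth + by_id[node_id]["name"])
        for child in by_id[node_id].get("child_ids", []):
            print_tree(child, depth + 1)

    for root in roots:
        print_tree(root)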


The dataset was created by mining YouTube for videos likely to contain a target sound, followed by crowdsourced human verification. The mining drew on approaches ranging from title search to content-based techniques. The ontology and dataset construction are described in more detail in our ICASSP 2017 paper.


The data release includes the URLs of all the excerpts, the sound classes judged present in each, and precomputed audio features from a VGG-inspired acoustic model.
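
For example, the segment lists can be read with a few lines of Python; this sketch assumes the released CSV layout of a commented header followed by rows of (YouTube ID, start seconds, end seconds, quoted comma-separated label IDs), and the filename is illustrative:

    import csv

    def load_segments(path):
        """Yield (ytid, start, end, labels) tuples from a segments CSV."""
        with open(path) as f:
            for row in csv.reader(f, skipinitialspace=True):
                if not row or row[0].startswith("#"):  # skip header/comments
                    continue
                ytid, start, end = row[0], float(row[1]), float(row[2])
                labels = row[3].split(",")  # ontology class IDs
                yield ytid, start, end, labels

    # Usage (filename is illustrative):
    # for ytid, start, end, labels in load_segments("balanced_train_segments.csv"):
    #     ...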


You can browse the ontology, and explore and download the data, at g.co/audioset.


Jort Gemmeke

On behalf of the sound and video understanding teams in the Machine Perception Research organization at Google.