
[AUDITORY] Audio captioning dataset



Dear list, 

=== Apologies for cross posting ===

We are happy to announce the release of Clotho, a novel and freely available dataset for audio captioning. It consists of 4981 audio samples of general sounds, each 15 to 30 seconds long, and 24 905 captions, each 8 to 20 words long.
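
As a quick sanity check, the counts above work out to exactly five captions per audio sample. The short Python sketch below verifies that arithmetic and shows one possible way to test a clip duration or a caption against the stated limits. The function names and the whitespace-based word count are assumptions made for illustration; they are not taken from the released code.

```python
# Figures quoted in the announcement.
NUM_AUDIO_SAMPLES = 4981
NUM_CAPTIONS = 24905
MIN_DURATION_S, MAX_DURATION_S = 15.0, 30.0
MIN_WORDS, MAX_WORDS = 8, 20


def captions_per_sample() -> float:
    """Average number of captions per audio sample."""
    return NUM_CAPTIONS / NUM_AUDIO_SAMPLES


def duration_in_range(duration_s: float) -> bool:
    """True if a clip duration lies within the stated 15 to 30 second range."""
    return MIN_DURATION_S <= duration_s <= MAX_DURATION_S


def caption_length_in_range(caption: str) -> bool:
    """True if a caption has between 8 and 20 words.

    Assumption: words are separated by whitespace.
    """
    return MIN_WORDS <= len(caption.split()) <= MAX_WORDS


print(captions_per_sample())  # 5.0
```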

Clotho is built with a focus on audio content and caption diversity. All sounds are from the Freesound platform, and the captions are crowdsourced through Amazon Mechanical Turk from annotators in English-speaking countries. Unique words, named entities, and speech transcriptions are removed in post-processing.


You can find Clotho online at Zenodo: https://zenodo.org/record/3490684 

The paper that presents Clotho is on arXiv: https://arxiv.org/abs/1910.09387 

We have also released code for handling the dataset: https://github.com/dr-costas/clotho-baseline-dataset


— For those that are not familiar with audio captioning — 
Audio captioning is the novel task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text), in which a system accepts an audio signal as input and outputs a textual description (i.e., the caption) of that signal.
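
To make the input and output of the task concrete, here is a minimal Python sketch of the interface such a system exposes: audio samples in, one free-text caption out. The names AudioCaptioner and PlaceholderCaptioner are hypothetical, and the placeholder does no real audio analysis; it only illustrates the shape of the task and is not the method from the paper.

```python
from typing import Protocol, Sequence


class AudioCaptioner(Protocol):
    """Hypothetical interface: an audio signal in, a free-text caption out."""

    def caption(self, signal: Sequence[float], sample_rate: int) -> str:
        ...


class PlaceholderCaptioner:
    """Stand-in system that only reports the clip duration.

    A real captioning system would describe the audio content itself.
    """

    def caption(self, signal: Sequence[float], sample_rate: int) -> str:
        duration_s = len(signal) / sample_rate
        return f"a sound lasting about {duration_s:.0f} seconds"


def describe(system: AudioCaptioner, signal: Sequence[float], sample_rate: int) -> str:
    """Run any captioning system on one audio signal."""
    return system.caption(signal, sample_rate)


# 200 silent samples at 10 Hz stand in for a 20-second clip.
print(describe(PlaceholderCaptioner(), [0.0] * 200, 10))
```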


Enjoy!

Konstantinos Drossos, 
Postdoc researcher, Audio Research Group
Tampere University, Finland