Million Song Dataset
It is our pleasure to announce the release of the Million Song
Dataset, a new resource to support music information research.
The Million Song Dataset is a freely available collection of audio
features and metadata for a million contemporary popular music tracks.
http://labrosa.ee.columbia.edu/millionsong/
Its purposes are:
* To encourage research on algorithms that scale to commercial sizes
* To provide a reference dataset for evaluating research
* To provide a shortcut alternative to creating a large dataset with
The Echo Nest's API
* To help new researchers get started in the MIR field
The core of the dataset is the feature analysis and metadata for one
million songs, provided by The Echo Nest. The dataset does not include
any audio, only the derived features. Note, however, that sample audio
can be fetched from services like 7digital, using code we provide.
The Million Song Dataset is a collaborative project between The Echo
Nest and LabROSA. It is hosted by Infochimps and supported in part by
the NSF.
Aside from instructions on how to get the dataset, the website contains:
* code and tutorials to get you started
* benchmark results for some example tasks (automatic tagging,
artist recognition, ...)
* artist-level mappings to link to the Yahoo Ratings Dataset (91%
of the artist ratings covered)
* demos, including fetching audio snippets, mapping artists on
a world map, ...
* forum, FAQ, blog, etc.
To better understand where this dataset comes from and what it aims to
achieve, you can read Dan Ellis' blog post: http://bit.ly/hF8ozR
We are keen to receive questions, comments and suggestions, and we
look forward to your new number-crunching MIR algorithms!
Thierry Bertin-Mahieux and Dan Ellis, for the Million Song Dataset team
http://labrosa.ee.columbia.edu/millionsong/