
New dataset of speech recorded on common consumer devices and professionally produced speech



Dear List,

I would like to announce a new dataset to help in research on the following problem:

Can We Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?

The goal of speech enhancement is typically to recover clean speech from noisy, reverberant, and often bandlimited recordings in order to improve intelligibility, clarity, or automatic speech recognition performance. However, for a great deal of speech content such as voice overs, podcasts, demo videos, lecture videos, and audio stories, the acoustic goal is often not merely clean speech, but speech that is aesthetically pleasing. This is achieved in professional recording studios by having a skilled sound engineer record clean speech in an acoustically treated room and then edit and process it with audio effects (a step referred to as production).

A growing amount of speech content is now recorded on common consumer devices such as tablets, smartphones, and laptops, typically in common but acoustically untreated environments such as homes and offices. It would therefore be useful to develop algorithms that automatically transform this low quality speech into professional production quality speech. This dataset was created to help address that problem. Specifically, it contains:

- Clean Speech - High quality studio recordings of 20 speakers with about 14 minutes of speech per speaker. The scripts are excerpts from public domain stories.

- Device Speech - The clean studio recordings were played through a high quality loudspeaker and re-recorded on consumer devices (a tablet and a smartphone) in a number of real-world environments, for a total of 12 device/room combinations. This strategy captures the nuances of a real recording while still allowing the recordings to be time aligned to each other and to the ground truth clean speech.

- Produced Speech - A professional sound engineer (with a great deal of voice over experience) processed the clean studio recordings to make them sound professionally produced and aesthetically pleasing, as he would for his standard voice over projects.

All versions of speech are time aligned.
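Because every excerpt exists in clean, produced, and multiple device/room versions, experiments typically need to enumerate the aligned versions of each recording. The sketch below illustrates one way to do this; the speaker IDs, script numbers, device/room names, and file naming convention are assumptions for illustration, not the dataset's actual layout.

```python
from itertools import product

# Hypothetical identifiers -- check the dataset's own file listing for the
# actual speaker IDs, script counts, and device/room names.
speakers = [f"f{i}" for i in range(1, 11)] + [f"m{i}" for i in range(1, 11)]  # 20 speakers
device_rooms = [
    f"{device}_{room}"
    for device, room in product(("tablet", "phone"), (f"room{i}" for i in range(1, 7)))
]  # 12 device/room combinations

def aligned_versions(speaker, script):
    """Return the (clean, produced, device) file names for one excerpt,
    under the assumed naming convention <speaker>_script<n>_<version>.wav."""
    base = f"{speaker}_script{script}"
    clean = f"{base}_clean.wav"
    produced = f"{base}_produced.wav"
    devices = [f"{base}_{dr}.wav" for dr in device_rooms]
    return clean, produced, devices

clean, produced, devices = aligned_versions("f1", 1)
```

Since all versions are time aligned, each clean file can be paired sample-for-sample with its produced and device counterparts, which is what makes supervised enhancement or production-style mapping experiments straightforward.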

Although the dataset was created for the problem mentioned above, it could be useful for other applications such as traditional speech enhancement and voice conversion.

The dataset is here:

https://archive.org/details/daps_dataset

A sample with all versions (of a single excerpt of a single speaker) is here:

https://archive.org/download/daps_dataset/sample.zip

My recent paper in IEEE Signal Processing Letters describes the dataset, as well as the problem at hand, in more detail:

https://ccrma.stanford.edu/~gautham/Site/Publications_files/mysore-spl2015.pdf

Please let me know if you have any questions.

Apologies for cross-posting!

Gautham