Subject: New dataset of speech recorded on common consumer devices and professionally produced speech
From: "Gautham J. Mysore" <gautham@xxxxxxxx>
Date: Thu, 8 Jan 2015 00:24:20 +0530
List-Archive: <http://lists.mcgill.ca/scripts/wa.exe?LIST=AUDITORY>

Dear List,

I would like to announce a new dataset to help in research on the following problem:

Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?

The goal of speech enhancement is typically to recover clean speech from noisy, reverberant, and often bandlimited speech in order to yield improved intelligibility, clarity, or automatic speech recognition performance. However, the acoustic goal for a great deal of speech content such as voice overs, podcasts, demo videos, lecture videos, and audio stories is often not merely clean speech, but speech that is aesthetically pleasing. This is achieved in professional recording studios by having a skilled sound engineer record clean speech in an acoustically treated room and then edit and process it with audio effects (referred to as production). A growing amount of speech content is being recorded on common consumer devices such as tablets, smartphones, and laptops. Moreover, it is typically recorded in common but non-acoustically treated environments such as homes and offices. It would therefore be useful to develop algorithms that automatically transform this low quality speech into professional production quality speech. This dataset was created to help address this problem. Specifically, it contains:

- Clean Speech - High quality studio recordings of 20 speakers with about 14 minutes of speech per speaker. The scripts are excerpts from public domain stories.

- Device Speech - The clean studio recordings were played through a high quality loudspeaker and recorded onto devices (tablet and smartphone) in a number of real-world environments. It contains a total of 12 device/room combinations. This strategy was used so that the nuances of a real recording would be captured while still allowing the recordings to be time aligned to each other and ground truth data.

- Produced Speech - A professional sound engineer (with a great deal of voice over experience) was asked to process the clean studio recordings to make them sound professionally produced and aesthetically pleasing as he does for his standard voice over projects.

All versions of speech are time aligned.

Although the dataset was created for the problem mentioned above, it could be useful for other applications such as traditional speech enhancement and voice conversion.

The dataset is here:

https://archive.org/details/daps_dataset

A sample with all versions (of a single excerpt of a single speaker) is here:

https://archive.org/download/daps_dataset/sample.zip

My recent paper in the Signal Processing Letters describes the dataset as well as the problem at hand in more detail:

https://ccrma.stanford.edu/~gautham/Site/Publications_files/mysore-spl2015.pdf

Please let me know if you have any questions.

Apologies for cross-posting!

Gautham
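
Because all versions of the speech are time aligned, a device recording can be compared with its clean counterpart sample by sample, with no alignment search first. The following is only a minimal sketch of that idea, not part of the dataset or the paper: the function name snr_db is hypothetical, the two signals are synthetic stand-ins rather than dataset audio, and treating the whole difference signal as "noise" is a simplifying assumption (real device recordings also differ in gain and frequency response).

```python
import math


def snr_db(reference, degraded):
    """Signal-to-noise ratio in dB, treating (degraded - reference) as noise.

    Assumes both sequences are time aligned, share a sample rate,
    and have the same length.
    """
    if len(reference) != len(degraded):
        raise ValueError("signals must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((d - r) ** 2 for r, d in zip(reference, degraded))
    if noise_power == 0:
        return math.inf  # identical signals: no measurable difference
    return 10.0 * math.log10(signal_power / noise_power)


# Synthetic stand-ins (assumption): one second of a 220 Hz tone at 16 kHz
# as the "clean" signal, and a half-amplitude copy as the "device" signal.
rate = 16000
clean = [math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
device = [0.5 * x for x in clean]

print(round(snr_db(clean, device), 2))  # about 6.02 dB
```

With real files, the same function would be applied to the clean and device versions of one excerpt after loading them as equal-length float sequences.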