
MULTIPLE SIMULTANEOUS SPEAKER CORPUS - a proposal

The Speech, Hearing and Language Group in Sheffield is about to
embark upon the collection of a multiple simultaneous speaker corpus.
The purpose of this mailing is to solicit suggestions and expressions
of interest in the project.

Possible uses for this corpus include studies in:

 * sound localisation
 * speaker identification
 * source segregation
 * dialogue/discourse analysis
 * intention processing
 * prosodic analysis
 * spoken language understanding
 * audio-visual analyses
 * automatic speech recognition
 * ...<** suggestions **>

Background: Our requirement is a corpus of 'lightly-annotated'
sampled data with which to evaluate our own work in sound source
separation (the hearing people) and intention analysis (the NLP
people). To this end, we are interested in environments in which
several acoustic sources are present at the same time. One such
scenario arises naturally in the task of minute-taking.


Proposed scenario: Five people will engage in a task-oriented
discussion along with an artificial secretary (actually, just a
mannequin equipped for binaural recording). The task will be chosen
not only to provide a high likelihood of overlapped acoustic
material, but also to be a realistic topic of discourse for
researchers in NLP.
We welcome suggestions for task domains.

Outline requirements <** comments please **>

A. Collection


  1. The data will be collected in acoustically controlled
conditions, with reverberation etc. commensurate with an
average-sized meeting room.

  2. Eight channels of data will be recorded: one for each of the
five speakers, two binaural channels from the mannequin and one from
a centrally-placed omnidirectional microphone (a sketch of this
layout follows the list).
  3. A video recording of the session will be made.
  4. A small amount of speech data will be collected from each
speaker individually.
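
To make A.2 concrete, here is a minimal sketch (in Python) of how the
eight channels might be laid out. The channel ordering, names and
sample rate are our assumptions for discussion, not settled
decisions.

    # Illustrative channel map for the proposed recordings.
    # All names and the sample rate below are assumptions only.
    CHANNELS = {
        0: "speaker_1_close_mic",
        1: "speaker_2_close_mic",
        2: "speaker_3_close_mic",
        3: "speaker_4_close_mic",
        4: "speaker_5_close_mic",
        5: "mannequin_left_ear",   # binaural pair from the
        6: "mannequin_right_ear",  # 'artificial secretary'
        7: "omni_centre",          # central omnidirectional mic
    }
    SAMPLE_RATE_HZ = 16000  # assumed; not fixed by this proposal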


B. Analysis


  1. Individual microphone signals will be subjected to
semi-automatic endpoint detection (see the sketch after this list).
  2. Orthographic transcription of each individual signal (and of any
extraneous material picked up at the other receptors).
  3. Each 'unit of discourse' may be tagged with speaker intention
and other high-level descriptors.
  4. <** suggestions **>
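
To make B.1 concrete, the following is a minimal sketch of one
possible first pass: a crude energy-based endpoint detector. The
frame size and threshold are illustrative guesses, and the
'semi-automatic' part would be a human checking and correcting the
boundaries it proposes.

    import numpy as np

    def detect_endpoints(signal, rate, frame_ms=20, threshold_db=-35.0):
        """Return (start, end) sample indices of the active region of
        `signal`, or None if no frame exceeds the energy threshold.
        All parameter values are illustrative, not agreed choices."""
        frame = int(rate * frame_ms / 1000)
        n = len(signal) // frame
        frames = signal[: n * frame].reshape(n, frame)
        # Short-time energy in dB, relative to the loudest frame.
        energy = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        energy -= energy.max()
        active = np.nonzero(energy > threshold_db)[0]
        if active.size == 0:
            return None
        return active[0] * frame, (active[-1] + 1) * frame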

C. Mastering/Distribution

  To make the following available (at cost) to the speech, hearing
and natural language community on CD-ROM: the eight channels of
sampled data, a compressed video signal and whatever transcriptions
exist.


Our ideas are rather more developed than this deliberately loose
description suggests; however, we feel the corpus may be of
sufficient interest to warrant suggestions at this early stage of
design. We
have funding to satisfy our own requirements; ideally, however, we'd
like to broaden the project so that it finds wider use in the
community. If you can offer advice, assistance or analysis or would
like to be involved in this project, we'd love to hear from you!


Rough timetable:
  requirements analysis (now until end September)
  data collection (to complete by end '93)
  analysis (basic stuff to be complete by end Q2 '94)

Martin Cooke

pp Guy Brown, Malcolm Crawford, Phil Green, Paul Mc Kevitt and Yorick
Wilks.


email: m.cooke@dcs.shef.ac.uk
fax: +44 742 780972

(circulated to auditory, ear-mail, salt, elsnet, ectl; please forward
this to any other relevant lists)