Abstract:
A typical problem in speech recognition of discourse that spans several subject domains is identifying which subjects the discourse contains. This paper focuses on radio news and proposes a method for dividing it into subject domains. First, a method for classifying newspaper articles to build referential subject domains is described. Each subject is characterized by a χ² vector calculated from word frequencies. Then, a method that uses these χ² vectors to segment radio news and classify each segment into a suitable subject domain is presented. Two experiments were conducted. The first concerned the clustering technique for newspaper articles: 60 articles were classified into six referential subject domains, with a correct classification rate of 96.6%. The second concerned the segmentation of radio news, using the χ² vectors obtained in the first experiment. Radio news utterances were manually transcribed into romaji strings, which correspond closely to phoneme strings, and these strings were used as input. Precision, recall, and correctness were about 90%, 30%, and 70%, respectively, demonstrating the effectiveness of the method.
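As an illustration of how a subject domain might be characterized by a χ² vector of word frequencies, the following is a minimal sketch. The abstract does not state the formula, so the sketch assumes the common contingency-table form, χ² = (observed - expected)² / expected, with the expected count taken from the word's overall frequency and the domain's share of all tokens. The function name `chi_square_vectors` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def chi_square_vectors(domain_tokens):
    """Return one chi-square vector (word -> value) per subject domain.

    domain_tokens: dict mapping a domain name to a list of word tokens.
    Assumed formula (not confirmed by the abstract):
        chi2 = (observed - expected)**2 / expected
        expected = word_total * domain_total / grand_total
    """
    counts = {d: Counter(tokens) for d, tokens in domain_tokens.items()}
    vocab = sorted({w for c in counts.values() for w in c})
    domain_totals = {d: sum(c.values()) for d, c in counts.items()}
    word_totals = {w: sum(c[w] for c in counts.values()) for w in vocab}
    grand_total = sum(domain_totals.values())

    vectors = {}
    for d, c in counts.items():
        vec = {}
        for w in vocab:
            expected = word_totals[w] * domain_totals[d] / grand_total
            # An empty domain gives expected == 0; define its value as 0.0.
            vec[w] = (c[w] - expected) ** 2 / expected if expected else 0.0
        vectors[d] = vec
    return vectors


# Toy example with two hypothetical domains.
toy = {
    "politics": ["vote", "vote", "law"],
    "sports": ["goal", "goal", "vote"],
}
vectors = chi_square_vectors(toy)
for domain, vec in vectors.items():
    print(domain, {w: round(v, 3) for w, v in vec.items()})
```

In this sketch, a word that is absent from a domain but frequent elsewhere receives a large value just as an over-represented word does, so a signed variant may be preferable when the direction of the deviation matters.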