Abstract:
Phoneme-based approaches to ASR can achieve good performance on medium-sized vocabularies for individual languages. Transfer to a different target language, however, generally produces poor results. An alternative, phonologically based model that uses a set of seven universal subsegmental elements as recognition targets is presented. The elements are phonologically monovalent and are postulated to have a direct encoding in the acoustic signal. Speech segments are expressed in terms of elements, and recognition proceeds by identifying the elements present in the speech signal. This model, which evolved from government phonology theory, forms the basis of a single recognition module that can be applied to most languages by selecting the appropriate lexicon and linguistic constraints. Initial experiments to test this approach used perceptually based front ends and an MLP classifier. For each element, an individual neural net was trained on the TIMIT (American English) database. These classifiers were then tested on Spanish, German, and British English (isolated words recorded by native speakers in nonstudio conditions). For elements recognized on TIMIT, average transfer rates to these languages were of the order -6%. The results are encouraging in terms of transfer to different languages. Results for Mandarin Chinese and Japanese will also be presented.
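
The element-based decoding step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The element labels (E1 to E7), the segment names, the 0.5 threshold, and the function names are all hypothetical placeholders, since the abstract does not name the seven elements or specify how detector outputs are combined. The sketch only shows the general idea: one detector score per element, a present/absent decision for each monovalent element, and a lookup in a language-specific segment inventory.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical labels: the abstract states there are seven elements
# but does not name them.
ELEMENTS = ("E1", "E2", "E3", "E4", "E5", "E6", "E7")

# Hypothetical language-specific inventory: each segment is defined
# by the set of elements it contains.
SEGMENTS: Dict[FrozenSet[str], str] = {
    frozenset({"E1"}): "seg_a",
    frozenset({"E2"}): "seg_i",
    frozenset({"E1", "E2"}): "seg_e",
}


def detect_elements(scores: Dict[str, float],
                    threshold: float = 0.5) -> FrozenSet[str]:
    """Return the elements judged present.

    Elements are monovalent (present or absent), so each per-element
    detector score is thresholded independently. The threshold value
    is an assumption made for illustration.
    """
    return frozenset(e for e in ELEMENTS if scores.get(e, 0.0) >= threshold)


def decode_segment(scores: Dict[str, float],
                   inventory: Dict[FrozenSet[str], str] = SEGMENTS,
                   threshold: float = 0.5) -> Optional[str]:
    """Map detector scores to a segment, or None if no segment matches."""
    return inventory.get(detect_elements(scores, threshold))
```

In this sketch, moving to another language would mean swapping the `inventory` argument while leaving the element detectors unchanged, which mirrors the abstract's claim that a single recognition module is adapted through the lexicon and linguistic constraints.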