Re: reference needed (ASR) (Toth Laszlo)


Subject: Re: reference needed (ASR)
From:    Toth Laszlo  <tothl@xxxxxxxx>
Date:    Mon, 2 Oct 2006 15:06:35 +0200

Dear List Members,

First of all, I would like to thank everybody who replied to my question regarding the "better phone recognition - poorer word recognition" phenomenon in ASR. Some comments and replies:

1. Basically I understand how this can occur, or at least I 'feel' it. I do not doubt that this is a well-known phenomenon but, as Prof. Moore wrote, there seems to be no definitive reference on it (e.g. a paper that carefully analyzes when and why it occurs). Also, I agree with Jont Allen that phone recognition results are relatively rarely published. I think this is because the word error rate is much more important from a practical point of view. I personally think that phone-level results are really important for gaining insight, but I understand that those working on applications do not care about them. Also, papers that focus on phone recognition mostly don't move on to word recognition, so the counter-intuitive behaviour mentioned above will not be observed. Anyway, I received a couple of very good references from you that will be very useful, thanks a lot.

2. To Jont Allen: I don't think that I'm doing anything particularly interesting or special. I'm just experimenting with some HMM/ANN hybrid phone models, and have found that while some of my models give better phone recognition results, they perform much worse in word recognition tasks. That's all.

3. In reply to those who said that human and machine recognition should work similarly in this respect: unfortunately this is not true, simply because in speech recognizers the acoustic and language information are combined in a very unintelligent and counter-intuitive way. Basically, a hypothesis is accepted only if it is supported by BOTH the acoustic and the language model. Imagine that you have a very good phone-level recognizer that can recognize every phone apart from one rarely occurring phone, which it is not willing to accept. Then all the words that contain this phone will be misrecognized, so the word-level error can be quite high in spite of the good phone-level performance. This can be more or less handled by improving the pronunciation models, but I think it does not have much to do with how the human brain handles this problem.

Laszlo Toth
Hungarian Academy of Sciences
Research Group on Artificial Intelligence
e-mail: tothl@xxxxxxxx
http://www.inf.u-szeged.hu/~tothl
"Failure only begins when you stop trying"
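
The thought experiment in point 3 can be made concrete with a toy calculation. The following is only a minimal Python sketch and not part of the original message: the lexicon, the pronunciations, the test sentence and the function name are all invented for illustration, and the rule that a word counts as correct only if every one of its phones is accepted is a deliberately crude stand-in for a real decoder.

```python
# Toy illustration only: hypothetical lexicon and pronunciations, not real data.
LEXICON = {
    "vision":  ["v", "ih", "zh", "ah", "n"],
    "is":      ["ih", "z"],
    "a":       ["ah"],
    "good":    ["g", "uh", "d"],
    "measure": ["m", "eh", "zh", "er"],
    "of":      ["ah", "v"],
    "speech":  ["s", "p", "iy", "ch"],
    "quality": ["k", "w", "aa", "l", "ah", "t", "iy"],
}

UTTERANCE = "vision is a good measure of speech quality".split()


def phone_and_word_accuracy(words, lexicon, rejected_phone):
    """Return (phone accuracy, word accuracy) for a recognizer that gets
    every phone right except `rejected_phone`, which it never accepts.

    Assumption: a word is recognized only if all of its phones are accepted.
    """
    total_phones = 0
    correct_phones = 0
    correct_words = 0
    for word in words:
        phones = lexicon[word]
        hits = sum(1 for p in phones if p != rejected_phone)
        total_phones += len(phones)
        correct_phones += hits
        if hits == len(phones):
            correct_words += 1
    return correct_phones / total_phones, correct_words / len(words)


phone_acc, word_acc = phone_and_word_accuracy(UTTERANCE, LEXICON, "zh")
print(f"phone accuracy: {phone_acc:.3f}")
print(f"word accuracy:  {word_acc:.3f}")
```

In this toy setup the recognizer is right on 26 of 28 phones (about 93%) but on only 6 of 8 words (75%), because both words that contain the rejected phone are lost.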


This message came from the mail archive
http://www.auditory.org/postings/2006/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University