Re: reference needed (ASR)
Dear List Members,
First of all, I would like to thank everybody who replied to my question
regarding the "better phone recognition - poorer word recognition"
phenomenon in ASR. Some comments and replies:
1. Basically I understand how this can occur, well, at least I 'feel' it.
I do not doubt that this is a well-known phenomenon but, as Prof. Moore
wrote, there seems to be no definitive reference on this (e.g. a paper
that carefully analyzes when and why this occurs). Also, I agree with
Jont Allen that phone recognition results are relatively rarely
published. I think this is because the word error rate is much more
important from a practical point of view. I personally think that the
phone-level results are really important for gaining insight, but I
understand that those working on applications do not care about them. Also,
those papers that focus on phone recognition mostly don't move on to word
recognition, so the mentioned counter-intuitive behaviour will not be
observed. Anyway, I received a couple of very good references from you
that will be very useful, thanks a lot.
2. To Jont Allen: I don't think that I'm doing anything particularly
interesting or special. I'm just experimenting with some HMM/ANN hybrid
phone models, and have found that while some of my models give better
phone recognition results, they perform much worse in word recognition
tasks. That's all.
3. In reply to those who said that human and machine recognition should
work similarly in this respect: unfortunately this is not true, simply
because in speech recognizers the acoustic and language information get
combined in a very unintelligent and counter-intuitive way. Basically, a
hypothesis is accepted only if it is supported by BOTH the acoustic and
the language model. Imagine that you have a very good phone-level
recognizer that can recognize every phone apart from one rarely occurring
phone, which it is not willing to accept. Then all the words that contain
this phone will be misrecognized, so the word-level error can be quite
high, in spite of the good phone-level performance. This can be more or
less handled by improving the pronunciation models, but I think it does
not have much to do with how the human brain handles this problem.
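The thought experiment in point 3 can be made concrete with a toy sketch.
Everything in it is an assumption made for illustration: the four-word
lexicon, the ARPAbet-like phone symbols, the choice of "zh" as the rare
phone, its substitution by "sh", and above all the strict lexicon lookup,
which stands in for a real decoder that would weigh acoustic and language
model scores. It is not a description of any actual recognizer.

```python
# Toy lexicon: word -> list of allowed pronunciations (hypothetical data).
BASE_LEXICON = {
    "measure": [["m", "eh", "zh", "er"]],
    "vision":  [["v", "ih", "zh", "ah", "n"]],
    "cat":     [["k", "ae", "t"]],
    "dog":     [["d", "ao", "g"]],
}

RARE_PHONE = "zh"    # assumed rare phone that the recognizer never outputs
SUBSTITUTE = "sh"    # assumed phone it outputs instead


def recognize_phone(phone):
    """Perfect phone recognizer, except that it refuses the rare phone."""
    return SUBSTITUTE if phone == RARE_PHONE else phone


def decode_word(phones, lexicon):
    """Strict lookup: a word is accepted only if a pronunciation matches exactly."""
    for word, pronunciations in lexicon.items():
        if phones in pronunciations:
            return word
    return None


def evaluate(test_words, lexicon):
    """Return (phone accuracy, word accuracy) for the given decoding lexicon."""
    phones_total = phones_correct = words_correct = 0
    for word in test_words:
        reference = BASE_LEXICON[word][0]
        hypothesis = [recognize_phone(p) for p in reference]
        phones_total += len(reference)
        phones_correct += sum(r == h for r, h in zip(reference, hypothesis))
        words_correct += decode_word(hypothesis, lexicon) == word
    return phones_correct / phones_total, words_correct / len(test_words)


# "Improved pronunciation model": add the variants the recognizer actually produces.
PATCHED_LEXICON = {word: list(prons) for word, prons in BASE_LEXICON.items()}
PATCHED_LEXICON["measure"].append(["m", "eh", "sh", "er"])
PATCHED_LEXICON["vision"].append(["v", "ih", "sh", "ah", "n"])

TEST_WORDS = ["measure", "vision", "cat", "dog"]
base_phone_acc, base_word_acc = evaluate(TEST_WORDS, BASE_LEXICON)
patched_phone_acc, patched_word_acc = evaluate(TEST_WORDS, PATCHED_LEXICON)

print(f"base lexicon:    phone acc {base_phone_acc:.2f}, word acc {base_word_acc:.2f}")
print(f"patched lexicon: phone acc {patched_phone_acc:.2f}, word acc {patched_word_acc:.2f}")
```

With this toy data, 13 of the 15 phones are correct, yet only 2 of the 4
words are recognized with the base lexicon. Adding the two pronunciation
variants recovers all 4 words while the phone accuracy stays the same.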
Laszlo Toth
Hungarian Academy of Sciences *
Research Group on Artificial Intelligence * "Failure only begins
e-mail: tothl@xxxxxxxxxxxxxxx * when you stop trying"
http://www.inf.u-szeged.hu/~tothl *