Abstract:
Dimension reduction of the acoustic feature space is realized with a wine-glass-type neural network, which has fewer units in the middle layer than in the input and output layers and is trained to perform the identity mapping. Three-layer wine-glass-type networks with 32 units in each of the input and output layers and two to five units in the middle layer are trained to map 32-dimensional LPC-derived log-spectrum vectors onto identical output vectors. After 500 iterations of backpropagation learning, the networks reproduce their input with a signal-to-deviation ratio of the log spectrum of 2.7 dB (five units) to 4.14 dB (two units) on the training data. For evaluation, DTW isolated-word recognition experiments are performed on 132 similar city-name utterances from a male speaker. With the outputs of the middle-layer units used as the reduced feature vector, the recognition accuracy ranges from 75.6% (two units) to 88.6% (five units). Since the accuracy obtained with 16 cepstral coefficients derived from the original speech is 89.4%, only 0.8 points above the five-unit result, the effectiveness of nonlinear identity mapping for reducing the feature dimension is confirmed.
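
The identity-mapping training described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tanh middle-layer units, the linear output layer, the learning rate, the per-vector weight updates, and the formula used for the signal-to-deviation ratio (10 log10 of signal energy over deviation energy) are all assumptions, and `make_toy_data` is a hypothetical stand-in that produces small synthetic vectors rather than LPC-derived log spectra.

```python
import math
import random


def forward(x, w1, b1, w2, b2):
    """Return (middle-layer outputs, reconstructed vector) for one input x."""
    h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    y = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(w2, b2)]
    return h, y


def train_identity(data, n_mid, epochs=500, lr=0.02, seed=0):
    """Train an n_in -> n_mid -> n_in network to reproduce its input."""
    rng = random.Random(seed)
    n_in = len(data[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_mid)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_mid)] for _ in range(n_in)]
    b1, b2 = [0.0] * n_mid, [0.0] * n_in
    for _ in range(epochs):
        for x in data:
            h, y = forward(x, w1, b1, w2, b2)
            # Gradient of 0.5 * sum((y - x)^2) with respect to the output y.
            err = [yi - xi for yi, xi in zip(y, x)]
            # Backpropagate through the linear output layer and the tanh units.
            dh = [(1.0 - h[j] ** 2) * sum(err[i] * w2[i][j] for i in range(n_in))
                  for j in range(n_mid)]
            for i in range(n_in):
                for j in range(n_mid):
                    w2[i][j] -= lr * err[i] * h[j]
                b2[i] -= lr * err[i]
            for j in range(n_mid):
                for i in range(n_in):
                    w1[j][i] -= lr * dh[j] * x[i]
                b1[j] -= lr * dh[j]
    return w1, b1, w2, b2


def sdr_db(data, params):
    """Assumed definition: 10*log10(signal energy / deviation energy), in dB."""
    signal = deviation = 0.0
    for x in data:
        _, y = forward(x, *params)
        signal += sum(v * v for v in x)
        deviation += sum((v - r) ** 2 for v, r in zip(x, y))
    return 10.0 * math.log10(signal / deviation)


def make_toy_data(n_vectors=40, n_in=8, seed=1):
    """Hypothetical toy data: two latent factors mixed linearly into n_in values."""
    rng = random.Random(seed)
    mix = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_in)]
    data = []
    for _ in range(n_vectors):
        t1, t2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append([a * t1 + b * t2 for a, b in mix])
    return data


data = make_toy_data()
params = train_identity(data, n_mid=2, epochs=200)
reduced, _ = forward(data[0], *params)  # middle-layer outputs = reduced features
print("reduced feature vector:", reduced)
print("SDR on training data (dB):", sdr_db(data, params))
```

In the setting of the abstract, `data` would hold 32-dimensional log-spectrum vectors, `n_mid` would range from two to five, and the middle-layer outputs returned by `forward` would serve as the reduced feature vectors passed to the DTW recognizer.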