
2.3 Speech Recognition Systems

In the PDF document gyan.iitg.ernet.in (Pages 47-50)

Throughout this thesis, the ASR performances are evaluated on systems developed using the HTK toolkit [89] for two different tasks, viz., a connected digit recognition task and a continuous speech recognition task.

The speech analysis is done using a Hamming window of length 25 ms, a frame rate of 100 Hz and a pre-emphasis factor of 0.97. The 13-dimensional MFCC [90] base features (C0–C12) are computed using a 21-channel filterbank in HTK. In HTK, the Mel filterbank is implemented as a uniform filterbank in the Mel frequency domain which is then mapped to the linear frequency domain. As a result, in the HTK implementation, the bandwidths of the filters up to 1 kHz do not turn out to be strictly constant, unlike in the implementation proposed by Malcolm Slaney [91]. In addition to the base features, their first and second order temporal derivatives, computed over a span of 5 frames, are also appended, making the final features 39-dimensional; these are henceforth referred to as the ‘default’ MFCC features. Cepstral mean subtraction is also applied to all features. The details of the MFCC feature computation process are given in Appendix A.
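To illustrate why the HTK-style filterbank lacks strictly constant bandwidths below 1 kHz, the sketch below places 21 triangular-filter centres uniformly on the Mel scale and maps them back to Hz; the linear-frequency spacing then grows with frequency. This is a minimal sketch, not the toolkit's code; the function names and the 8 kHz upper frequency are our assumptions.

```python
import math

# Standard Mel <-> Hz conversions (same formula family HTK uses).
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def htk_filter_centres(n_chans=21, f_low=0.0, f_high=8000.0):
    """Centre frequencies (Hz) of triangular filters equispaced in Mel."""
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_chans + 1)  # uniform step in Mel
    return [mel_to_hz(mel_low + step * (c + 1)) for c in range(n_chans)]

centres = htk_filter_centres()
# Spacing in Hz is non-uniform: it widens toward high frequency, so the
# low-frequency filters are not constant-bandwidth as in Slaney's design.
print(centres[1] - centres[0], centres[-1] - centres[-2])
```

Because the mapping from Mel back to Hz is convex, even the filters below 1 kHz end up with slightly unequal linear-frequency bandwidths, which is the discrepancy with Slaney's implementation noted above.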

The word error rate (WER) is used to evaluate the speech recognition performance of various techniques throughout the work in this thesis. The word error rate is computed as follows:

%WER = (Sub + Del + Ins) / (Total No. of Words) × 100    (2.1)

where ‘Sub’ represents the number of substitutions, ‘Del’ the number of deletions and ‘Ins’ the number of insertions made in the hypothesized text transcript with respect to the true transcription.
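Eq. (2.1) can be computed with a standard Levenshtein alignment over words, where the minimum edit distance jointly counts substitutions, deletions and insertions. The helper below is a hypothetical sketch for illustration; in practice HTK's HResults tool reports this figure.

```python
def wer(ref, hyp):
    """Word error rate (%) between reference and hypothesis word lists."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                        # i deletions
    for j in range(m + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,             # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return 100.0 * d[n][m] / n

# One substitution ("two" -> "too") and one insertion ("four") over 3 words:
print(wer("one two three".split(), "one too three four".split()))  # -> 66.66...
```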

Table 2.2: Age group-wise break up of children’s speech corpus used in the connected digit recognition task.

Age Group (Yrs.)     6-7     8-9      10-11    12-13   14-15
No. of Speakers      8       31       42       17      3
(Boys/Girls)         (5/3)   (12/19)  (27/15)  (5/12)  (1/2)
No. of Utterances    615     2386     3231     1309    231

Table 2.3: Age group-wise break up of children’s speech corpus used in the continuous speech recognition task.

Age Group (Yrs.)     4-5     6-7     8-9     10-11    12-13
No. of Speakers      1       12      16      28       3
(Boys/Girls)         (1/0)   (5/7)   (5/11)  (18/10)  (3/0)
No. of Utterances    2       20      45      58       4

2.3.1 Connected Digit Recognition

For the connected digit recognition task, the recognizers are developed following the setup described in [92]. The 11 digits (0-9 and OH) are modeled as whole-word continuous density HMMs using 16 emitting states per word. Each state is a mixture of 5 diagonal-covariance Gaussian distributions, with simple left-to-right transitions without any skips over the states. A 3-state model with 6 diagonal-covariance components is used for modeling silence. A single-state model with 6 diagonal-covariance components (allowing skip) is used for the short-pause model, tied to the center state of the silence model. The details of the procedure used for training and testing a continuous density isolated unit HMM are given in Appendix B.

The adults’ speech trained recognizer is trained using the adults’ speech data set ‘ADtr’ and is tested against the children’s speech data set ‘CHts1’ and the adults’ speech data set ‘ADts’. For developing a matched children’s ASR system, the CHts1 data set, which comprises all the children’s speech data available in the TIDIGITS corpus, is split into two disjoint sets: ‘CHtr’ (used for training) and ‘CHts2’ (used for matched testing). The adults’ speech recognition performance on the children’s speech trained models is evaluated using the same adults’ speech data set ‘ADts’. The baseline recognition performance (in WER) for the ADts and CHts1 test sets on the adults’ speech trained digit recognizer is 0.43% and 11.37%, respectively. The baseline recognition performance (in WER) for the ADts and CHts2 test sets on the children’s speech trained digit recognizer is 13.28% and 1.01%, respectively.
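The left-to-right, no-skip topology of the 16-emitting-state whole-word digit models can be visualized through its transition matrix. The sketch below is illustrative only: the self-loop/forward probabilities are arbitrary initial values rather than trained ones, and the extra entry/exit states follow HTK's convention of non-emitting first and last states.

```python
def left_to_right_transmat(n_emitting=16, self_loop=0.6):
    """Transition matrix for a strict left-to-right HMM with no skips."""
    n = n_emitting + 2                     # add non-emitting entry/exit states
    a = [[0.0] * n for _ in range(n)]
    a[0][1] = 1.0                          # entry state jumps to first emitting state
    for i in range(1, n - 1):
        a[i][i] = self_loop                # stay in the current state, or
        a[i][i + 1] = 1.0 - self_loop      # advance to the next; no skips allowed
    return a                               # last (exit) row is all zeros

a = left_to_right_transmat()
# Each emitting state has exactly two outgoing transitions: self-loop and next.
print(len(a), a[1][1], a[1][2])  # -> 18 0.6 0.4
```

The same topology (3 emitting states per model) underlies the tri-phone models of the next subsection; only the state count and mixture sizes differ.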

2.3.2 Continuous Speech Recognition

For the continuous speech recognition task, the recognizer is developed using cross-word tri-phone acoustic models along with decision tree based state tying. Each tri-phone acoustic model consists of 3 emitting states with 8 diagonal-covariance Gaussian components per state. A 3-state model with 16 diagonal-covariance Gaussian components is used for modeling silence, and a short-pause model (allowing skip) is constructed with all states tied to the silence model. The adults’ speech trained recognizer is trained using the adults’ speech data set ‘CAMtr’, resulting in 2499 tied states after state tying. To evaluate the adults’ speech and children’s speech recognition performances on this adults’ speech trained continuous speech recognizer, the ‘CAMts’ and ‘PFts’ data sets are used, respectively. The children’s speech trained recognizer is trained using the children’s speech data set ‘PFtr’, while its recognition performance is evaluated against the children’s data set ‘PFts’ and the adults’ speech data set ‘CAMts’.

The standard WSJ0 5,000-word closed non-verbalized punctuation vocabulary set and the standard MIT-Lincoln Labs 5k Wall Street Journal bi-gram language model are used for recognition of the adults’ test set CAMts, which has no out-of-vocabulary (OOV) words. For recognition of the children’s test set PFts, a 1,500-word non-verbalized punctuation vocabulary set and a 1.5k bi-gram language model are used. The language model for recognizing the children’s test set PFts is trained using the transcripts of the children’s speech data set PFtr, such that the PFts test set has an OOV rate of 1.02%. The pronunciations for all words are obtained from the British English Example Pronunciation (BEEP) dictionary [87, 93].

The baseline recognition performance (in WER) for the CAMts and PFts test sets on the adults’ speech trained continuous speech recognizer is 9.92% and 56.34%, respectively. The recognition performance for the children’s speech data set PFts is far worse than that obtained on the adults’ speech data set CAMts due to the large acoustic mismatch between the adults’ training and the children’s test data, and also due to the loss of spectral information in the case of the narrowband children’s speech. On the children’s speech trained continuous speech recognizer, the baseline recognition performance (in WER) for the CAMts and PFts test sets is 68.36% and 12.41%, respectively. The poorer ASR performance for children’s test speech on matched children’s speech trained acoustic models, in comparison to that for adults’ test speech on matched adults’ speech trained models, is attributed to the greater intra- and inter-speaker variability among children than among adults [28, 30]. It is to be noted that the trends observed in the ASR performances obtained for the above data sets and experimental setups are consistent with those already reported in the literature.
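The OOV rate quoted for PFts is simply the percentage of test-set word tokens that are absent from the recognition vocabulary. A minimal sketch, with illustrative data (the function name and inputs are our own, not part of HTK):

```python
def oov_rate(test_transcripts, vocabulary):
    """Percentage of word tokens in the transcripts not in the vocabulary."""
    vocab = set(vocabulary)
    tokens = [w for line in test_transcripts for w in line.split()]
    oov = sum(1 for w in tokens if w not in vocab)
    return 100.0 * oov / len(tokens)

# One OOV token ("d") out of five tokens -> 20% OOV rate.
print(oov_rate(["a b c", "a d"], ["a", "b", "c"]))  # -> 20.0
```

An OOV rate of 1.02% thus means roughly 1 test token in 100 falls outside the 1.5k vocabulary and can never be recognized correctly, putting a floor under the achievable WER.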

The recognition systems described above for both the connected digit and continuous speech recognition tasks are used throughout this thesis. This thesis focuses on children’s speech recognition performance on adults’ speech trained models, i.e., under the mismatched condition. So, unless specified otherwise, all children’s speech recognition performances reported in this thesis refer to this mismatched condition.
