Truncation of MFCC Features for Children’s ASR

7.1 Introduction

In Section 4.3, the pitch-dependent distortions appearing in the Mel spectral envelope of the high pitch signals have been found to increase the dynamic range of the higher order coefficients of 13-D truncated MFCC (C₀−C₁₂) features. This in turn causes the variances of those higher order MFCCs to increase with increase in the pitch of the signals as noted in Section 3.2.1. It is already known that children’s speech have higher pitch frequencies in comparison to those of the adults’ speech [27].

Therefore, higher order coefficients of MFCC features corresponding to children’s speech would have significantly higher variances in comparison to those in case of the adults’ speech.

Motivated by these, in this chapter, the truncation of MFCC features is explored for children’s ASR on adults’ speech trained models. The role of MFCC feature truncation in reducing the pitch mismatch between the adults’ and the children’s speech is then explored. Based on the correlation observed between the length of the base MFCC features for a test signal and the degree of acoustic mismatch with respect to speech recognition acoustic models, an automatic algorithm is proposed for choosing utterance-specific appropriate truncation of MFCC features for children’s speech recognition on adults’ speech trained models.

The outline of this chapter is as follows: In Section 7.2, truncation of MFCC features is explored for children’s speech recognition on adults’ speech trained models. The role of MFCC feature truncation in addressing the pitch mismatch between the adults’ and the children’s speech in context of children’s ASR on adults’ speech trained models is studied in Section 7.3. In Section 7.4, an automatic algorithm for utterance-specific MFCC truncation for test signals is proposed. VTLN and CMLLR techniques are explored in combination with the proposed algorithm for ASR in Section 7.5. Finally, the summary of the work presented in this chapter is given in Section 7.6.

using the corresponding dimensionality models. For optimal ASR evaluations, the acoustic models are required to be re-estimated to match the dimensionality of the truncated test features. However, in order to avoid the large computational complexity and the need of a large storage space, we explore using the suboptimal truncated models extracted from the full 39-D baseline models for ASR evaluations. On the connected digit recognition task, a difference of only 0.3% is noted in the children’s mismatched ASR performances obtained using the re-trained models and the truncated models corresponding to the 3-D base MFCC features. Therefore, in this work, though all ASR experiments on the connected digit recognition task are done using the re-estimated reduced dimensionality models but, on the continuous speech recognition task the ASR performances are evaluated using the truncated models extracted from the 39-D baseline models.

The recognition performances for the children’s test sets CHts1 and PFts on their respective systems for different dimensions of the truncated test features on corresponding adults’ speech trained models with matching feature dimensions are given in Table 7.1. It is noted that the recognition performance for the children’s test set improves consistently with MFCC truncation on both connected digit and continuous speech recognition tasks. The best relative improvement of about 54% is obtained over the baseline for the CHts1 test set for 12-dimensional MFCC features (include 4-dimensional base features and their first and second order temporal derivatives) while the best relative improvement of about 38% is obtained over the baseline for the PFts test set using 18-dimensional MFCC features (include 6-dimensional base features and their first and second order temporal derivatives). Thus, greater degree of cepstral truncation of default base MFCC features helps children’s speech recognition on the adults’ speech trained models.

It seemed bit surprising that such an improved children’s ASR performance could be obtained on the adults’ speech trained models with base MFCC feature length of as low as 4 (on connected digit recognition task) and 6 (on continuous speech recognition task) in comparison to default base MFCC feature length of 13. So, to explore the reason for such behavior, the contribution of each of the coefficients of the default 39-D MFCC features (C₀−C₁₂ base MFCC features and their first and second order temporal derivatives) to the children’s ASR performance is evaluated. For sake of ease, the study is done on a connected digit recognition task.

For determining the contribution of each of the coefficients of the default 39-D MFCC features to the children’s ASR performance, separate acoustic model sets are derived from the 39-D baseline

able7.1:Performancesforchildren’stestsetusingMFCCfeaturesconsistingofvaryingbasefeaturedimensionsonbothconnecteddigitrecognition ndcontinuousspeechrecognitiontasks.Forrecognition,thevarioustruncatedbaseMFCCfeaturesarealsoappendedwiththeircorrespondingfirstand condordertemporalderivatives. %WER RecognitionBaseMFCCFeatures TaskDefaultC0−C11C0−C10C0−C9C0−C8C0−C7C0−C6C0−C5C0−C4C0−C3C0−C2 (C0−C12) Connected11.3711.2011.3810.719.037.806.776.216.035.215.47 Digit Continuous56.3452.1048.3944.7242.1939.8939.0235.1338.2541.5040.62 Speech

Table 7.2: Performances (in descending order) of each of the coefficients of the 39-D default MFCC features for children’s test set CHts1 on models trained on adults’ speech data set ADtr.

Rank Coefficient % WER Rank Coefficient % WER Rank Coefficient % WER

1 ∆C₀ 63.12 14 C₁₀ 88.16 27 ∆C₆ 96.29

2 C₁ 67.41 15 C₆ 88.51 28 ∆C₁₁ 96.33

3 C₀ 69.93 16 C₅ 89.17 29 ∆C₉ 96.49

4 ∆C₁ 72.98 17 C₈ 90.24 30 ∆C₁₀ 96.86

5 ∆C₂ 73.16 18 C₉ 91.49 31 ∆∆C₅ 97.24

6 ∆∆C₀ 74.47 19 C₁₁ 92.07 32 ∆C₇ 97.41

7 C₂ 76.25 20 ∆∆C₃ 92.24 33 ∆∆C₆ 97.47

8 ∆∆C₂ 84.50 21 C₇ 92.39 34 ∆∆C₁₂ 97.74

9 C₃ 85.43 22 C₁₂ 92.57 35 ∆∆C₉ 97.80

10 C₄ 85.66 23 ∆∆C₄ 93.71 36 ∆∆C₁₁ 97.84

11 ∆∆C₁ 86.38 24 ∆C₅ 94.95 37 ∆∆C₈ 98.03

12 ∆C₃ 87.87 25 ∆C₁₂ 95.70 38 ∆∆C₇ 98.23

13 ∆C₄ 88.12 26 ∆C₈ 96.27 39 ∆∆C₁₀ 98.28

acoustic model set trained on ADtr data set for each MFCC coefficient. The model set corresponding to a MFCC coefficient is derived by selecting the dimension corresponding to that particular coefficient from all mean and variance parameters of the 39-D baseline acoustic model set while keeping all the mixture weights and the transition probabilities same as default. The recognition performance of CHts1 data set is computed on each of those separate acoustic models with matching dimension of MFCC test features. This procedure is repeated separately for all coefficients of the default 39-D MFCC features and the recognition performances in descending order on the children’s test set CHts1 computed using each of the individual coefficients of the 39-D MFCC features are given in Table 7.2.

It is to note that some coefficients of the default 39-D MFCC features seem to have no significant role in children’s ASR on adults’ speech trained models.

Further, the performances with different MFCC features obtained by taking top performing coefficients as noted from Table 7.2 are also computed and are given in Table 7.3 along with those obtained with truncated MFCC features of matching length for ease of comparison. It is noted that the best relative improvement of 53% is obtained in the children’s ASR performance using rank-ordered MFCC features of 12 dimensions for both training and test speech data. It is also noted that comparable

able7.3:Performancesforchildren’stestsetCHts1onmodelstrainedonadults’speechdatasetADtrusingMFCCfeaturevectorsofdifferent imensionsselectingthetopdcoefficientsfromtherank-orderedlistgiveninTable7.2basedonthecontributionofeachofthecoefficientsof39-DMFCC aturevectortothespeechrecognitionperformancesforthechildren’stestsetCHts1onmodelstrainedonadults’speechdatasetADtr,referredtoas ankBasedFeatureSelection’.Foreaseofcomparison,thechildren’sASRperformancesforCHts1testsetcorrespondingtoMFCCfeaturestruncated differentdimensionsobtainedinTable7.1arealsogiven. %WERonConnectedDigitRecognitionTask MethodMFCCFeatureLength 36912151821242730333639 RankBased24.5110.245.995.356.037.818.648.739.329.9310.4110.7411.37 FeatureSelection CepstralTruncationBased--5.475.216.036.216.777.809.0310.7111.3811.2011.37 FeatureSelection

Table 7.4: Rank ordering of each of the coefficients of the 39-D MFCC features based on their ASR performances for children’s test set CHts1 on models trained on adults’ speech data set ADtr within the base, the ∆ and the ∆∆ feature streams.

Base MFCC ∆ MFCC ∆∆ MFCC

Rank Coefficient Rank Coefficient Rank Coefficient

1 C1 1 ∆C0 1 ∆∆C0

2 C0 2 ∆C1 2 ∆∆C2

3 C2 3 ∆C2 3 ∆∆C1

4 C3 4 ∆C3 4 ∆∆C3

5 C4 5 ∆C4 5 ∆∆C4

6 C₁₀ 6 ∆C₅ 6 ∆∆C₅

7 C₆ 7 ∆C₁₂ 7 ∆∆C₆

8 C₅ 8 ∆C₈ 8 ∆∆C₁₂

9 C₈ 9 ∆C₆ 9 ∆∆C₉

10 C₉ 10 ∆C₁₁ 10 ∆∆C₁₁

11 C₁₁ 11 ∆C₉ 11 ∆∆C₈

12 C₇ 12 ∆C₁₀ 12 ∆∆C₇

13 C₁₂ 13 ∆C₇ 13 ∆∆C₁₀

improvements are obtained in the children’s ASR performance along with similar reduction in the overall dimensionality of MFCC features from both rank based low-dimensional MFCC features and low-dimensional MFCC features obtained through cepstral truncation. However, it is difficult to ex- plain the spectral meaning of such arbitrary selection of coefficients in case of rank based MFCC features.

On observing the top 12 performing coefficients of 39-D MFCC features from Table 7.2, it is noted that the rank based 12-D MFCC features comprise mainly the coefficients up to C₃ and their first and second order derivatives. Further, on observing the rank ordering of all coefficients among the base, the ∆ and the ∆∆ MFCC feature streams in Table 7.4 based on their performances as noted in Table 7.2, it is noted that the higher contributions to recognition performance in each of the three streams of MFCC features have come from the lower order coefficients. This justifies the comparable best improvement obtained in the children’s ASR performance for CHts1 test set using truncated base MFCC features of 4 dimensions (C₀−C₃).

Dalam dokumen PDF gyan.iitg.ernet.in (Halaman 140-146)