
Role of MFCC Feature Truncation in Pitch Mismatch Reduction

In document PDF gyan.iitg.ernet.in (pages 146-150)

measured for different orders of truncation of the MFCC features. For this study, the TIMIT data is used, and nearly 2000 frames (central steady-state portions) of 7 different vowels are extracted from speech signals belonging to the 75-100 Hz, 100-125 Hz and 200-250 Hz pitch ranges. The average pitch (F0) of the signals is estimated using the ESPS tool available in the Wavesurfer software package [88], as described in Section 2.2. The variances of the squared MD of the MFCC feature vectors of the ‘low’ (100-125 Hz) and the ‘high’ (200-250 Hz) pitch group signals from the distribution of MFCC features of the 75-100 Hz pitch group signals are measured for different truncations of the MFCC features and for different vowels; those for a few representative vowels are given in Table 7.5. The MD is computed for the MFCC feature vectors of all signals of both pitch groups using Eqn. 7.1, where x is the MFCC feature vector whose distance is to be computed, µL is the mean and ΣL is the diagonal covariance of the distribution of MFCC features of the 75-100 Hz pitch group signals. The table also shows the ratio of the variance of the squared MD of the feature vectors of the ‘high’ pitch group signals to that of the ‘low’ pitch group signals for different orders of truncation of the MFCC features.
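The measurement described above can be sketched in a few lines. Eqn. 7.1 is not reproduced in this excerpt, so the sketch assumes the usual squared Mahalanobis distance, which for a diagonal covariance reduces to a sum of squared deviations divided by the per-dimension variances. The frames are synthetic stand-ins (not TIMIT data), and all function names and the spread values are hypothetical choices made only for illustration.

```python
import random
import statistics

def squared_md(x, mean, var):
    """Squared Mahalanobis distance of x from a diagonal-covariance Gaussian."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))

def fit_diag_gaussian(frames):
    """Per-dimension mean and variance of a list of feature vectors."""
    columns = list(zip(*frames))
    mean = [statistics.fmean(col) for col in columns]
    var = [statistics.pvariance(col) for col in columns]
    return mean, var

def truncate(vec, keep):
    """Keep only the coefficients whose indices are listed in `keep`."""
    return [vec[i] for i in keep]

def variance_of_squared_md(frames, mean, var, keep):
    """Variance of the squared MD over frames, after truncating both the
    features and the reference mean/variance to the same coefficients."""
    m, v = truncate(mean, keep), truncate(var, keep)
    distances = [squared_md(truncate(x, keep), m, v) for x in frames]
    return statistics.pvariance(distances)

# Synthetic stand-ins for the three pitch groups (assumption: 12 coefficients,
# with the 'high' group deviating more in the higher-order coefficients).
rng = random.Random(0)
DIM = 12

def synthetic_frames(n, spreads):
    return [[rng.gauss(0.0, s) for s in spreads] for _ in range(n)]

reference = synthetic_frames(2000, [1.0] * DIM)   # stands in for 75-100 Hz
low = synthetic_frames(2000, [1.1] * DIM)         # stands in for 100-125 Hz
high = synthetic_frames(2000, [1.0 + 0.15 * i for i in range(DIM)])  # 200-250 Hz

mu_L, var_L = fit_diag_gaussian(reference)

results = {}
for n_keep in (12, 9, 6, 3):
    keep = range(n_keep)  # keep the first n_keep coefficients
    v_low = variance_of_squared_md(low, mu_L, var_L, keep)
    v_high = variance_of_squared_md(high, mu_L, var_L, keep)
    results[n_keep] = (v_low, v_high, v_high / v_low)
    print(f"first {n_keep:2d} coeffs: low={v_low:8.1f} high={v_high:8.1f} "
          f"ratio={v_high / v_low:5.1f}")
```

Because the reference distribution is modelled with a diagonal covariance, truncating the model amounts to dropping the same entries from its mean and variance vectors as from the feature vector.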

From Table 7.5, it is noted that, similar to our observation in Section 3.2.1, for MFCC features of all lengths the feature vectors of the high pitch group signals have larger variances of squared MD with respect to the 75-100 Hz pitch group distribution than those of the low pitch group signals. As the degree of truncation of the MFCC features increases, the variance of squared MD of the feature vectors of both pitch groups decreases for all vowels. However, the decrease in the variances of squared MD is larger for the high pitch group signals than for the low pitch group signals in all cases.

Also, note that a greater decrease in the variances of squared MD is observed when the higher order MFCCs (beyond C4) are excluded from the feature vector than when the lower order MFCCs (up to C4) are excluded, for both pitch groups. Considering the ratio of the variances of squared MD of the feature vectors of the high and the low pitch groups, it is noted that the ratio decreases significantly as more of the higher order MFCCs are truncated, with the largest decrease on exclusion of the coefficients C10−C12. This indicates that, as truncation of the MFCC features increases, the feature vectors of the high pitch group signals come closer to the distribution of MFCC features of the 75-100 Hz pitch group signals, to an extent similar to those of the low pitch group, and more so when the higher order MFCCs are truncated, which reduces the pitch mismatch. This supports our earlier hypothesis that the higher order MFCCs have much higher distances in comparison to those for the lower order MFCCs for high pitch signals compared to those

Table 7.5: Variances of squared Mahalanobis distances of MFCC feature vectors of the ‘low’ (100-125 Hz) and the ‘high’ (200-250 Hz) pitch group signals from the distribution of MFCC features of 75-100 Hz pitch group signals with different truncations of MFCC features for different vowels.

Low: F0 = 100-125 Hz. High: F0 = 200-250 Hz. Ratio: High / Low.

Base MFCC  |      Vowel /AE/       |      Vowel /IY/       |      Vowel /UW/
features   |   Low    High  Ratio  |   Low    High  Ratio  |   Low    High  Ratio
Default    |  60.2  2553.1   42.4  |  85.3  1735.4   20.3  | 114.4  1826.5   16.0
C1−C11     |  49.4  1191.6   24.1  |  75.8  1076.8   14.2  |  89.8   968.7   10.8
C1−C10     |  41.1   459.8   11.2  |  60.9   524.3    8.6  |  66.2   339.7    5.1
C1−C9      |  34.2   195.1    5.7  |  50.5   254.8    5.0  |  60.1   236.6    3.9
C1−C8      |  29.0   153.8    5.3  |  41.7   101.1    2.4  |  36.4    78.3    2.2
C1−C7      |  23.7   111.4    4.7  |  31.3    71.5    2.3  |  22.4    54.5    2.4
C1−C6      |  20.2    55.4    2.7  |  22.8    53.6    2.4  |  17.1    31.6    1.8
C1−C5      |  14.2    53.2    3.7  |  16.2    47.6    2.9  |  13.7    31.1    2.3
C1−C4      |  10.1    30.5    3.0  |  12.0    40.9    3.4  |   9.5    21.2    2.2
C1−C3      |   7.9    27.0    3.4  |   8.4    30.4    3.6  |   6.7    13.4    2.0
C1−C2      |   5.0    13.8    2.8  |   4.5     5.7    1.3  |   3.6     4.8    1.3
C11−C12    |   6.1   429.6   70.4  |   7.1   624.5   88.0  |  10.2   414.3   40.6
C10−C12    |   9.1   662.6   72.8  |  12.3  1017.4   82.7  |  13.0   544.2   41.9
C9−C12     |  11.4   718.5   63.0  |  17.0  1232.2   72.5  |  21.1  1156.0   54.8
C8−C12     |  15.4   806.0   52.3  |  23.0  1217.3   52.9  |  21.7  1119.2   51.6
C7−C12     |  19.4  1196.5   61.7  |  26.7  1323.8   49.6  |  31.8  1199.9   37.7
C6−C12     |  27.5  1901.4   69.1  |  32.1  1309.0   40.8  |  35.6  1218.1   34.2
C5−C12     |  31.9  2258.7   70.8  |  42.2  1360.9   32.2  |  43.2  1186.3   27.5
C4−C12     |  38.3  2492.4   65.1  |  52.2  1450.9   27.8  |  55.5  1284.3   23.1
C3−C12     |  44.0  2394.1   54.4  |  65.4  1538.4   23.5  |  80.2  1503.8   18.8
C2−C12     |  54.9  2411.5   43.9  |  78.4  1626.3   20.7  |  93.9  1518.2   16.2

[Figure 7.1 plots. Panel (a): Linear DFT, magnitude (dB) versus frequency (0 to 3.5 kHz). Panel (b): magnitude (dB) versus frequency (0 to 3.5 kHz), with curves for 13 MFCC, 10 MFCC, 7 MFCC and 4 MFCC.]

Figure 7.1: Plots for vowel /IY/ having pitch value of around 300 Hz (a) Linear DFT spectrum (b) Smoothed Mel spectra corresponding to the base MFCC features of different dimensions.

corresponding to the low pitch signals. Since children’s speech has significantly higher pitch values than adults’ speech, the exclusion of the higher order coefficients from the default 13-D base MFCC features (C0−C12) helps in reducing the distance between the children’s test features and the adults’ speech trained models by reducing their pitch mismatch.
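As a quick check of the ratio trend discussed above, the following sketch recomputes the high-to-low variance ratio for vowel /AE/ from the Table 7.5 values for the truncations that drop higher order coefficients. The numbers are copied from the table; the variable names, and the choice to express the early drop as a fraction of the total drop, are only illustrative.

```python
# (low-pitch variance, high-pitch variance) of squared MD for vowel /AE/,
# copied from Table 7.5, ordered by increasing truncation of higher-order MFCCs.
ae_variances = [
    ("Default", 60.2, 2553.1),
    ("C1-C11", 49.4, 1191.6),
    ("C1-C10", 41.1, 459.8),
    ("C1-C9", 34.2, 195.1),
    ("C1-C8", 29.0, 153.8),
    ("C1-C7", 23.7, 111.4),
    ("C1-C6", 20.2, 55.4),
    ("C1-C5", 14.2, 53.2),
    ("C1-C4", 10.1, 30.5),
    ("C1-C3", 7.9, 27.0),
    ("C1-C2", 5.0, 13.8),
]

# Ratio of the high-pitch variance to the low-pitch variance for each row.
ratios = {name: high / low for name, low, high in ae_variances}
for name, ratio in ratios.items():
    print(f"{name:8s} ratio = {ratio:5.1f}")

# Share of the overall drop in the ratio (Default to C1-C2) that has already
# occurred by the time the features are truncated to C1-C9.
total_drop = ratios["Default"] - ratios["C1-C2"]
early_drop = ratios["Default"] - ratios["C1-C9"]
early_share = early_drop / total_drop
print(f"share of total drop reached at C1-C9: {early_share:.2f}")
```

The ratio is not strictly monotonic in the later rows (for example, C1−C6 versus C1−C5), so the sketch compares end points rather than assuming a steady decline.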

To illustrate the effect of varied truncation of MFCC features on their corresponding spectra, the plots of the smoothed spectra corresponding to various truncated base feature lengths (including C0) for vowel /IY/ having a pitch value of around 300 Hz are shown in Figure 7.1. Note that with increased truncation, the pitch-dependent distortions in the smoothed Mel spectral envelope are better smoothed out. This explains the reduction in the pitch mismatch by increased truncation of higher order MFCCs. However, with greater truncation of higher order MFCCs the spectral peaks (formants) also start getting smoothed out. This truncation is applied to both the test features and the mean and variance parameters of the acoustic model, which corresponds to similar smoothing being applied to the adults’ speech spectra as well. This results in a reduction in the spectral mismatch between the adults’ and children’s speech spectra due to vocal tract length (VTL) differences. Thus, increased MFCC feature truncation for children’s ASR on adults’ speech trained models has the potential to reduce the acoustic mismatch between the adults’ and the children’s speech arising from the pitch differences, the VTL differences and any other sources of acoustic mismatch which induce fast varying changes in the speech spectrum.
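The smoothing effect of truncation can be illustrated with a toy example. The sketch below assumes that the cepstral coefficients are an unnormalized DCT-II of a log Mel spectrum and that the smoothed spectrum is the inverse transform with the discarded coefficients set to zero. The 24-channel spectrum is synthetic: a slowly varying "envelope" plus a fast ripple standing in for pitch harmonics. The channel count, amplitudes and function names are all hypothetical.

```python
import math

M = 24  # assumed number of Mel filterbank channels (hypothetical)

def basis(k, m):
    """DCT-II cosine basis function k evaluated at channel m."""
    return math.cos(math.pi * k * (m + 0.5) / M)

def dct2(spec):
    """Unnormalized DCT-II of a log Mel spectrum, giving C0..C(M-1)."""
    return [sum(s * basis(k, m) for m, s in enumerate(spec)) for k in range(M)]

def smoothed_spectrum(cep, n_keep):
    """Inverse transform using only C0..C(n_keep-1); the rest count as zero."""
    return [
        cep[0] / M + (2.0 / M) * sum(cep[k] * basis(k, m) for k in range(1, n_keep))
        for m in range(M)
    ]

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Toy log Mel spectrum: slow envelope (orders 1 and 5) plus a fast ripple
# (order 10) that plays the role of pitch-dependent distortion.
envelope = [-20.0 + 8.0 * basis(1, m) + 5.0 * basis(5, m) for m in range(M)]
ripple = [6.0 * basis(10, m) for m in range(M)]
log_mel = [e + r for e, r in zip(envelope, ripple)]

cep = dct2(log_mel)

# Deviation of each smoothed spectrum from the ripple-free envelope.
errors = {n: rms_error(smoothed_spectrum(cep, n), envelope) for n in (13, 10, 7, 4)}
for n, err in errors.items():
    print(f"{n:2d} coefficients: RMS deviation from envelope = {err:.3f}")
```

In this toy case the 13-coefficient reconstruction still carries the ripple, the 10- and 7-coefficient reconstructions match the envelope, and the 4-coefficient reconstruction loses the order-5 envelope detail, mirroring the trade-off between removing pitch-dependent distortion and smoothing out spectral peaks. Real Mel spectra will not separate this cleanly.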

7.4 Adaptive MFCC Feature Truncation for Pitch Mismatch Re-
