

From the PDF document at gyan.iitg.ernet.in (pages 151-157)

7.4 Adaptive MFCC Feature Truncation for Pitch Mismatch Reduction

7.4.2 Proposed Algorithm for Adults’ Speech Trained ASR Models

Based on the observed correlation between the appropriate MFCC test-feature truncation and the degree of acoustic mismatch, we further propose an automatic algorithm that performs utterance-specific truncation of the default base MFCC features of a test signal, depending on its acoustic mismatch with respect to the adults' speech trained models. To demonstrate the generality of the proposed algorithm, ASR performance is evaluated with the same algorithm for both the children's and the adults' test sets, on both the connected digit recognition and the continuous speech recognition tasks.
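As a concrete illustration, truncating the base MFCC vector and re-appending temporal derivatives can be sketched as follows. This is a minimal sketch, not the thesis implementation: the function name, the dummy frame count, and the use of `numpy.gradient` as the derivative estimator are assumptions.

```python
import numpy as np

def truncate_mfcc(features, keep):
    """Keep coefficients C0..C(keep-1) of a (frames x 13) base MFCC
    matrix and append first- and second-order temporal derivatives,
    yielding a (frames x 3*keep) feature matrix."""
    base = features[:, :keep]             # truncated static features
    delta = np.gradient(base, axis=0)     # first-order derivatives
    delta2 = np.gradient(delta, axis=0)   # second-order derivatives
    return np.hstack([base, delta, delta2])

# Default 39-D features keep all 13 coefficients (C0-C12); a heavily
# mismatched child utterance might keep only C0-C3 (length 4).
frames = np.random.randn(100, 13)         # dummy base MFCC frames
print(truncate_mfcc(frames, 13).shape)    # (100, 39)
print(truncate_mfcc(frames, 4).shape)     # (100, 12)
```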

Table 7.6: Performance for children's test set CHts1 on models trained on adults' speech data set ADtr for various truncations of base MFCC features along with their VTLN warp factor-wise breakup. The truncated base MFCC features are also appended with their corresponding first and second order temporal derivatives for recognition. The quantity in parentheses gives the number of utterances corresponding to that VTLN warp factor.

MFCC Base         % WER by VTLN Warp Factor Value
Feature Length    All      0.88    0.90    0.92   0.94   0.96   0.98   1.00   1.02   1.04   1.06   1.08    1.10
                  (7,772)  (1384)  (4782)  (740)  (278)  (366)  (88)   (112)  (9)    (8)    (1)    (2)     (2)
Default (C0−C12)  11.37    16.24   10.89   8.00   14.45  6.36   4.02   12.37  22.22  6.67   0.00   100.00  100.00
C0−C11            11.20    16.27   10.83   7.37   13.01  6.00   3.57   10.60  22.22  6.67   0.00   100.00  50.00
C0−C10            11.38    16.12   11.10   7.78   12.86  6.00   3.57   10.60  16.67  6.67   0.00   50.00   100.00
C0−C9             10.71    14.58   10.57   6.91   12.57  5.91   3.57   11.66  16.67  6.67   0.00   50.00   100.00
C0−C8             9.03     12.25   8.96    6.14   10.26  4.39   2.23   9.19   22.22  6.67   0.00   0.00    50.00
C0−C7             7.80     10.49   7.71    5.41   9.68   4.03   2.68   7.42   11.11  0.00   0.00   0.00    50.00
C0−C6             6.77     9.68    6.62    4.32   8.24   3.31   2.23   6.71   5.56   0.00   0.00   50.00   50.00
C0−C5             6.21     8.89    6.01    4.18   7.51   3.13   2.68   6.01   5.56   6.67   0.00   50.00   50.00
C0−C4             6.03     8.50    5.77    4.23   7.80   3.58   2.23   7.77   5.56   6.67   0.00   50.00   50.00
C0−C3             5.21     6.37    5.10    4.41   6.07   3.58   3.57   6.36   0.00   6.67   0.00   50.00   50.00
C0−C2             5.47     6.64    5.36    4.55   7.80   3.13   3.12   6.71   0.00   6.67   0.00   100.00  100.00

Table 7.7: Performance for children’s test set PFts on models trained on adults’ speech data set CAMtr for various truncations of base MFCC features along with their VTLN warp factor-wise breakup.

MFCC Base         % WER by VTLN Warp Factor Value
Feature Length    All    0.88   0.90   0.92   0.94   0.96   0.98   1.06   1.08
                  (129)  (91)   (24)   (7)    (1)    (2)    (2)    (1)    (1)
Default (C0−C12)  56.34  64.07  40.97  26.32  25.00  8.42   94.44  42.86  5.13
C0−C11            52.10  58.70  38.95  23.08  25.00  10.53  93.52  42.86  5.13
C0−C10            48.39  54.76  34.21  22.67  25.00  8.42   93.52  42.86  5.13
C0−C9             44.72  50.21  31.48  22.67  25.00  9.47   98.15  28.57  5.13
C0−C8             42.19  47.27  28.96  22.27  25.00  8.42   99.07  30.95  5.13
C0−C7             39.89  44.87  28.15  16.60  25.00  8.42   90.74  26.19  5.13
C0−C6             39.02  43.99  27.35  16.19  13.64  8.42   93.52  21.43  5.13
C0−C5             35.13  39.56  23.71  14.98  13.64  9.47   88.89  26.19  2.56
C0−C4             38.25  42.25  29.26  16.60  22.73  9.47   90.74  23.81  2.56
C0−C3             41.50  45.62  35.22  15.38  9.09   12.63  86.11  21.43  2.56
C0−C2             40.62  44.22  33.70  21.05  9.09   25.26  82.41  16.67  0.00

It has already been noted that adults' speech (matched) and children's speech (mismatched) exhibit different degrees of acoustic mismatch with respect to the adults' speech trained models, and therefore call for different truncations of the base MFCC features for improved ASR: less truncation for matched adults' speech and greater truncation for mismatched children's speech. Thus it can be concluded that, in general, matched test speech requires less truncation than mismatched test speech for recognition. It has also been observed, for both matched and mismatched ASR, that the appropriate base MFCC feature truncation increases with the degree of acoustic mismatch between the test speech and the adults' speech trained models. Following these observations, for the sake of generality, we propose a piece-wise linear relationship between the length of the base MFCC features and the VTLN warp factor (α̂) for both matched and mismatched test speech signals, shown graphically in Figure 7.2.
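Such a piece-wise linear relation can be realized as interpolation over a small set of knots. The sketch below is illustrative only: the knot values are hypothetical placeholders, since the actual curve is the one defined graphically in Figure 7.2.

```python
import numpy as np

# Hypothetical knots (warp factors, base feature lengths); the real
# values are those plotted in Figure 7.2, not these placeholders.
MISMATCHED_KNOTS = ([0.88, 0.96, 1.00], [4, 6, 13])    # children on adult models
MATCHED_KNOTS    = ([0.88, 1.00, 1.12], [13, 13, 13])  # adults on adult models

def base_feature_length(alpha, mismatched):
    """Map an estimated VTLN warp factor to a base MFCC feature
    length via piece-wise linear interpolation over the knots."""
    xs, ys = MISMATCHED_KNOTS if mismatched else MATCHED_KNOTS
    return int(round(np.interp(alpha, xs, ys)))

print(base_feature_length(0.88, mismatched=True))    # 4  -> keep C0-C3
print(base_feature_length(1.00, mismatched=False))   # 13 -> full C0-C12
```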

Table 7.8: Performance for adults' test set ADts on models trained on adults' speech data set ADtr for various truncations of base MFCC features along with their VTLN warp factor-wise breakup.

MFCC Base         % WER by VTLN Warp Factor Value
Feature Length    All      0.88  0.90   0.92   0.94   0.96   0.98   1.00   1.02   1.04   1.06  1.08  1.10  1.12
                  (3,303)  (46)  (646)  (424)  (301)  (460)  (166)  (662)  (178)  (232)  (64)  (42)  (52)  (30)
Default (C0−C12)  0.43     0.00  0.33   0.21   0.22   0.26   0.36   0.45   0.16   0.75   0.90  3.31  2.14  1.43
C0−C11            0.43     0.00  0.33   0.28   0.34   0.20   0.18   0.36   0.33   0.99   0.90  2.65  2.14  1.43
C0−C10            0.44     0.00  0.29   0.42   0.22   0.26   0.18   0.27   0.33   1.24   0.90  3.31  2.14  1.43
C0−C9             0.51     0.00  0.38   0.35   0.22   0.33   0.54   0.27   0.33   1.24   1.35  3.97  2.86  1.43
C0−C8             0.55     0.00  0.48   0.28   0.34   0.20   0.54   0.45   0.33   1.24   1.35  3.97  2.86  1.43
C0−C7             0.55     1.16  0.52   0.28   0.34   0.20   0.90   0.31   0.33   1.37   1.35  3.97  1.43  1.43
C0−C6             0.53     0.00  0.52   0.35   0.34   0.13   0.90   0.31   0.16   1.49   0.90  3.97  1.43  1.43
C0−C5             0.59     0.00  0.52   0.28   0.22   0.20   0.90   0.58   0.49   1.37   0.90  3.97  2.14  1.43
C0−C4             0.58     0.00  0.57   0.28   0.22   0.33   0.72   0.45   0.49   1.24   1.35  4.64  0.71  2.86
C0−C3             0.57     0.00  0.57   0.49   0.11   0.33   0.90   0.40   0.33   0.99   1.35  3.97  1.43  2.86
C0−C2             0.98     0.00  0.95   0.56   0.56   0.79   0.90   0.85   0.82   1.49   1.80  6.62  2.86  2.86

Table 7.9: Performance for adults’ test set CAMts on models trained on adults’ speech data set CAMtr for various truncations of base MFCC features along with their VTLN warp factor-wise breakup.

MFCC Base         % WER by VTLN Warp Factor Value
Features          All    0.92   0.94   0.96   0.98   1.00   1.02   1.04   1.06   1.08   1.12
                  (314)  (8)    (14)   (39)   (79)   (70)   (13)   (46)   (8)    (33)   (4)
Default (C0−C12)  9.92   6.31   8.87   13.76  8.12   9.82   13.71  13.93  4.93   5.01   9.76
C0−C11            9.76   6.31   7.39   13.30  7.89   9.42   13.31  14.55  4.93   5.18   9.76
C0−C10            10.28  4.50   8.37   13.15  8.82   10.06  11.69  14.68  10.56  5.87   9.76
C0−C9             9.92   5.41   7.88   12.54  9.13   9.19   11.69  14.55  4.23   6.22   9.76
C0−C8             10.60  5.41   9.36   13.46  9.82   9.50   12.90  15.68  5.63   6.56   4.88
C0−C7             11.02  5.41   10.84  13.91  10.21  10.30  12.10  15.81  6.34   6.56   7.32
C0−C6             11.86  5.41   9.85   15.29  10.90  11.66  12.10  17.31  6.34   6.56   7.32
C0−C5             12.95  7.21   12.81  16.97  11.60  12.38  13.71  18.70  4.23   7.94   9.76
C0−C4             15.08  7.21   15.76  18.96  13.77  14.54  14.11  22.08  9.86   8.64   7.32
C0−C3             21.52  13.51  24.14  27.68  20.42  20.69  16.94  28.23  13.38  14.51  17.07
C0−C2             29.15  24.32  33.99  37.77  27.38  27.16  25.00  37.39  16.90  21.42  14.63

To automate the selection of the appropriate MFCC feature truncation for a test signal, the input speech must first be categorized as that of an adult or of a child, based on its gross acoustic mismatch with respect to the recognition models. With respect to adults' speech trained models, children's speech spectra may require the frequency axis to be scaled by a factor as low as 0.88 in order to normalize the differences in their formant frequencies, which would be unlikely for adults' speech spectra. Therefore, the log likelihoods of the test signal's default MFCC features and of its features corresponding to a VTLN warp factor of 0.88, both already computed against the adults' speech trained 39-D baseline models during VTLN warp factor estimation, are compared. The input test speech is categorized as child's speech if its likelihood is higher for the features corresponding to the VTLN warp factor of 0.88 than for the default

[Figure 7.2 plot: appropriate length of the base MFCC features (y-axis, 0 to 15) versus VTLN warp factor (x-axis, 0.88 to 1.12), with separate piece-wise linear curves for matched (ΔL < 0) and mismatched (ΔL > 0) test speech.]

Figure 7.2: Graph showing the proposed relation between the length of base MFCC features and the VTLN warp factor for both matched and mismatched test speech signals.

MFCC features; otherwise it is categorized as adult's speech. The flow diagram of the algorithm proposed for ASR on adults' speech trained models is shown in Figure 7.3. Once the input test speech is categorized as adult's or child's speech, the appropriate length of the base MFCC features for its recognition is chosen, corresponding to its VTLN warp factor estimated with respect to the 39-D baseline models, from the piece-wise linear relation given in Figure 7.2.
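The categorization and warp-factor estimation step can be sketched as follows, assuming the per-warp-factor log likelihoods have already been computed during VTLN estimation. The dictionary layout and the function name are assumptions for illustration.

```python
def categorize_and_estimate(logliks):
    """Given log likelihoods of the warped features w.r.t. the 39-D
    adults' baseline models, keyed by VTLN warp factor, return the
    adult/child decision and the estimated warp factor alpha-hat."""
    alpha_hat = max(logliks, key=logliks.get)  # maximum-likelihood warp factor
    # Child's speech if the 0.88-warped features are more likely than
    # the default (unwarped, alpha = 1.00) features.
    is_child = logliks[0.88] > logliks[1.00]
    return ("child" if is_child else "adult"), alpha_hat

# Toy likelihoods for an utterance whose spectra fit best at 0.90:
logliks = {0.88: -410.0, 0.90: -405.0, 1.00: -430.0, 1.12: -470.0}
print(categorize_and_estimate(logliks))   # ('child', 0.9)
```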

The recognition performances for both the children's and the adults' test sets using the proposed algorithm on the adults' speech trained models, on both the digit and the continuous speech recognition tasks, are given in Table 7.10. Using the proposed algorithm for adaptive truncation of MFCC features, relative improvements of 38% and 36% over the baseline are obtained in children's ASR performance on adults' speech trained models, on the connected digit recognition and the continuous speech recognition tasks respectively. Comparing the improvements obtained with a fixed truncation of MFCC features for all test signals against those obtained with the proposed adaptive truncation algorithm, comparable performances are observed on the continuous speech recognition task. The smaller improvement of the adaptive algorithm relative to fixed truncation on the connected digit recognition task is attributed to the proposed relation between the VTLN warp factor and the base MFCC feature length. Since the proposed relation is favored for

[Figure 7.3 flow diagram: the speech signal is passed through MFCC feature (13-D + Δ + ΔΔ) computation with linear frequency warping for VTLN over 0.88 ≤ α ≤ 1.12, producing warped features X0.88, X1.00, X1.12, etc.; log likelihoods Lα of the warped features are computed with respect to the 39-D baseline models (adults); the maximum likelihood over α gives the estimated VTLN warp factor α̂, while ΔL, from the comparison of L0.88 and L1.00, gives the adult/child categorization; a lookup table then maps α̂ to the base MFCC feature length.]

Figure 7.3: Flow diagram of the proposed algorithm to determine the appropriate length of base MFCC features for recognizing a test speech signal on adults’ speech trained models. Here the ‘Lookup Table’ refers to the proposed relation between the length of base MFCC features and the VTLN warp factor shown graphically in Figure 7.2.

the continuous speech recognition task, the improvements obtained with it on the connected digit recognition task are limited, since the children's test speech on that task actually requires a greater degree of truncation than on the continuous speech recognition task. On recognizing the adults' test data, only a slight, insignificant reduction in ASR performance is observed relative to the baseline on the adults' speech trained models. Thus the proposed algorithm is highly effective in reducing the acoustic mismatch between children's and adults' speech for children's ASR on adults' speech trained models, without using any prior knowledge about the speaker of the test utterance, and with the additional advantage of reduced MFCC feature dimensionality.
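The relative-improvement figures quoted above follow the usual definition of relative word error rate reduction; the WER values in the example below are purely illustrative, not the thesis's results.

```python
def relative_improvement(wer_baseline, wer_proposed):
    """Relative WER reduction of the proposed system over the baseline."""
    return (wer_baseline - wer_proposed) / wer_baseline

# Illustrative numbers only: a baseline WER of 50% reduced to 31%
# corresponds to the kind of 38% relative improvement reported.
print(round(relative_improvement(50.0, 31.0), 2))   # 0.38
```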
