


7.4 Adaptive MFCC Feature Truncation for Pitch Mismatch Reduction

7.4.3 Proposed Algorithm for Children’s Speech Trained ASR Models

[Figure: flow diagram. The test speech signal is passed through MFCC feature (13D-Δ-ΔΔ) computation, including linear frequency warping for VTLN, at warp factors α = 0.88, 1.00, and 1.12 (0.88 ≤ α ≤ 1.12), giving feature streams X0.88, X1.00, and X1.12. Their log likelihoods L0.88, L1.00, and L1.12 are computed w.r.t. the 39-D adults’ speech trained baseline models; a Max block selects the warp factor estimate α̂, which indexes the lookup table relating the VTLN warp factor to the base MFCC feature length.]

Figure 7.3: Flow diagram of the proposed algorithm to determine the appropriate length of base MFCC features for recognizing a test speech signal on adults’ speech trained models. Here the ‘Lookup Table’ refers to the proposed relation between the length of base MFCC features and the VTLN warp factor shown graphically in Figure 7.2.

As the proposed relation was derived on the continuous speech recognition task, the improvements obtained on the connected digit recognition task using that relation are limited, since the children’s test speech on the digit task actually requires a greater degree of truncation than is needed on the continuous speech recognition task. On recognizing the adults’ test speech data, only a slight, insignificant reduction in ASR performance is observed relative to the baseline on the adults’ speech trained models. Thus, the proposed algorithm is highly effective in reducing the acoustic mismatch between children’s and adults’ speech for children’s ASR on adults’ speech trained models, without using any prior knowledge about the speaker of the test utterance, and with the additional advantage of reduced MFCC feature dimensions.
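To make the selection flow of Figure 7.3 concrete, the following is a minimal Python sketch. The feature-extraction and scoring helpers and the lookup-table values are illustrative assumptions standing in for the real VTLN front end, the adults’ speech trained baseline models, and the relation of Figure 7.2.

```python
# Candidate VTLN warp factors explored by the algorithm (Figure 7.3).
WARP_FACTORS = (0.88, 1.00, 1.12)

# Hypothetical lookup table: VTLN warp factor -> number of base MFCCs
# to retain. The actual mapping is the relation shown in Figure 7.2;
# these values are placeholders for illustration only.
BASE_MFCC_LENGTH = {0.88: 13, 1.00: 13, 1.12: 9}

def select_feature_length(signal, extract_mfcc, log_likelihood):
    """Pick the base MFCC feature length for one test utterance.

    extract_mfcc(signal, alpha) is assumed to return the (T, 39) matrix
    of 13 base MFCCs plus deltas and delta-deltas computed with linear
    frequency warping by factor alpha; log_likelihood(features) is
    assumed to score the features against the 39-D adults' speech
    trained baseline models. Both stand in for the real front end and
    recognizer.
    """
    # Score the utterance under each candidate warp factor.
    scores = {alpha: log_likelihood(extract_mfcc(signal, alpha))
              for alpha in WARP_FACTORS}
    # The likelihood-maximizing warp factor estimates the test
    # speaker's VTLN factor.
    alpha_hat = max(scores, key=scores.get)
    # The lookup table converts alpha_hat into a base feature length.
    return alpha_hat, BASE_MFCC_LENGTH[alpha_hat]
```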

Table 7.10: Performances for children’s and adults’ test sets on adults’ speech trained models using default MFCC features (referred to as ‘Baseline’) and MFCC features derived using the proposed algorithm (referred to as ‘Proposed’) on both connected digit recognition and continuous speech recognition tasks.

% WER

Recognition Task     Children’s Mismatched ASR     Adults’ Matched ASR
                     Baseline      Proposed        Baseline    Proposed
Connected Digit      11.37         7.09            0.43        0.53
Continuous Speech    56.34         36.21           9.92        10.28

for children’s matched ASR. With respect to the children’s speech trained models, the adults’ speech spectra might need expansion by a frequency warp factor as high as 1.12, which would only very exceptionally be required in the case of children. So, to classify a test speech signal as that of an adult or a child on the children’s speech trained models, the log likelihood of the default MFCC features of the test signal is compared with that of the features corresponding to a VTLN warp factor of 1.12, both computed with respect to the children’s speech trained 39-D baseline models. The input test speech is categorized as adult’s speech if the likelihood of the features corresponding to the VTLN warp factor of 1.12 exceeds that of the default features, or else as child’s speech. The flow diagram of the algorithm proposed for ASR on children’s speech trained models is shown in Figure 7.4; it is identical to the one proposed for adults’ speech trained models shown in Figure 7.3 except for the rule employed to classify the test speech as adult’s or child’s speech.
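A minimal sketch of this decision rule follows, reusing the same assumed feature-extraction and scoring helpers as in the earlier sketch, here scoring against the children’s speech trained 39-D baseline models.

```python
def classify_speaker_on_child_models(signal, extract_mfcc, log_likelihood):
    """Label a test utterance as adult or child speech on children's
    speech trained models, following the rule described above.

    extract_mfcc and log_likelihood are the same assumed helpers as
    before, with log_likelihood scoring against the children's speech
    trained baseline models.
    """
    # Likelihood of the default (unwarped) features, alpha = 1.00.
    ll_default = log_likelihood(extract_mfcc(signal, 1.00))
    # Likelihood after spectral expansion by the extreme warp factor
    # 1.12, which children's speech would only very exceptionally need.
    ll_expanded = log_likelihood(extract_mfcc(signal, 1.12))
    # Adults' speech benefits from the 1.12 expansion; children's does not.
    return "adult" if ll_expanded > ll_default else "child"
```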

The recognition performances for both the children’s and the adults’ test sets using the proposed algorithm on the children’s speech trained models are given in Table 7.11 for both the digit and the continuous speech recognition tasks. Consistent and significant improvements in ASR performance are obtained for both the mismatched adults’ test speech and the matched children’s test speech on both tasks. Relative improvements of 49% and 31% over the baseline are obtained in children’s matched ASR performance on the connected digit and continuous speech recognition tasks, respectively. For adults’ speech recognition on children’s speech trained models, relative improvements of 35% and 10% over the baseline are obtained on the connected digit and continuous speech recognition tasks, respectively.

[Figure: flow diagram, identical in structure to Figure 7.3 but scoring the warped feature streams X0.88, X1.00, and X1.12 against the 39-D children’s speech trained baseline models; here the decision compares L1.12 against L1.00.]

Figure 7.4: Flow diagram of the proposed algorithm to determine the appropriate length of base MFCC features for recognizing a test speech signal on children’s speech trained models. Here the ‘Lookup Table’ refers to the proposed relation between the length of base MFCC features and the VTLN warp factor shown graphically in Figure 7.2.

However, it is to be noted that larger improvements are obtained for children’s speech than for adults’ speech on the children’s speech trained models. Also, the improvements for adults’ mismatched ASR are comparatively smaller than those for children’s mismatched ASR. These observations are attributed to the poorer children’s speech trained models, whose phone-model observation densities have higher variances owing to the greater inter-speaker variability of children’s speech compared to adults’ [27, 30]. This means that the class-dependent Gaussian densities have more spread, and as a result the acoustic classes are less separable in the acoustic feature space for children’s speech trained models than for adults’ speech trained models. This statement can be further substantiated by comparing the average Bhattacharyya distance (BD) [138] between phone classes for the adults’ and the children’s speech trained phone models.

Given two Gaussian distributions $\mathcal{N}(\mu_i, \Sigma_i)$ and $\mathcal{N}(\mu_j, \Sigma_j)$ representing phone ‘i’ and phone ‘j’, respectively, the BD between these distributions is computed as:

$$\mathrm{BD}(i,j) = \frac{1}{8}\,(\mu_i - \mu_j)^{T} \left( \frac{\Sigma_i + \Sigma_j}{2} \right)^{-1} (\mu_i - \mu_j) + \frac{1}{2}\,\log \frac{\left| \frac{\Sigma_i + \Sigma_j}{2} \right|}{\sqrt{|\Sigma_i|\,|\Sigma_j|}} \tag{7.5}$$

Table 7.11: Performances for children’s and adults’ test sets on children’s speech trained models using default MFCC features and MFCC features derived using the proposed algorithm on both connected digit recognition and continuous speech recognition tasks.

% WER

Recognition Task     Children’s Matched ASR     Adults’ Mismatched ASR
                     Baseline      Proposed     Baseline     Proposed
Connected Digit      1.01          0.52         13.28        8.70
Continuous Speech    12.41         8.62         68.36        61.43

Given a set of M Gaussian densities representing M phone classes, the average BD is determined as follows:

$$\mathrm{AvgBD} = \frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \mathrm{BD}(i,j) \tag{7.6}$$

The average BD, AvgBD, can be considered a statistical measure of how scattered the M phones are in the acoustic space. High values of AvgBD indicate that the phone distributions are well scattered in the acoustic space and thus more easily discriminated, while low values indicate greater overlap among the phone distributions, making the phone discrimination task harder.
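As an illustration of Equations (7.5) and (7.6), the following Python sketch computes the BD between two Gaussian densities and averages it over all phone-class pairs. The toy means and covariances at the end are placeholders, not the trained phone models.

```python
import numpy as np

def bhattacharyya_distance(mu_i, sigma_i, mu_j, sigma_j):
    """BD between N(mu_i, Sigma_i) and N(mu_j, Sigma_j), Eq. (7.5)."""
    sigma_avg = 0.5 * (sigma_i + sigma_j)
    diff = mu_i - mu_j
    # Mahalanobis-like term: (1/8) (mu_i - mu_j)^T Sigma_avg^{-1} (mu_i - mu_j)
    term1 = 0.125 * diff @ np.linalg.solve(sigma_avg, diff)
    # Log-determinant term; slogdet keeps 39-D determinants numerically stable.
    _, logdet_avg = np.linalg.slogdet(sigma_avg)
    _, logdet_i = np.linalg.slogdet(sigma_i)
    _, logdet_j = np.linalg.slogdet(sigma_j)
    term2 = 0.5 * (logdet_avg - 0.5 * (logdet_i + logdet_j))
    return term1 + term2

def average_bd(means, covariances):
    """Average BD over all M(M-1)/2 phone-class pairs, Eq. (7.6)."""
    M = len(means)
    total = sum(bhattacharyya_distance(means[i], covariances[i],
                                       means[j], covariances[j])
                for i in range(M - 1) for j in range(i + 1, M))
    return 2.0 * total / (M * (M - 1))

# Toy usage: 7 random diagonal-covariance "vowel models" in a 39-D
# feature space (placeholders for the central-state Gaussians).
rng = np.random.default_rng(0)
means = [rng.normal(size=39) for _ in range(7)]
covariances = [np.diag(rng.uniform(0.5, 2.0, size=39)) for _ in range(7)]
print(average_bd(means, covariances))
```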

To compare the inter-speaker variability of adults’ and children’s speech, the average BD is measured for both the adults’ and the children’s speech trained phone models. The phone models are built using a 3-state left-to-right topology with a single Gaussian density per state. Each speech frame is parameterized by a 39-D observation vector composed of 13 MFCCs (C0–C12) plus their first- and second-order temporal derivatives. Cepstral mean subtraction is performed on the static features on an utterance-by-utterance basis. For children, two sets of phone models are trained, one each on the CHtr and PFtr data sets, while for adults, the two sets of phone models are trained on the ADtr and CAMtr data sets. In computing the average BD, only the Gaussian densities associated with the central states of the phone models are considered, based on the assumption that the Gaussian density associated with the central state of a model reflects the acoustic characteristics of the modeled phone better than those associated with the initial and final states.

[Figure: bar graph of average Bhattacharyya distance (y-axis, 0 to 2.5) for adults’ and children’s speech trained ASR models (x-axis), for the connected digit and continuous speech recognition tasks.]

Figure 7.5: Bar graph showing average Bhattacharyya distance across vowel sounds for models trained on adults’ and children’s speech for connected digit and continuous speech recognition tasks.

The average BD for the adults’ and the children’s speech trained phone models is shown in Figure 7.5. Seven different vowels are considered in computing the BD measures in all cases. It can be noted that the average BD among the vowel distributions is greater for the adults’ speech trained models than for the children’s speech trained models, showing that the vowel distributions overlap more in the acoustic space of children’s speech, which leads to the comparatively poor classification performance of children’s speech trained models. This is attributed to the greater inter-speaker acoustic variability in children’s speech than in adults’ speech [27].
