
IMPROVING CHILDREN’S SPEECH RECOGNITION UNDER MISMATCHED CONDITION USING

The separability ε(x) for children's speech is computed using the global ABWE and the age-specific transformation with the ∆ features. A half-window size of Θ = 13 is chosen for the calculation. The effectiveness of the developed ABWE methods is demonstrated in the context of children's speech recognition under mismatched conditions.

Nature of children’s and adults’ speech signals

The narrowband spectrograms given in Figures 1.1(e) and 1.1(f) illustrate the high pitch values present in children's speech. The spectral dynamics, i.e., the peak-to-valley range, is about twice as large for children's speech as for adults' speech.

Figure 1.1: Differences in the nature of speech signals of a child (7 years old) and adult male for the English utterance ‘one three five seven’

Importance of bandwidth for speech recognition

The passband is determined by the lower (fl) and upper (fh) cutoff frequencies of the bandpass filter. (b) The syllable articulation of low-pass and high-pass filtered signals is shown as a function of the cutoff frequency. Both studies conclude that greater bandwidth is also needed for the automatic speech recognition task, especially for children's speech recognition.
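To make the bandwidth effect concrete, the short sketch below shows one way a wideband signal can be band-limited to a telephone-like passband. The cutoff values, filter order, and stand-in signal are illustrative assumptions only, not parameters taken from the thesis.

```python
import numpy as np
from scipy import signal

def simulate_narrowband(wb_speech, fs=16000, fl=300.0, fh=3400.0, order=8):
    """Band-limit a wideband signal to a telephone-like passband.

    fl and fh are the lower and upper cutoff frequencies of the bandpass
    filter; the values used here are illustrative defaults.
    """
    sos = signal.butter(order, [fl, fh], btype="bandpass", fs=fs, output="sos")
    return signal.sosfiltfilt(sos, wb_speech)

# Example with a random stand-in for one second of 16 kHz wideband speech
fs = 16000
wb = np.random.randn(fs)
nb = simulate_narrowband(wb, fs)   # telephone-band version of the signal
```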

Figure 1.2: The effect of audio bandwidth on the quality and intelligibility of speech

Artificial bandwidth extension for speech recognition

The missing vocal tract information in the range 3400–8000 Hz is approximately reconstructed from the narrowband vocal tract information. During training, the relationship between narrowband and wideband vocal tract information is learned in the form of a mapping function.
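In much of the ABWE literature, such a mapping function is realized as a minimum mean square error (MMSE) regression on a Gaussian mixture model (GMM) trained on joint narrowband/highband feature vectors. The standard form is reproduced below for reference; the thesis may use a different variant.

$$\hat{\mathbf{y}} = E[\mathbf{y} \mid \mathbf{x}] = \sum_{k=1}^{K} P(k \mid \mathbf{x}) \left( \boldsymbol{\mu}_y^{(k)} + \boldsymbol{\Sigma}_{yx}^{(k)} \big(\boldsymbol{\Sigma}_{xx}^{(k)}\big)^{-1} \big(\mathbf{x} - \boldsymbol{\mu}_x^{(k)}\big) \right)$$

Here x is the narrowband feature vector, y the estimated highband feature vector, and P(k | x) the posterior probability of mixture component k given x.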

Figure 1.4: Two state speech production model depicting source-filter nature of approximation.

Organization of thesis

Differences in the linguistic correlates of children's and adults' speech are given in section 2.4. The specific requirements of ABWE suitable for recognizing children's speech under mismatched conditions are highlighted in section 2.9.

Children’s and adults’ speech production systems

Acoustic mismatch differences in children’s and adults’ speech

The bandwidth contribution due to glottal resistance (Bg) is directly proportional to the surface area of the glottis. The bandwidth contribution due to the glottal resistance effect is greater in children than in adults.

Linguistic correlates of children’s and adults’ speech

Disfluencies decline with age and children reach adult proficiency level at around 12-13 years of age (slightly earlier for boys than girls) [48]. Thus, children's ability to use language effectively to convey a message improves with age.

Short-term features of children’s and adults’ speech

Thus, this study shows that the LP coefficients are also affected by the high pitch values in children's speech. However, the study focused only on minimizing the effect of high pitch values in children's speech.

Approaches for improving ASR under mismatched condition

Vocal tract length normalization (VTLN) for improving ASR performance

VTLN is a speaker normalization method in which the acoustic variability between speakers due to varying vocal tract lengths, i.e., the discrepancy caused by differences in formant frequencies among speakers, is reduced by warping the frequency axis of each speaker's speech spectrum [70, 71]. In practice, the spacing and width of the filters in the Mel filter bank are changed while the speech spectrum itself is kept unchanged.
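As an illustration of the warping step, the sketch below applies a piecewise-linear warp (one common VTLN variant) to the centre frequencies of a mel filter bank. The warp-factor grid, knee point, and filter-bank values are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np

def warp_frequency(f, alpha, fs=16000, knee_ratio=0.85):
    """Piecewise-linear VTLN warping of a frequency axis (one common variant).

    Frequencies below a knee point are scaled by the warp factor `alpha`;
    above the knee the mapping is linear so that fs/2 maps onto fs/2.
    All parameter values here are illustrative assumptions.
    """
    f_nyq = fs / 2.0
    knee = knee_ratio * f_nyq * min(1.0, 1.0 / alpha)
    f = np.asarray(f, dtype=float)
    return np.where(
        f <= knee,
        alpha * f,
        alpha * knee + (f_nyq - alpha * knee) / (f_nyq - knee) * (f - knee),
    )

# Warp the centre frequencies of a hypothetical mel filter bank
centres = np.linspace(100, 7900, 26)      # placeholder centre frequencies in Hz
for alpha in (0.88, 1.00, 1.12):          # typical warp-factor grid
    print(alpha, warp_frequency(centres, alpha)[:3])
```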

Maximum likelihood linear regression (MLLR)

  • MLLR-MEAN
  • MLLR-COV
  • Constrained MLLR (CMLLR)

The adaptation method in which the linear transformations are applied only to the variances of the models is referred to as 'MLLR-COV'. The corresponding transformation matrix is then used to provide a new estimate of the adapted model parameters.
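For reference, the standard forms of these adaptation transforms from the MLLR literature are given below; the thesis's exact notation (and the expression omitted above) may differ.

$$\hat{\boldsymbol{\mu}} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b} = \mathbf{W}\boldsymbol{\xi}, \quad \boldsymbol{\xi} = [1,\ \boldsymbol{\mu}^{\top}]^{\top} \qquad \text{(MLLR-MEAN)}$$

$$\hat{\boldsymbol{\Sigma}} = \mathbf{H}\,\boldsymbol{\Sigma}\,\mathbf{H}^{\top} \qquad \text{(MLLR-COV)}$$

$$\hat{\mathbf{o}}_t = \mathbf{A}'\mathbf{o}_t + \mathbf{b}' \qquad \text{(CMLLR, applied in the feature domain, up to a Jacobian term in the likelihood)}$$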

Minimizing pitch mismatch for Improving ASR Performance

The proposed algorithm automatically selects the appropriate length of the MFCC basis functions for each test signal without prior knowledge of the speaker of the test utterance. Significant improvements in children's speech recognition have been achieved using the proposed algorithms with models trained on adults' speech.

Need for ABWE for ASR under mismatched condition

Also, a method based on Mel cepstral truncation is proposed for reducing the pitch mismatch between the training and test data.

Artificial bandwidth extension: A Review

  • Motivation for ABWE
  • Frequency bands
  • Correlation between frequency bands of speech
  • Speech bandwidth extension with side information
  • Different techniques for artificial bandwidth extension
    • General signal processing methods
    • Source-filter model based ABWE
    • Feature extraction

Consequently, the properties of the speech production mechanism and of speech signals can be exploited in the bandwidth extension task. It has also been shown that the inclusion of memory in the estimation technique improves the reliability of the higher-band estimate [90]. For example, the higher-band excitation can be constructed in the frequency domain by repeatedly copying a sub-band of the telephone-band spectrum to the higher band [104, 105].
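A minimal sketch of this spectral-copying idea is given below: a sub-band of the telephone-band magnitude spectrum is tiled repeatedly into the missing high band. The band edges and FFT size are illustrative assumptions, not the settings used in [104, 105].

```python
import numpy as np

def extend_excitation_spectrum(nb_mag, fs_wb=16000, nb_edge_hz=3400,
                               copy_lo_hz=1000, copy_hi_hz=3400):
    """Fill the high band of a one-sided magnitude spectrum by repeatedly
    copying a sub-band of the telephone band (illustrative parameters).

    `nb_mag` is the magnitude spectrum of an interpolated narrowband frame
    sampled at `fs_wb`.
    """
    n_bins = len(nb_mag)
    hz_per_bin = (fs_wb / 2.0) / (n_bins - 1)
    nb_edge = int(nb_edge_hz / hz_per_bin)
    lo, hi = int(copy_lo_hz / hz_per_bin), int(copy_hi_hz / hz_per_bin)
    source = nb_mag[lo:hi]
    ext = nb_mag.copy()
    pos = nb_edge
    while pos < n_bins:                      # tile the source band upwards
        n = min(len(source), n_bins - pos)
        ext[pos:pos + n] = source[:n]
        pos += n
    return ext

# Example with a random stand-in spectrum (257 bins of a 512-point FFT)
mag = np.abs(np.random.randn(257))
wb_mag = extend_excitation_spectrum(mag)
```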

Figure 2.2: Frequency bands in the bandwidth extension of telephone speech.

ABWE for children’s speech recognition

Previous attempts to improve children's speech recognition under mismatched conditions focused on minimizing the variability due to pitch [1, 53–55]. In those approaches, regardless of whether the given speech is narrowband or wideband, pitch modification is used mainly to minimize the effects of pitch.

Figure 2.4: Plots showing mean along with variance (in bar) for MFCC (C1–C12) of signals of different pitch groups: 75–100 Hz and 200–250 Hz (left panel), and 200–250 Hz and its version transformed to 140–175 Hz (right panel), for vowels (a) /ae/ (b) /iy/ (c) /a

Organization of the Present Work

  • TIDIGITS corpus
  • Design of ASR studies
  • Feature extraction
  • Digits models
  • Performance Evaluation

A standard VTLN approach is implemented and both adults' and children's speech are subjected to the VTLN process. The first step is to understand the importance of ABWE for speech recognition, especially children's speech recognition under the mismatched condition. The very poor WER indicates a significant acoustic mismatch between adults' and children's speech.

Figure 2.5: Linear predictive coding (LPC) spectra of the vowel /aa/ for an adult and a child speaker, along with the warped spectrum of the child, for (a) the narrowband speech case, (b) the wideband speech case

Proposed Spectral Loss Compensation for ASR using ABWE

  • Selective linear prediction
  • Gaussian mixture model
  • Spectral envelope reconstruction
  • ASR Study using ABWE

It consists of the logarithm of the ratio of narrowband (NB) to highband (HB) energy, and the remaining ten coefficients are selective linear predictive cepstral coefficients (SLPCC) calculated from the spectral information of the highband (3.4–8.0 kHz) signal, as described in [100]. In this thesis, we exploit the possibility of deriving cepstral coefficients from the LP coefficients. Let X ∈ R^k be the feature vector of the narrowband speech and Y ∈ R^l the feature vector of the highband speech.
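The cepstral coefficients can be obtained directly from the LP coefficients via the well-known recursion below, stated here for the all-pole model H(z) = G / (1 − Σ_k a_k z^{−k}); the exact sign convention followed in [100] may differ.

$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad 1 \le m \le p, \qquad\qquad c_m = \sum_{k=m-p}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad m > p$$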

Figure 3.1: Block diagram of the source-filter based generic ABWE algorithm.

ASR using VTLN

VTLN-normalized narrowband children's speech tested against narrowband adults' models gives a WER of 1.64%. Next, VTLN-normalized wideband children's speech tested against wideband adults' models gives a WER of 0.77%, versus 3.21% for the non-normalized case. ABWE followed by VTLN normalization of narrowband children's speech, tested against wideband adults' models, gives a WER of 1.17%.

Table 3.3: Recognition performances for adults' speech (AD) and children's speech (CH) test sets having narrowband (NB), wideband (WB) and artificial bandwidth extended (ABWE) speech data

ASR using Truncation of MFCC Features

Combining ABWE, VTLN and Cepstral Truncation Approaches

ABWE and VTLN

The standalone performance of VTLN is even better than that of ABWE. In the case of VTLN, normalization is performed only within the same band, whereas in ABWE the missing information in the high-frequency band is reconstructed.

ABWE and Cepstral Truncation

ABWE, VTLN and Cepstral Truncation

These highlighted values are averaged to give 0.50% for the adults' case, a marginal improvement over the previous results of 0.61% and 0.58%. The highlighted values in each column represent the best number for a given warp factor. Similarly, the highlighted values for the children's case average to 0.97%, a significant improvement over 1.06%.

Table 3.6: Performance for the adults' test set on models trained on the adults' speech data set for various truncations of base MFCC features along with their VTLN warp factor-wise breakup. Further, the respective utterances' MFCC are warped to respective b

Summary

  • Mutual Information (I)
  • Differential Entropy (H)
  • Ratio Measure (R_IH)
  • Separability (ε)

Thus, cepstral truncation is effective for recognizing children's speech under mismatched conditions. Consequently, a benefit can be achieved by exploiting age-specific information in the case of children's speech. The separability ε(x) for children's speech is computed using the global ABWE and the age-specific transformation with the ∆ features. A half-window size of Θ = 13 is chosen for the calculation.

Table 3.8: Performance for the adults' test set on models trained on the adults' speech data set for various truncations of base MFCC features along with their VTLN warp factor-wise breakup.

ABWE using Class-Specific Information

Derivation of Class Information for ABWE Transformation

  • Unsupervised Classification
  • Supervised Classification

...S_M^Z} associated with a sequence of features Z. The correspondence between the S^Z states and the Z_m feature vectors forms a classification similar to vector quantization, which also benefits from temporal correlation due to the nature of the HMM structure. The following procedure is used to derive the class information for a narrowband speech signal whose bandwidth is to be extended: i) the NB features X are extracted from the given narrowband speech; ii) the class information is derived from these features; based on the obtained class information, the appropriate GMM-based ABWE transform is used to estimate the HB features.

ASR using Class-Specific Information based ABWE

  • Results and Discussion

In supervised classification, the true word-level information is used both to learn the class-specific ABWE transform and to bandwidth-extend the narrowband speech signal. However, it gives a useful benchmark for the possible performance improvement from class-specific ABWE transformations. The objective measures agree closely with the WER when class-specific information is used to model the relationship between the NB and HB cases.

Table 4.6: Performances of different ABWE systems developed for varying size of GMM in ABWE systems for children’s test set

Feature Domain MFCC based ABWE

Novel Feature domain ABWE modeling

These clustered energies are then converted into 15-dimensional MFCC features (C0–C14) for the LB and 6-dimensional MFCC features (C0–C5) for the HB by applying the discrete cosine transform (DCT). Finally, for representation purposes, the LB features are truncated to 10-dimensional MFCC features (C0–C9) while the HB features are kept as is. Given the LB MFCC features, the HB MFCC features are estimated using the minimum mean square error (MMSE) criterion, as originally proposed in [12].
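A minimal sketch of this conversion step is given below, assuming the clustered band energies are already available as log mel filter-bank outputs. The filter counts per band are illustrative assumptions; only the MFCC dimensionalities (15 for LB, 6 for HB) are taken from the text.

```python
import numpy as np
from scipy.fft import dct

def band_mfcc(log_mel_energies, n_coeffs):
    """Convert log mel filter-bank energies of a band into MFCCs (C0..Cn-1)
    via the DCT, mirroring the feature-domain representation described above."""
    return dct(log_mel_energies, type=2, norm="ortho", axis=-1)[..., :n_coeffs]

# Illustrative stand-in energies: 21 LB filters and 6 HB filters per frame
lb_logE = np.log(np.random.rand(100, 21) + 1e-8)   # 100 frames, low band
hb_logE = np.log(np.random.rand(100, 6) + 1e-8)    # 100 frames, high band
lb_mfcc = band_mfcc(lb_logE, 15)   # C0-C14 for the low band
hb_mfcc = band_mfcc(hb_logE, 6)    # C0-C5 for the high band
```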

Efficient derivation of extended wideband MFCC

The detailed block diagram of the extraction of LB and WB features from the wideband development data for ABWE modeling is shown in Figure 4.2(a).

Figure 4.2: The detailed block diagrams of the proposed and the default (speech domain) MFCC based ABWE approaches

Delta Features and Age Information for MFCC based ABWE

Inclusion of Delta Features in ABWE

For this study, the memory involved is controlled by varying the half-window length Θ from 1 to 15 in steps of 2.
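The delta features are assumed here to follow the usual regression formula over a half-window of length Θ:

$$\Delta c_t = \frac{\sum_{\theta=1}^{\Theta} \theta \,\big(c_{t+\theta} - c_{t-\theta}\big)}{2\sum_{\theta=1}^{\Theta} \theta^{2}}$$

so that increasing Θ from 1 to 15 widens the temporal context (memory) available to the ABWE mapping.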

Age-specific conditioning in ABWE

  • ABWE models
  • ASR system

For training the different types of ABWE models, the children's data from TIDIGITS is used, which is mutually exclusive with the children's test set. The bandwidth extension performances of the children's test set using the ABWE-GT and ABWE-AG approaches, when tested on WB adults' speech-trained ASR models, are given in the bottom row of Table 4.10 along with those of the original WB and NB children's speech cases. This result only provides an assessment of the sensitivity of ABWE modeling to the high variability in children's speech.

Table 4.10: Performances for using ABWE-GT and ABWE-AG on children’s test along with age-wise breakup.

Estimation of Age-specific information

Summary

Creation of dictionary for sparse representation

There are two types of dictionaries that are commonly used for sparse representation. In contrast to a predefined (analytic) dictionary, the learned dictionary is derived from the data so as to produce a sparse representation. In this work, we use the KSVD [132] algorithm to create a learned overcomplete (redundant) dictionary for sparse representation.
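The sketch below illustrates learning an overcomplete dictionary and sparse codes from stand-in data. Since K-SVD is not available in scikit-learn, DictionaryLearning is used here purely as a stand-in for the dictionary-learning step; the actual system uses the KSVD algorithm [132], and all sizes and sparsity levels shown are assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Stand-in training matrix: one feature/spectrum vector per row.
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((500, 64)))     # 500 frames, 64-dim frames

# Learn an overcomplete dictionary (128 atoms > 64 dimensions).
dl = DictionaryLearning(
    n_components=128,                 # number of atoms (illustrative)
    transform_algorithm="omp",        # sparse coding by orthogonal matching pursuit
    transform_n_nonzero_coefs=10,     # sparsity level per frame (illustrative)
    max_iter=20,
    random_state=0,
)
codes = dl.fit_transform(X)           # sparse codes, shape (500, 128)
D = dl.components_                    # learned dictionary, shape (128, 64)
```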

ABWE using sparse representation

Proposed SR-ABWE approach

For the single-dictionary case, the bandwidth-extended magnitude spectra of an NBI frame obtained using the proposed ABWE approach for a voiced (/aa/) and an unvoiced (/s/) frame are shown in Figure 5.1(a) and Figure 5.1(c), respectively. The corresponding sparse codes for the NBI/WB data are shown in Figure 5.1(b) and Figure 5.1(d). To confirm this fact, the extended spectra of the NBI frame obtained by sparse coding with the WB dictionary are also given in Figure 5.1(a) and Figure 5.1(c) for the voiced and unvoiced cases, respectively.

Figure 5.1: Panels (a) and (c) show the reconstructed spectra using the proposed ABWE approach for a voiced (/aa/) and an unvoiced (/s/) frame of speech, respectively

Enhancements in proposed SR-ABWE approach

Linear transformation of NBI sparse coefficients

As a first step, we explored a linear transformation to address the significant sparse-coding differences between the NBI and WB cases. Adapting the NBI sparse codes of unvoiced frames is expected to produce better HB information than the default case. To evaluate its effectiveness, the spectral profiles of the atoms included in the WB sparse code, the NBI sparse code, and the linearly transformed NBI sparse code for an unvoiced frame are shown in Figures 5.4(a), 5.4(b) and 5.4(c), respectively.

Lookup constrained linear transformation

Let W_NBI and W_WB denote the sparse code matrices for the NBI and WB instances of the unvoiced frames in the training data; then a least-squares (LS) based linear transformation T_LS is estimated as

T_LS = W_WB W_NBI^T (W_NBI W_NBI^T)^(-1)    (5.4)

At test time, the adapted sparse code matrix, given the NBI sparse code matrix, is derived as W_adapted = T_LS W_NBI. We note that after the linear transformation of the NBI sparse codes, the selected atoms are found to possess somewhat more HB information.
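A small numerical sketch of Eq. (5.4) is given below, using random stand-in code matrices; the small ridge term added for numerical stability is an extra assumption not present in the equation.

```python
import numpy as np

def ls_code_transform(W_nbi, W_wb, reg=1e-6):
    """Least-squares mapping from NBI sparse codes to WB sparse codes,
    i.e. Eq. (5.4): T = W_wb W_nbi^T (W_nbi W_nbi^T)^-1.
    Columns of W_nbi / W_wb are per-frame sparse code vectors."""
    G = W_nbi @ W_nbi.T + reg * np.eye(W_nbi.shape[0])
    return W_wb @ W_nbi.T @ np.linalg.inv(G)

# Illustrative stand-in code matrices: 128 atoms, 400 unvoiced training frames
rng = np.random.default_rng(1)
W_nbi = rng.standard_normal((128, 400))
W_wb = rng.standard_normal((128, 400))
T_ls = ls_code_transform(W_nbi, W_wb)       # shape (128, 128)
W_adapted = T_ls @ W_nbi                    # adapted NBI sparse codes
```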

Figure 5.4: Plots showing the spectral profile of the atoms involved in the sparse representation of an unvoiced frame for (a) WB sparse code (b) NBI sparse code (c) top 40 of the linear transformed NBI sparse code and (d) lookup table based WB sparse code

Semi-Coupled Dictionary based ABWE

Semi-coupled dictionary algorithm

In this work, we discuss the creation of semi-coupled (SC) dictionaries in the context of ABWE. The goal is to minimize the energy function below to find the desired semi-coupled dictionaries and the desired mapping function.
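The energy function itself is not reproduced in this excerpt. In the standard semi-coupled dictionary learning formulation it takes approximately the following form (shown for reference; the thesis's exact regularization terms may differ):

$$\min_{D_x, D_y, W, \Lambda_x, \Lambda_y} \;\; \|X - D_x\Lambda_x\|_F^2 + \|Y - D_y\Lambda_y\|_F^2 + \gamma \|\Lambda_y - W\Lambda_x\|_F^2 + \lambda_x\|\Lambda_x\|_1 + \lambda_y\|\Lambda_y\|_1 + \lambda_W\|W\|_F^2$$

subject to unit-norm constraints on the dictionary atoms, where X and Y are the training data in the two spaces, Λx and Λy their sparse codes, and W the mapping between the codes.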

Training

Note that the mapping through W is assumed to be linear, and the bidirectional transformation learning strategy can be applied to simultaneously learn the transformations from Λx to Λy and from Λy to Λx. With SCDL, the dictionary pair Dx and Dy can be learned such that the sparse coding coefficients of the two spaces have stable bidirectional linear transformations. Starting from an initial dictionary pair Dx and Dy and initial mappings Wx and Wy, the algorithm iterates: i) fixing the other variables, update Λx and Λy by sparse coding; ii) fixing the other variables, update Dx and Dy; iii) fixing the other variables, update Wx and Wy.

Figure 5.6: The complete block diagram of the proposed sparse representation based ABWE approach.

Synthesis

Clustering based SCDL ABWE

Experimental Setup and Performance Measures

Experimental Results and Discussion

Application of Sparse Representation based ABWE in Children’s Speech ASR

System parameter tuning

For the proposed subclass dictionary learning approach, the number of atoms in each subclass dictionary is experimentally tuned to 20 and the number of clusters in each of the broad speech classes to 64. The number of iterations for learning each subclass dictionary is kept the same as in the previous single-cluster approach. In this approach, all atoms of the subclass dictionary are considered for subclass dictionary learning and sparse coding.

Results and discussion

Summary

The work began with the goal of developing methods to improve children's speech recognition under mismatched conditions. Previous attempts to improve children's speech recognition under mismatched conditions have involved VTLN and reducing the pitch mismatch. This shows that the pitch and vocal tract length mismatch between children's and adults' speech is significantly high.

Major Contributions

Future Work

Kabal, “A memory-based approximation of a Gaussian mixture model framework for narrowband speech bandwidth expansion,” in Proc. Vary, “An upper bound on the quality of artificial bandwidth extension of narrowband speech signals,” in Proc. Kabal, “An objective analysis of the effect of memory inclusion on narrowband speech bandwidth expansion,” in Proc.

Figures

Figure 1.1: Differences in the nature of speech signals of a child (7 years old) and adult male for the English utterance ‘one three five seven’
Figure 1.2: The effect of audio bandwidth on the quality and intelligibility of speech
Figure 1.3: Effect of bandwidth on ASR performance on children’s and adults’ speech. (A) PSR corpus and (B) PF-STAR children’s corpus
Figure 1.4: Two state speech production model depicting source-filter nature of approximation.
