Separability ε(x) is computed for children's speech using the global ABWE and the age-specific transformation with ∆, with a half-window size of Θ = 13. The effectiveness of the developed ABWE methods is demonstrated in the context of children's speech recognition under mismatched conditions.
Nature of children’s and adults’ speech signals
The narrowband spectrograms given in Figures 1.1(e) and 1.1(f) illustrate the high pitch values present in children's speech. The peak-to-valley spectral dynamics of children's speech are about twice those of adults' speech.
Importance of bandwidth for speech recognition
The passband is determined by the lower (fl) and upper (fh) cutoff frequencies of the bandpass filter. b) The syllable articulation of low-pass and high-pass filtered signals is shown as a function of the cutoff frequency. Both studies conclude that greater bandwidth is also needed for the automatic speech recognition task, especially children's speech recognition.
Artificial bandwidth extension for speech recognition
The narrowband vocal tract information undergoes a process of approximate reconstruction of the missing vocal tract information in the 3400-8000 Hz range. During training, the relationship between narrowband and wideband vocal tract information is learned in the form of a mapping function.
Organization of thesis
Differences in the linguistic correlates of children's and adults' speech are given in section 2.4. The specific requirements of ABWE suitable for recognizing children's speech under mismatch conditions are highlighted in section 2.9.
Children’s and adults’ speech production systems
Acoustic mismatch differences in children’s and adults’ speech
The bandwidth contribution due to glottal resistance (Bg) is directly proportional to the surface area of the glottis. The bandwidth contribution due to the glottal resistance effect is greater in children than in adults.
Linguistic correlates of children’s and adults’ speech
Disfluencies decline with age, and children reach adult proficiency level at around 12-13 years of age (slightly earlier for boys than girls) [48]. Thus, children's ability to use language effectively to convey a message improves with age.
Short-term features of children’s and adults’ speech
Thus, this study shows that the LP coefficients are also affected in the case of children's speech due to high pitch values. However, this study focused only on minimizing the effect of high pitch values in children's speech.
Approaches for improving ASR under mismatched condition
Vocal tract length normalization (VTLN) for improving ASR performance
VTLN is a speaker normalization method in which the acoustic variability between speakers due to varying vocal tract lengths, i.e. the discrepancy due to differences in formant frequencies among speakers, is reduced by warping the frequency axis of each speaker's speech spectrum [70, 71]. In practice, the spacing and width of the filters in the Mel filter bank are changed while the speech spectrum itself is kept unchanged.
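The warping itself is typically piecewise linear. The sketch below illustrates one common form: frequencies below a breakpoint are scaled by a warp factor α, and a second linear segment maps the Nyquist frequency onto itself. The breakpoint ratio and α values here are illustrative assumptions, not parameters taken from this work.

```python
import numpy as np

def vtln_warp(freqs, alpha, f_nyq=4000.0, f_break_ratio=0.875):
    """Piecewise-linear VTLN frequency warping (illustrative sketch).
    Frequencies below the breakpoint are scaled by alpha; above it, a
    second linear segment maps the Nyquist frequency onto itself."""
    f_break = f_break_ratio * f_nyq
    freqs = np.asarray(freqs, dtype=float)
    warped = np.where(
        freqs <= f_break,
        alpha * freqs,
        alpha * f_break + (f_nyq - alpha * f_break)
        * (freqs - f_break) / (f_nyq - f_break),
    )
    return warped

# Warping the centre frequencies of the Mel filter bank instead of the
# spectrum itself is equivalent and cheaper:
centres = np.linspace(100, 3900, 20)
warped = vtln_warp(centres, alpha=0.88)
```

A warp factor α < 1 compresses the formant frequencies, which is the direction needed when adapting children's speech toward adult-trained models.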
Maximum likelihood linear regression (MLLR)
- MLLR-MEAN
- MLLR-COV
- Constrained MLLR (CMLLR)
The adaptation method in which the linear transformations are applied only to the variances of the models is referred to as 'MLLR-COV'. The transformation matrix H thus provides a new estimate of the adapted covariance as Σ̂ = H Σ H^T.
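A minimal numpy sketch of the two adaptation styles, with random illustrative transforms (in practice A, b and H are estimated by maximum likelihood from adaptation data):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                 # feature dimension (illustrative)
mu = rng.standard_normal(d)           # a Gaussian mean from the model
Sigma = np.diag(rng.random(d) + 0.5)  # its (diagonal) covariance

# MLLR-MEAN: the mean is adapted with an affine transform, mu_hat = A mu + b.
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))
b = 0.1 * rng.standard_normal(d)
mu_hat = A @ mu + b

# MLLR-COV: only the covariance is transformed, Sigma_hat = H Sigma H^T.
H = np.eye(d) + 0.1 * rng.standard_normal((d, d))
Sigma_hat = H @ Sigma @ H.T
```

Note that the MLLR-COV form guarantees that the adapted covariance remains symmetric and positive semi-definite.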
Minimizing pitch mismatch for Improving ASR Performance
The proposed algorithm automatically selects the appropriate length of the MFCC basis functions for each test signal without prior knowledge of the speaker of the test utterance. Significant improvements in children's speech recognition have been achieved using the proposed algorithms on models trained on adults' speech.
Need for ABWE for ASR under mismatched condition
Also, a method based on mel cepstral truncation is proposed for reducing the pitch mismatch between training and test data.
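In essence, cepstral truncation discards the higher-order coefficients, which carry the pitch-sensitive fine spectral detail. A minimal sketch (the dimensions below are illustrative, not the thesis settings):

```python
import numpy as np

def truncate_mfcc(mfcc, keep):
    """Keep only the first `keep` cepstral coefficients per frame.
    mfcc: (num_frames, num_coeffs) array. The discarded higher-order
    coefficients carry the fine spectral detail that is most affected
    by the adult/child pitch mismatch."""
    return mfcc[:, :keep]

frames = np.random.randn(100, 13)          # 13-dim MFCCs (illustrative)
truncated = truncate_mfcc(frames, keep=10)  # retain C0-C9 only
```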
Artificial bandwidth extension: A Review
- Motivation for ABWE
- Frequency bands
- Correlation between frequency bands of speech
- Speech bandwidth extension with side information
- Different techniques for artificial bandwidth extension
- General signal processing methods
- Source-filter model based ABWE
- Feature extraction
Consequently, the properties of the speech production mechanism and of speech signals can be exploited in the bandwidth extension task. It has also been shown that the inclusion of memory in the estimation technique improves the accuracy of the higher-band estimate [90]. For example, the higher-band excitation can be constructed in the frequency domain by repeatedly copying a subband of the telephone-band spectrum to the higher band [104, 105].
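A rough Python sketch of such frequency-domain spectral copying; the choice of source sub-band (the upper half of the narrowband spectrum) is an assumption for illustration, not the exact scheme of [104, 105]:

```python
import numpy as np

def extend_excitation(nb_spectrum, n_wb_bins):
    """Construct a higher-band excitation spectrum by repeatedly
    copying a sub-band of the telephone-band magnitude spectrum
    upwards (spectral copying sketch)."""
    nb_spectrum = np.asarray(nb_spectrum, dtype=float)
    n_nb = len(nb_spectrum)
    src = nb_spectrum[n_nb // 2:]      # upper half of the NB band
    wb = np.empty(n_wb_bins)
    wb[:n_nb] = nb_spectrum            # keep the NB part as is
    pos = n_nb
    while pos < n_wb_bins:             # tile the sub-band upwards
        take = min(len(src), n_wb_bins - pos)
        wb[pos:pos + take] = src[:take]
        pos += take
    return wb

nb = np.abs(np.fft.rfft(np.random.randn(256)))[:128]  # NB magnitude bins
wb = extend_excitation(nb, 256)
```

Because the excitation is roughly spectrally flat, this crude copy is usually acceptable; the estimated spectral envelope then shapes the result.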
ABWE for children’s speech recognition
Previous attempts to improve children's speech recognition under mismatched conditions focused on minimizing the variabilities due to pitch [1, 53-55]. Regardless of whether the given speech is narrowband or wideband, these pitch-modification-based approaches concentrate mainly on minimizing the effects of pitch.
Organization of the Present Work
- TIDIGITS corpus
- Design of ASR studies
- Feature extraction
- Digits models
- Performance Evaluation
A standard VTLN approach is implemented, and both adults' and children's speech are subjected to the VTLN process. The first step is to understand the importance of ABWE for speech recognition, especially children's speech recognition under the mismatched condition. The very poor WER indicates a significant acoustic mismatch between adults' and children's speech.
Proposed Spectral Loss Compensation for ASR using ABWE
- Selective linear prediction
- Gaussian mixture model
- Spectral envelope reconstruction
- ASR Study using ABWE
It consists of the logarithm of the ratio of narrowband (nb) to highband (hb) energy, while the remaining ten coefficients are selective linear predictive cepstral coefficients (SLPCC) calculated from the spectral information of the highband (3.4–8.0 kHz) signal, as described in [100]. In this thesis, we take advantage of the possibility of deriving cepstral coefficients from LP coefficients. Let X ∈ R^k be the feature vector of narrowband speech and Y ∈ R^l the feature vector of highband speech.
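The LP-to-cepstrum conversion mentioned above follows the standard recursion; a small sketch, using the convention A(z) = 1 − Σ a_k z^(−k) for the prediction polynomial:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """LP coefficients -> LP-derived cepstral coefficients via the
    standard recursion (convention A(z) = 1 - sum_k a[k] z^-k).
    Returns c[1..n_ceps]."""
    p = len(a)
    c = np.zeros(n_ceps + 1)  # c[0] (gain term) not used here
    for n in range(1, n_ceps + 1):
        acc = sum((k / n) * c[k] * a[n - k - 1]
                  for k in range(max(1, n - p), n))
        c[n] = (a[n - 1] if n <= p else 0.0) + acc
    return c[1:]

# One-pole sanity check: a = [0.5] gives c_n = 0.5**n / n exactly.
ceps = lpc_to_cepstrum(np.array([0.5]), 5)
```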
ASR using VTLN
VTLN-normalized narrowband children's speech tested against narrowband adults' models gives a WER of 1.64%. Next, VTLN-normalized wideband children's speech tested against wideband adults' models gives a WER of 0.77%, versus 3.21% for the un-normalized case. ABWE followed by VTLN-normalized narrowband children's speech tested against wideband adults' models gives a WER of 1.17%.
ASR using Truncation of MFCC Features
Combining ABWE, VTLN and Cepstral Truncation Approaches
ABWE and VTLN
The standalone performance of VTLN is even better than that of ABWE. In VTLN, normalization is performed only within the given band, whereas in ABWE the missing high-frequency information is reconstructed.
ABWE and Cepstral Truncation
ABWE, VTLN and Cepstral Truncation
The highlighted values in each column represent the best number for a given warping factor. These highlighted values are combined to give an average value of 0.50% for the adult case, a marginal improvement on the previous results of 0.61% and 0.58%. For the child case, they combine to give an average value of 0.97%, a significant improvement over 1.06%.
Summary
- Mutual Information (I)
- Differential Entropy (H)
- Ratio Measure (R IH )
- Separability (ε)
Thus, cepstral truncation is effective for recognizing children's speech under mismatched conditions. Consequently, a benefit can be achieved by exploiting age-specific information in the case of children's speech. Separability ε(x) is computed for children's speech using the global ABWE and the age-specific transformation with ∆; a half-window size of Θ = 13 is chosen for the calculation.
ABWE using Class-Specific Information
Derivation of Class Information for ABWE Transformation
- Unsupervised Classification
- Supervised Classification
the HMM state sequence associated with a sequence of features Z. The correspondence between the states and the feature vectors forms a classification similar to vector quantization, which also benefits from temporal correlation owing to the nature of the HMM structure. The following procedure is used to derive the class information for a narrowband speech signal whose bandwidth is to be expanded: i) NB features X are extracted from the given narrowband speech; ii) the class information is obtained from these features; iii) based on the obtained class information, the appropriate GMM-based ABWE transform is used to estimate the HB features.
ASR using Class-Specific Information based ABWE
- Results and Discussion
In supervised classification, the true word level information is used to both learn the class-specific ABWE transform and to bandwidth expand the narrowband speech signal. But it gives us a useful benchmark for the possible performance improvement using class-specific ABWE transformations. The objective measures agree closely with the WER when class-specific information is used to model information between NB and HB cases.
Feature Domain MFCC based ABWE
Novel Feature domain ABWE modeling
These clustered energies are then converted into 15-dimensional MFCC features (C0-C14) for the LB and 6-dimensional MFCC features (C0-C5) for the HB by taking the discrete cosine transform (DCT). Finally, for representation purposes, the LB features are truncated to 10-dimensional MFCC features (C0-C9), while the HB features are kept as is. Given the LB MFCC features, the HB MFCC features are determined using the minimum mean square error (MMSE) criterion, as originally proposed in [12].
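A minimal sketch of this feature derivation: a DCT-II of the log filter-bank energies followed by truncation. The number of filter-bank bands used here is an illustrative assumption.

```python
import numpy as np

def mfcc_from_energies(log_energies, n_keep):
    """DCT-II of log filter-bank energies, truncated to the first
    n_keep coefficients (C0..C{n_keep-1}). The dimensions mirror the
    text (15 LB / 6 HB coefficients) but are parameters here."""
    N = len(log_energies)
    n = np.arange(N)
    # Unnormalised DCT-II basis: row k is cos(pi * k * (2n+1) / (2N)).
    basis = np.cos(np.pi * np.outer(np.arange(N), 2 * n + 1) / (2 * N))
    coeffs = basis @ log_energies
    return coeffs[:n_keep]

lb_logE = np.log(np.random.rand(24) + 1e-3)  # 24 LB bands (illustrative)
lb_mfcc = mfcc_from_energies(lb_logE, 15)    # C0-C14
lb_mfcc_trunc = lb_mfcc[:10]                 # truncated to C0-C9
```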
Efficient derivation of extended wideband MFCC
The detailed block diagram of extracting LB and WB features from the wideband development data for ABWE modeling is shown in Figure 4.2(a).
Delta Features and Age Information for MFCC based ABWE
Inclusion of Delta Features in ABWE
For this study, the memory involved is controlled by varying the half-window length Θ from 1 to 15 in steps of 2.
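The delta features follow the standard regression formula d_t = Σ_{θ=1}^{Θ} θ (c_{t+θ} − c_{t−θ}) / (2 Σ_{θ=1}^{Θ} θ²); a sketch with edge padding at the utterance boundaries:

```python
import numpy as np

def delta_features(c, theta):
    """Regression-based delta features with half-window length theta
    (the study varies this from 1 to 15 in steps of 2).
    c: (num_frames, dim) static features, edge-padded at boundaries."""
    T = len(c)
    padded = np.pad(c, ((theta, theta), (0, 0)), mode="edge")
    denom = 2 * sum(t * t for t in range(1, theta + 1))
    d = np.zeros_like(c, dtype=float)
    for t in range(1, theta + 1):
        d += t * (padded[theta + t: theta + t + T]
                  - padded[theta - t: theta - t + T])
    return d / denom

feats = np.random.randn(50, 13)
deltas = delta_features(feats, theta=2)
```

A larger Θ incorporates longer-span temporal memory into each frame's representation, which is exactly the quantity varied in this study.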
Age-specific conditioning in ABWE
- ABWE models
- ASR system
For training the different types of ABWE models, the children's data from TI-DIGITS is used which is mutually exclusive to the children's test set. The bandwidth enhancement performances of children's test set using ABWE-GT and ABWE-AG approaches when tested on WB adults' speech-trained ASR models are given in the bottom row of Table 4.10 along with those of original WB and NB children's speech cases. This result only provides an assessment of the sensitivity of ABWE modeling to the high variability in children's speech.
Estimation of Age-specific information
Summary
Creation of dictionary for sparse representation
There are two types of dictionaries that are commonly used for sparse representation. In contrast to a predefined analytic dictionary, a learned dictionary is derived by processing the data so as to produce a sparse representation. In this work, we use the K-SVD algorithm [132] to create a learned redundant dictionary for sparse representation.
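The sparse-coding stage inside K-SVD is commonly solved with orthogonal matching pursuit (OMP); the numpy sketch below shows OMP only (the dictionary-update step of K-SVD is omitted), with an illustrative random dictionary:

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal Matching Pursuit: greedily select up to k atoms of
    the column-normalised dictionary D that best explain x, re-fitting
    the coefficients on the selected support at each step."""
    residual = x.copy()
    support = []
    for _ in range(k):
        # Atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least-squares re-fit on the selected support.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    w = np.zeros(D.shape[1])
    w[support] = coef
    return w

rng = np.random.default_rng(1)
D = rng.standard_normal((32, 64))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms (redundant: 64 > 32)
x = 0.7 * D[:, 3] - 1.2 * D[:, 40]    # a 2-sparse test signal
w = omp(D, x, k=2)
```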
ABWE using sparse representation
Proposed SR-ABWE approach
For the single-dictionary case, the bandwidth-extended magnitude spectra for the NBI frame obtained using the proposed ABWE approach for a voiced (/aa/) and an unvoiced (/s/) frame are shown in Figure 5.1(a) and Figure 5.1(c), respectively. The corresponding sparse codes for the NBI/WB data are shown in Figure 5.1(b) and Figure 5.1(d). To confirm this, the enhanced spectra for the NBI frame obtained by sparse coding with the WB dictionary are also given in Figure 5.1(a) and Figure 5.1(c) for the voiced and unvoiced cases, respectively.
Enhancements in proposed SR-ABWE approach
Linear transformation of NBI sparse coefficients
As a first step, we explored a linear transformation to address the significant sparse coding differences between the NBI and WB cases. Adapting the NBI sparse codes in this way is expected to produce better HB information than the default case. To evaluate its effectiveness, the spectral profiles of the atoms included in the sparse WB code, the sparse NBI code, and the linearly transformed sparse NBI code for an unvoiced frame are shown in Figures 5.4(a), 5.4(b) and 5.4(c), respectively.
Lookup constrained linear transformation
Let W_NBI and W_WB denote the sparse-code matrices for the NBI and WB instances of the unvoiced frames in the training data; a least-squares (LS) linear transformation T_LS is then estimated as T_LS = W_WB W_NBI^T (W_NBI W_NBI^T)^(-1). (5.4) For unseen cases, the adapted sparse-code matrix, given the target NBI sparse-code matrix, is derived as Ŵ_WB = T_LS W_NBI. We note that, after the linear transformation of the NBI sparse codes, the selected atoms possess somewhat more HB information.
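A numpy sketch of Eq. (5.4) on toy data; the matrix sizes and the synthetic ground-truth mapping are illustrative assumptions:

```python
import numpy as np

# Estimate the least-squares transform T_LS mapping NBI sparse codes
# to WB sparse codes, then apply it to new NBI codes.
rng = np.random.default_rng(2)
W_nbi = rng.standard_normal((40, 500))  # NBI sparse codes (training)
T_true = rng.standard_normal((40, 40))  # toy ground-truth mapping
W_wb = T_true @ W_nbi                   # corresponding WB codes

# Eq. (5.4): T_LS = W_WB W_NBI^T (W_NBI W_NBI^T)^{-1}
T_ls = W_wb @ W_nbi.T @ np.linalg.inv(W_nbi @ W_nbi.T)

W_nbi_test = rng.standard_normal((40, 10))
W_adapted = T_ls @ W_nbi_test           # adapted sparse codes
```

On this noiseless toy problem the LS estimate recovers the generating transform exactly; on real sparse codes it only approximates the NBI-to-WB relationship.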
Semi-Coupled Dictionary based ABWE
Semi-coupled dictionary algorithm
In this work, we discuss the creation of semi-coupled dictionaries in the context of ABWE. The goal is to minimize the energy function below to find the desired semi-coupled dictionaries and the desired mapping function.
Training
Note that the mapping through W is assumed to be linear, and the bidirectional transformation learning strategy can be applied to simultaneously learn the transformations from Λx to Λy and from Λy to Λx. With SCDL, the dictionary pair Dx and Dy can be learned such that the sparse coding coefficients of the two spaces have stable bidirectional linear transformations. Given an initial dictionary pair Dx and Dy and initial mappings Wx and Wy: i) fix the other variables and update Λx and Λy by sparse coding; ii) fix the other variables and update Dx and Dy; iii) fix the other variables and update Wx and Wy.
Synthesis
Clustering based SCDL ABWE
Experimental Setup and Performance Measures
Experimental Results and Discussion
Application of Sparse Representation based ABWE in Children’s Speech ASR
System parameter tuning
For the proposed subclass dictionary learning approach, we experimentally tuned the number of atoms in each subclass dictionary to 20 and the number of clusters in each of the broad speech classes to 64. The number of iterations for learning each subclass dictionary is kept the same as in the earlier single-cluster approach. In this approach, all atoms of the subclass dictionary are considered for both subclass dictionary learning and sparse coding.
Results and discussion
Summary
The work began with the goal of developing methods to improve children's speech recognition in mismatched conditions. Previous attempts to improve children's speech recognition under mismatch conditions have involved VTLN and reducing instances of pitch mismatch. This shows that the pitch and vocal tract length mismatch in children's speech is significantly high.
Major Contributions
Future Work
Kabal, “A memory-based approximation of a Gaussian mixture model framework for narrowband speech bandwidth expansion,” in Proc. Vary, “An upper bound on the quality of artificial bandwidth extension of narrowband speech signals,” in Proc. Kabal, “An objective analysis of the effect of memory inclusion on narrowband speech bandwidth expansion,” in Proc.