Organization of the Thesis

The organization of the rest of this thesis is as follows:

Chapter 2 describes the methods from the literature that are used in this thesis for the various experiments and analyses. It discusses the techniques used to modify various acoustic correlates of speech in a signal and reviews the standard speaker normalization and model adaptation techniques used in speech technology. It also gives the details of the speech corpora and the experimental setup used for training and testing the speech recognition systems.

Chapter 3 explores the acoustic sources of mismatch between adults’ and children’s speech that affect children’s speech recognition on adults’ speech trained models. The impact of variations in these sources of mismatch is studied on the most commonly used MFCC features and on the automatic speech recognition models. The relative significance of each source is then explored in a consistent setup. The acoustic sources of mismatch addressed in this study are the formant frequencies, the pitch, the speaking rate and the glottal flow parameters (open quotient, return quotient and speed quotient).

Chapter 4 studies the roles of the filterbank and of cepstral truncation in removing pitch-related information from the speech spectrum, for both the uniform filterbank based and the non-uniform Mel filterbank based spectra and their corresponding cepstra. The Mel cepstra of signals with different pitch are then examined to study the cause and the nature of the observed effect of pitch on MFCC features.
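As a rough illustration of why cepstral truncation can suppress pitch-related detail, the sketch below takes a toy log spectrum made of a slowly varying envelope plus a fast ripple (standing in for pitch harmonics), computes its DCT-based cepstrum, zeroes the higher-order coefficients and transforms back. The channel count (64), the number of retained coefficients (13) and the function names are assumptions chosen for the example; they are not the settings or the analysis procedure of this thesis.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a sequence (log spectrum -> cepstrum-like coefficients)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for k in range(1, n)
        )
        out.append(s)
    return out

def truncate_cepstrum(log_spectrum, keep):
    """Keep the first `keep` coefficients, zero the rest, and return the smoothed log spectrum."""
    c = dct2(log_spectrum)
    truncated = c[:keep] + [0.0] * (len(c) - keep)
    return idct2(truncated)

# Toy input (hypothetical values): slow envelope plus a fast ripple.
n = 64
envelope = [math.cos(math.pi * 2 * (2 * i + 1) / (2 * n)) for i in range(n)]
ripple = [0.5 * math.cos(math.pi * 20 * (2 * i + 1) / (2 * n)) for i in range(n)]
log_spectrum = [e + r for e, r in zip(envelope, ripple)]

# Retaining 13 coefficients keeps the envelope and discards the ripple.
smoothed = truncate_cepstrum(log_spectrum, keep=13)
```

In this toy case the ripple lies entirely in a coefficient above the cut-off, so truncation removes it completely. In real speech the separation between envelope and pitch harmonics is not this clean, and the extent to which it fails is the question the chapter examines.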

Chapter 5 explores, for children’s speech recognition, the pitch robustness of salient features that have been reported in the literature to perform comparably to or better than MFCC for adults’ speech recognition. The features studied in this work are the PLPCC and the PMVDR features. The effect of pitch variations across speech signals on these features is studied in comparison to that on MFCC features.

An algorithm for normalizing the pitch differences across speech signals during MFCC feature extraction is proposed for children’s ASR in Chapter 6. The algorithm modifies the Mel filterbank structure during MFCC feature extraction for each test speech signal based on the average pitch of the test signal. The efficacy of the proposed pitch normalization algorithm is also studied in combination with the existing speaker normalization and model adaptation techniques for children’s ASR on adults’ speech trained models.
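For orientation, a conventional Mel filterbank places its filter centres uniformly on the Mel scale, which makes them progressively wider apart in Hz. The sketch below computes such centre frequencies using the common 2595·log10(1 + f/700) form of the Mel mapping. It shows only the standard, unmodified construction; the pitch-dependent modification proposed in Chapter 6 is not described in this section and is not reproduced here. The filter count (21) and the frequency range (0 to 4000 Hz) are hypothetical values chosen for the example.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the Mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(num_filters, f_low_hz, f_high_hz):
    """Centre frequencies (Hz) of filters spaced uniformly on the Mel scale.

    The band edges f_low_hz and f_high_hz are not themselves centres; the
    centres are the num_filters interior points of a uniform Mel grid.
    """
    mel_low = hz_to_mel(f_low_hz)
    mel_high = hz_to_mel(f_high_hz)
    step = (mel_high - mel_low) / (num_filters + 1)
    return [mel_to_hz(mel_low + step * (m + 1)) for m in range(num_filters)]

# Hypothetical configuration: 21 filters covering 0 to 4000 Hz.
centers = mel_center_frequencies(21, 0.0, 4000.0)
```

The spacing between neighbouring centres grows with frequency, so the low-frequency region, where the pitch harmonics of high-pitched speech are widely separated relative to the filter widths, is covered by the narrowest filters.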

In Chapter 7, MFCC feature truncation is explored for pitch mismatch reduction for children’s speech recognition on adults’ speech trained models. Based on the observations, an automatic algorithm is proposed for pitch mismatch reduction. The algorithm selects the length of the base MFCC features for recognition of each test speech signal to address its pitch mismatch with respect to the speech recognition models, without any prior knowledge about the speaker of the test utterance. The efficacy of the proposed algorithm is also explored in combination with the existing speaker normalization and model adaptation techniques for children’s ASR on adults’ speech trained models.

Finally, Chapter 8 summarizes the work presented in this thesis, highlights the main contributions of the work and gives some directions for future research. Note that all speech recognition evaluations in this thesis are done on both connected digit recognition and continuous speech recognition tasks.

2 Speech Corpora and Experimental Setups

Contents

2.1 Introduction
2.2 Speech Corpora
2.3 Speech Recognition Systems
2.4 Methods for Transformation of Acoustic Correlates of Speech
2.5 Model Adaptation Techniques
2.6 Summary

2.1 Introduction

This chapter describes the speech corpora used in this thesis for conducting ASR experiments on both the connected digit recognition task and the continuous speech recognition task. It also gives the details of the training and testing of the speech recognition systems used for these two tasks.

Various techniques have been proposed in the literature for addressing the acoustic mismatch between adults’ and children’s speech in order to improve children’s ASR performance on adults’ speech trained models. In this thesis, to determine the relative significance of each of the different acoustic sources of mismatch for children’s ASR on adults’ speech trained models, those acoustic parameters are transformed using methods reported in the literature. Also, the consistency and efficacy of the pitch normalization approaches proposed in this thesis are explored in conjunction with various existing speaker normalization and model adaptation techniques. This chapter therefore also briefly reviews all the methods from the literature that are used in this thesis, along with the parameter settings used for the experimental work.

The chapter is organized as follows: Section 2.2 presents the details of the speech corpora, and Section 2.3 gives the details of the speech recognition systems used in this thesis. Section 2.4 briefly reviews the methods used for transforming various acoustic correlates of speech in the signal domain and in the feature domain. Section 2.5 explains the model adaptation techniques used in this work for validating the efficacy of the proposed techniques. Finally, the chapter is summarized in Section 2.6.
