Chapter 1 INTRODUCTION - IDR

Copyright

IIT Kharagpur

INTRODUCTION

The rapid growth in the number of mobile users is creating a great deal of interest in the development of robust speech systems for the mobile environment. Some of the new and exciting services enabled by speech systems in the mobile environment are: speech interfaces to mobile devices, information retrieval through mobile devices, voice-based person authentication, and forensic investigation. The issues involved in adapting present speech processing technology to mobile systems are: the effect of varying background noise, degradations introduced by speech coders, and errors introduced by transmission impairments [1, 2, 3, 4]. In this work, the major focus is on improving the recognition performance of speech systems in the presence of speech coding and background noise by using vowel onset points (VOPs).

Speech is produced as a sequence of changes, which are known as events. For example, in speech there are phonetic events and acoustic events. Any change that can be attributed to the activity of the speech organs is a phonetic event; voicing and closure are phonetic events [5, 6, 7]. Any feature that is present in the acoustic signal is an acoustic event; burst, friction and Voice Onset Time (VOT) are acoustic events. From the perception point of view, events and the regions around them are known to contain important information [8]. The conventional block processing approach uses a fixed frame size (20–30 ms) to extract information and does not use knowledge of events. The Vowel Onset Point (VOP) is one of the important events in speech production. The VOP is defined as the instant at which the onset of the vowel takes place in the speech signal. The significance of the VOP can be observed in speech applications such as (i) recognition of Consonant-Vowel (CV) units, (ii) spotting CV segments in continuous speech, (iii) speaker recognition, (iv) speech rate manipulation and (v) enhancement of speech [8, 9, 10]. Accurate detection of VOPs is vital for these applications. Therefore, in this work, we propose accurate VOP detection methods for clean and degraded environments.
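The conventional block processing mentioned above can be illustrated with a minimal sketch. It cuts a signal into fixed-size frames without any knowledge of events. The 25 ms frame size lies within the 20–30 ms range stated in the text; the 10 ms frame shift, the 8 kHz sampling rate, and the function name `split_into_frames` are assumptions made only for this illustration.

```python
import numpy as np

def split_into_frames(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Cut a 1-D signal into fixed-size, overlapping frames (block processing).

    frame_ms follows the 20-30 ms range from the text; shift_ms is an assumed value.
    """
    signal = np.asarray(signal)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(signal) - frame_len) // shift
    # Row i holds samples [i * shift, i * shift + frame_len).
    idx = np.arange(frame_len)[None, :] + shift * np.arange(num_frames)[:, None]
    return signal[idx]

# One second of a synthetic 100 Hz tone at an assumed 8 kHz sampling rate.
sr = 8000
x = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)
frames = split_into_frames(x, sr)
```

Every frame has the same length and position grid regardless of where vowel onsets occur, which is the limitation that event-based processing around VOPs is meant to address.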

1.1 Objective of the thesis

At the signal level, robust information is present in speech around glottal closure and VOP events [8]. The objective of this work is to illustrate the significance of accurate VOP detection for speech processing in the mobile environment. Existing VOP detection methods suffer from poor detection accuracy. The speech signal in glottal closure regions exhibits a high Signal-to-Noise Ratio (SNR). Hence, processing glottal closure regions may be useful for accurate detection of VOPs under clean and degraded conditions. The knowledge of VOP events is used in the following studies:

• Recognition of CV units in the presence of speech coding and background noise.

• Spotting and recognition of CV units from continuous speech.

• Speaker identification (SI) in the presence of coding.

• Non-uniform Time Scale Modification (TSM).


1.2 Organization of the thesis

The evolution of the ideas presented in this thesis is listed in Table 1.1. The rest of the thesis is organized as follows:

Chapter 2: Background and literature survey - discusses the state-of-the-art methods for VOP detection and speech systems in the mobile environment. Existing approaches for CV recognition in Indian languages and time scale modification are also discussed in this chapter.

Chapter 3: Vowel onset point detection from coded and noisy speech - presents the proposed VOP detection methods for coded and noisy speech. The performance of the proposed VOP detection methods is compared with that of the existing method, which uses the combination of evidence from the excitation source, spectral peaks and modulation spectrum.

Chapter 4: Consonant-vowel recognition in presence of coding and background noise - presents the recognition performance of CV units in the presence of coding and background noise using the proposed two-stage hybrid approach. The proposed CV recognition approach combines complementary evidence from a Support Vector Machine (SVM) and a Hidden Markov Model (HMM) to improve recognition performance. The impact of the accuracy of the proposed VOP detection method on the recognition performance of CV units is studied using the proposed CV recognition approach in the presence of coding. Further, combined Temporal Spectral Processing (TSP) based preprocessing methods are used to improve the recognition performance of CV units under background noise.
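The idea of combining evidence from two classifiers can be sketched as a score-level fusion. The sketch below is only an illustration under stated assumptions and is not the combination rule used in the thesis, which is not specified in this chapter: it assumes that each model yields one score per CV class, that a softmax is an acceptable normalization, and that a weighted sum is used. The names `normalize_scores`, `combine_evidence`, and `weight`, and all score values, are hypothetical.

```python
import numpy as np

def normalize_scores(scores):
    """Map raw per-class scores to a distribution using a softmax (assumed choice)."""
    scores = np.asarray(scores, dtype=float)
    exps = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exps / exps.sum()

def combine_evidence(svm_scores, hmm_scores, weight=0.5):
    """Weighted sum of normalized SVM and HMM scores (illustrative fusion rule).

    Returns the index of the best class and the combined score vector.
    """
    combined = (weight * normalize_scores(svm_scores)
                + (1.0 - weight) * normalize_scores(hmm_scores))
    return int(np.argmax(combined)), combined

# Hypothetical scores for three CV classes.
svm_scores = [1.2, 0.9, -0.3]         # e.g. SVM decision values
hmm_scores = [-41.0, -38.5, -45.0]    # e.g. HMM log-likelihoods
best, combined = combine_evidence(svm_scores, hmm_scores)
```

Normalizing before summing keeps the two models on a comparable scale, since SVM decision values and HMM log-likelihoods have very different ranges.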

Chapter 5: Spotting and recognition of consonant-vowel units from continuous speech - discusses the need for accurate VOP detection for CV recognition in continuous speech. The proposed two-stage VOP detection method, which reduces spurious VOPs and improves the accuracy of genuine VOPs, is presented in this chapter.

Chapter 6: Speaker identification and time scale modification using VOPs - focuses on the application of the proposed VOP detection methods to improving SI performance in the presence of coding, and to non-uniform time scale modification (TSM). The proposed speaker identification system is developed using features extracted from the steady vowel region. Steady vowel regions are determined using vowel onset points and epochs. Further, the proposed non-uniform TSM method for slowing down and speeding up speech is presented.
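The notion of non-uniform time scale modification can be illustrated with a small duration-planning sketch. It only computes the new segment boundaries when each vowel type is scaled by its own factor and other segments are left unchanged. It does not modify a waveform and is not the method proposed in the thesis. The segment labels, times, scale factors, and the function name `plan_nonuniform_tsm` are all hypothetical.

```python
def plan_nonuniform_tsm(segments, vowel_factors, default_factor=1.0):
    """Plan new segment boundaries for non-uniform time scale modification.

    segments: list of (label, start_s, end_s) tuples in time order.
    vowel_factors: per-vowel duration scale factors (assumed values).
    Segments whose label is not in vowel_factors use default_factor.
    """
    planned = []
    cursor = 0.0
    for label, start, end in segments:
        factor = vowel_factors.get(label, default_factor)
        new_duration = (end - start) * factor
        planned.append((label, cursor, cursor + new_duration))
        cursor += new_duration
    return planned

# Hypothetical segmentation of a short utterance (times in seconds).
segments = [("k", 0.00, 0.05), ("a", 0.05, 0.20), ("t", 0.20, 0.26), ("i", 0.26, 0.36)]
vowel_factors = {"a": 1.5, "i": 1.2}   # slow down: each vowel stretched differently
plan = plan_nonuniform_tsm(segments, vowel_factors)
```

In a full system the segment boundaries would come from the detected VOPs and epochs, and the factors would reflect the behaviour of each vowel type in slow and fast speech.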

Chapter 7: Conclusions - summarizes the contributions of the thesis and discusses the scope for future investigation.


Table 1.1: Evolution of ideas presented in the thesis

• Speech systems in the mobile environment have become popular in recent years.

• The major issues in the mobile environment are background noise, speech coding and channel errors.

• Information present around VOP can be useful for speech processing in mobile environment.

• Existing VOP detection methods suffer from poor accuracy. Hence, there is a need to develop accurate VOP detection methods for both clean and degraded conditions.

• Glottal closure regions are known to be high SNR regions in speech. Therefore, the spectral energy of the speech signal in the glottal closure region can be explored for the detection of VOPs in the presence of coding and background noise.

• Crucial information about a CV unit is present around the VOP, and hence the VOP can be used as an anchor point for deriving the relevant information for CV recognition.

• Since the number of CV classes is large, multi-stage acoustic models may perform better than single-stage acoustic models. Hence, we explored a two-stage hybrid approach for improving the recognition performance of CV units.

• HMMs and SVMs are trained using different modalities, and hence they can provide complementary evidence. Therefore, at each stage of the proposed CV recognition method, complementary evidence from the SVM and the HMM is combined to enhance the recognition performance of CV units.

• Combined TSP methods are known to be useful for improving performance in speech enhancement and speaker recognition tasks. In this work, we explored their usefulness in the speech recognition task.

• VOPs can be used for spotting CV units in continuous speech. Hence, an accurate two-stage VOP detection method is proposed for spotting and recognition of CV units from continuous speech.

• Since speaker-specific characteristics are preserved in the steady vowel segments of speech even after coding, the features extracted from these steady vowel regions can be used to improve SI performance in the presence of coding. Hence, a method is proposed to determine the steady vowel region from the speech signal using VOPs and epochs.

• Due to the unique articulatory and production constraints associated with each type of vowel during slow and fast speech, vowel segments are expanded and compressed non-uniformly depending on the type of vowel. Therefore, a non-uniform TSM method is proposed using VOPs and epochs.