
Explicit acoustic-phonetic knowledge

From the document by Biswajit Dev Sarma (pages 65-70)

cover all possible contextual variations of sound units. Under mismatched training conditions, such as a dissimilar microphone, different background noise, or a mismatched channel, recognition performance degrades, and adaptation or retraining is required for a new operating environment.

In [115], an auditory-based front end was used to segment speech into broad classes. For detection, statistically determined thresholds were used to make rule-based decisions, and a detection accuracy of 85% was reported.

Segmentation-based methods mostly fail because of inaccurate segmentation of the boundaries.

It is easy to detect boundaries where an abrupt energy change takes place; the boundary between a stop consonant and a vowel is one example of such a sharp boundary. The boundary between a semivowel and a vowel, however, is not sharp: the energy change takes place over a transition region. Formant transitions are the main parameters for detecting such boundaries, and segmentation becomes difficult. If the algorithm is instead designed to detect small changes, the process ends up with over-segmentation [116].
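The energy-change criterion described above can be sketched as follows. This is an illustrative toy, not any algorithm from the cited works: the frame length, hop and threshold are assumptions, and the same threshold that catches sharp stop-vowel boundaries will over-segment gradual transitions if lowered.

```python
def short_time_energy(signal, frame_len=160, hop=80):
    """Per-frame energy of a sampled signal (list of floats)."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies

def detect_boundaries(energies, ratio_threshold=3.0, floor=1e-6):
    """Flag a boundary wherever the energy jumps or drops sharply
    between consecutive frames.  Lowering ratio_threshold to catch
    gradual semivowel-vowel transitions leads to over-segmentation."""
    boundaries = []
    for i in range(1, len(energies)):
        prev = max(energies[i - 1], floor)
        curr = max(energies[i], floor)
        ratio = max(curr / prev, prev / curr)
        if ratio > ratio_threshold:
            boundaries.append(i)
    return boundaries
```

On a synthetic low-energy-to-high-energy step (a crude stop burst), the detector fires exactly once at the transition frame; on a slow ramp it would either miss the boundary or, with a small threshold, fire repeatedly.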

In [117], segmentation was carried out at multiple levels and represented in a unified framework.

In landmark-based systems, features are extracted from the regions around certain landmarks rather than from the span between two boundaries or landmarks. Landmarks are the instants of abrupt articulatory change in the vocal tract where the acoustic features are most pertinent. The first step in such systems is to analyze the speech signal to detect the acoustic events, or landmarks. The second step is to extract the relevant acoustic-phonetic information about manner and place of articulation that helps classify the sound unit. The advantage of such a system is that relevant information is extracted from the appropriate regions while redundant information is eliminated. Different landmarks can also be analyzed differently, for example with different time resolutions.
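The idea of landmark-anchored analysis with landmark-dependent time resolution can be sketched as below. The landmark types and window lengths are assumptions chosen for illustration, not values from the cited systems.

```python
# Assumed per-landmark analysis window lengths (milliseconds).
WINDOW_MS = {"burst": 10, "vowel": 40, "sonorant": 25}

def extract_landmark_windows(signal, landmarks, fs=16000):
    """landmarks: list of (sample_index, landmark_type) pairs.
    Returns one raw-sample window centred on each landmark, clipped
    to the signal boundaries.  Feature computation (VOT, formants,
    band energies) would then run on each window."""
    windows = []
    for idx, kind in landmarks:
        half = int(fs * WINDOW_MS[kind] / 1000) // 2
        lo, hi = max(0, idx - half), min(len(signal), idx + half)
        windows.append((kind, signal[lo:hi]))
    return windows
```

A burst landmark thus gets a short, fine-resolution window while a vowel landmark gets a longer one, which is the point of analyzing different landmarks differently.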

Appropriate acoustic parameters can be studied depending on the landmark. For a burst landmark, VOT is important for determining the place of articulation of the consonant; for vowel recognition, the formant positions matter more. Another advantage is that the problem of separating semivowel-vowel pairs and diphthongs (necessary in segmentation-based approaches) is avoided.

Finally, the distinctive features are associated with acoustic-phonetic segments and converted to words using lexical knowledge or a language model.

In the literature, attempts have been made to extract acoustic correlates of the phonetic features [118], [55], [115], [73]. Efforts have also been made to detect the landmarks associated with these phonetic features [55], [115], [4], [80]. Liu addressed four groups of landmarks in [4], which were introduced in the first chapter. A method was proposed to detect these landmarks by processing the energy of the signal in six frequency bands. The glottis, sonorant and burst landmarks were recognized with error rates of 5%, 14% and 57%, respectively, when evaluated on a subset of the TIMIT database [4].
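A much-simplified sketch in the spirit of this band-energy processing is shown below: per-frame energies in a few frequency bands are converted to dB, and a rate-of-rise taken a few frames apart flags abrupt spectral change. The band layout, step size and threshold here are assumptions; Liu's actual detector in [4] uses six specific bands and coarse/fine two-pass processing.

```python
import math

def rate_of_rise(band_energies, step=2):
    """band_energies: list of frames, each a list of per-band energies.
    Returns per-frame dB differences taken `step` frames apart."""
    db = [[10 * math.log10(max(e, 1e-12)) for e in frame]
          for frame in band_energies]
    ror = []
    for t in range(step, len(db)):
        ror.append([db[t][b] - db[t - step][b] for b in range(len(db[t]))])
    return ror

def abrupt_frames(ror, threshold_db=9.0):
    """Frames where any band rises or falls by more than threshold_db;
    such frames are candidate landmark locations."""
    return [t for t, frame in enumerate(ror)
            if any(abs(d) > threshold_db for d in frame)]
```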

The detected landmarks were used to estimate the broad phonetic class of the hidden segments. In [80], temporal measurements were used to derive measures of periodicity, aperiodicity, and energy onset and offset; an overall landmark detection rate of 70.18% was obtained with this method. In another study, Park proposed a probabilistic knowledge-based algorithm to detect consonant landmarks such as glottis, sonorant and burst [119]. A deletion-and-substitution error of 12% and an insertion error of 15% were reported on the TIMIT test set.

A model for lexical access based on acoustic landmarks and phonetic features was proposed by Stevens [11]. Most landmark systems failed for lack of a probabilistic framework to handle variability in pronunciation. In [120], a probabilistic framework for landmark-based speech recognition was demonstrated. The speech signal was represented by a set of binary-valued articulatory phonetic features. The framework used SVMs as binary classifiers of the manner phonetic features; landmarks in the segmented regions were used for the source and place phonetic features. Finally, finite-state automata were used to constrain the probabilistic segmentation paths for connected word recognition. In [5], apart from the acoustic-phonetic knowledge, the framework was constrained by higher-level language information such as the pronunciation model of words and the durations of phonetic units. This probabilistic framework was used to recognize broad phonetic classes of the TIMIT database.
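The framework in [120] trains one binary classifier per manner feature (e.g. [sonorant]). As a dependency-free stand-in for the SVM used there, the sketch below trains a simple perceptron on fabricated two-dimensional "acoustic" features; it only illustrates the one-classifier-per-binary-feature structure, not the actual SVM training.

```python
def train_binary(X, y, epochs=50, lr=0.1):
    """Perceptron for a single binary phonetic feature (+1 / -1 labels).
    Stand-in for the SVM binary classifiers of manner features."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score <= 0:                     # misclassified
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b

def predict(model, x):
    """+1 if the binary phonetic feature is judged present, else -1."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

One such classifier would be trained per manner feature, and their outputs combined in the probabilistic segmentation.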

Although the system was evaluated on the TIDIGIT database, which contains a very limited vocabulary, building a continuous speech recognition system with a landmark-based approach remains a big challenge.

2.4.2 Event based approach for syllable recognition

Another type of explicit acoustic-phonetic knowledge based approach uses event or landmark detection at the front end, while conventional features (MFCCs) extracted from the region around the landmarks are used in a statistical speech recognition system. CV unit recognition systems for Indian languages are based on such an event-based approach [53], [12]. CV unit recognition is performed by anchoring on the vowel onset point (VOP), so the first step in such a system is to detect the VOP. Since the VOP is a point of abrupt change occurring at the consonant-vowel transition, it can be considered an event or landmark. The subsequent stages, however, are mostly statistical. The consonant in a CV unit is recognized from the speech segment on either side of the VOP, and the vowel is recognized from features extracted to the right of the VOP. In recent methods, consonants and vowels are recognized separately using SVMs and HMMs, respectively [121]. A brief review of the literature on CV unit recognition in Indian languages is presented below.
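The VOP-anchored split described above can be sketched as follows. The region lengths (in milliseconds) are assumptions for illustration, not the values used in the cited systems.

```python
def cv_regions(signal, vop, fs=16000, cons_ms=(40, 10), vowel_ms=60):
    """Split a CV segment around a VOP sample index.
    The consonant region spans both sides of the VOP (cons_ms[0] ms
    before, cons_ms[1] ms after); the vowel region lies to its right."""
    c_lo = max(0, vop - int(fs * cons_ms[0] / 1000))
    c_hi = min(len(signal), vop + int(fs * cons_ms[1] / 1000))
    v_hi = min(len(signal), vop + int(fs * vowel_ms / 1000))
    return signal[c_lo:c_hi], signal[vop:v_hi]
```

If the VOP is mis-detected, both regions shift, which is why accurate VOP detection is the critical first stage of such systems.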

In the literature, many symbols have been used as subword units, such as phonemes [22], characters [122] and syllable units (CnVCn, where C is a consonant, V is a vowel and n = 0, 1, 2, 3) [123]. The basic units of the writing system in Indian languages are syllabic in nature and are orthographic representations of the sounds [124]. Therefore, syllables are suitable subword units for Indian languages [53].

Moreover, the description of a syllable captures the coarticulation information relevant for its recognition [12]. The main issue in recognizing syllable units is their similar nature; another is the large number of syllable-like units. There are 33 consonants, 356 CC clusters, 77 CCC clusters, 1 CCCC cluster and 10 vowels [53], which make around 5000 syllable units in different possible combinations. However, the 10 vowels and 330 CV units constitute 90% of the occurrences in a text. Therefore, most studies were limited to CV units [122], [12], [53], and it was also shown that the information for CV unit recognition is present in the region around the VOP.
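The rough arithmetic behind these counts can be checked directly: pairing each onset type (a single consonant or a cluster) with the 10 vowels, plus the bare vowels, gives on the order of 5000 units, of which the plain CV units are 33 x 10. How the original count of "around 5000" was obtained is not stated, so this is one plausible reconstruction.

```python
consonants, cc, ccc, cccc, vowels = 33, 356, 77, 1, 10

onsets = consonants + cc + ccc + cccc      # 467 possible onsets
cv_units = consonants * vowels             # plain CV units
all_units = onsets * vowels + vowels       # every onset+vowel, plus bare vowels

print(cv_units, all_units)                 # -> 330 4680
```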

Chandra Sekhar explored machine learning approaches for spotting CV units [12]. For spotting isolated utterances of CV units, multilayer neural network models and time-delay neural networks were used [125]. Since the performance of neural network based systems decreases for a large number of classes, modular neural networks and constraint satisfaction models were used [12].

Suryakanth proposed new VOP detection techniques and explored non-linear compression methods, using auto-associative neural network models, for reducing the dimension of CV segmental patterns [53].

Dimensionality reduction of the features was carried out because high-dimensional segmental patterns need a large number of training examples for a multilayer neural network. Non-linear compression was found to perform better than principal component analysis. An SVM based system using the one-against-the-rest approach was found to perform better than the neural network models. A modification of this system was proposed by Vuppala et al. [121], where a two-stage CV unit recognition system was developed, consisting of HMMs at the first stage for recognizing the vowel category of a CV unit and SVMs at the second stage for recognizing the consonant category [121], [16].

Using the two-stage system, a CV unit recognition rate of 66.14% was reported on a Telugu broadcast news database [16].
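The structure of this two-stage system can be sketched as below: stage 1 picks the vowel category of the CV segment, and stage 2 picks the consonant with a classifier specific to that vowel category. The classifiers here are trivial stand-ins (the real system uses HMMs for the vowel stage and SVMs for the consonant stage), and the labels are fabricated.

```python
def two_stage_recognise(segment, vowel_model, consonant_models):
    """segment: feature vector for the CV unit.
    vowel_model: callable -> vowel label (HMM stage in [121], [16]).
    consonant_models: {vowel label: callable -> consonant label}
    (SVM stage, one classifier per vowel category)."""
    vowel = vowel_model(segment)
    consonant = consonant_models[vowel](segment)
    return consonant + vowel           # CV label, e.g. "ka"
```

Conditioning the consonant classifier on the recognized vowel narrows the consonant confusion set, which is the motivation for the two-stage design.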

A drawback of CV unit recognition systems is that the VOP must be spotted correctly; otherwise the CV unit cannot be recognized. Another drawback is that syllables with a coda or consonant clusters are not recognized. Therefore, complete phone recognition is not possible with CV unit recognition alone.

2.4.3 Explicit acoustic-phonetic knowledge in statistical systems

In another approach, explicit acoustic-phonetic features are used in a statistical framework. Such an approach uses acoustic-phonetic knowledge at the front end of a statistical speech recognition system.

Speech is processed frame by frame instead of around specific landmarks. The HMM based system is the most popular statistical speech recognition system. In the literature, acoustic-phonetic knowledge has been inserted into such systems at three different levels: the feature level [126], [127], [76], the model level [128] and the score level [129]. In feature-level insertion, the acoustic-phonetic knowledge is used as features that are appended to the standard MFCC features. In [126], phonetic features representing the manner features sonorant, syllabic, nonsyllabic, noncontinuant and fricated were extracted and used in an HMM framework. These phonetic features reduced the inter-speaker variability compared to the cepstral features.

In [127], acoustic features were mapped into a set of distinctive features using a set of classifiers, and the classifier outputs were added to the standard cepstral features at the feature level. The resulting phone recognition system showed improved performance. In some studies, acoustic-phonetic features were used in a hybrid ANN-HMM framework: instead of computing acoustic correlates of the distinctive features, neural networks were trained to map short-term spectral features to posterior probabilities of the distinctive features [130]. These probabilities were then used as features in an HMM based system. The error pattern of such systems was found to differ from that of conventional MFCC based systems.
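Mechanically, feature-level insertion amounts to per-frame concatenation, as in the minimal sketch below. The dimensions and posterior values are fabricated; in the cited systems the posteriors come from trained classifiers.

```python
def append_posteriors(mfcc_frames, posterior_frames):
    """Concatenate each per-frame MFCC vector with the corresponding
    distinctive-feature posterior vector, producing the augmented
    feature stream fed to the statistical recogniser."""
    if len(mfcc_frames) != len(posterior_frames):
        raise ValueError("frame counts must match")
    return [m + p for m, p in zip(mfcc_frames, posterior_frames)]
```

For example, appending two distinctive-feature posteriors to 13-dimensional MFCC frames yields 15-dimensional augmented frames.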

In [128], articulatory-motivated distinctive features containing manner and place of articulation information were extracted and added to the HMM framework at the state level. In [129], phone-level posterior probabilities were derived using an ANN and used to rescore the phone lattice generated by an HMM based phone recognizer. In [76], acoustic-phonetic information was added at all three levels, and the phoneme recognition performance improved. In [59], burst onsets were detected using random forest detectors; the intermediate posterior probabilities of the detectors were used as additional features alongside the MFCCs, which improved recognition performance for sounds containing a burst region. Binary features denoting whether a particular phonetic event is present were also appended to MFCCs, and the performance was found to improve significantly over MFCCs alone [76].

Table 2.4: Summary of implicit and explicit acoustic-phonetic knowledge used for phone recognition

| Knowledge | Application | Feature | Framework | Accuracy (% Acc) | Database |
|---|---|---|---|---|---|
| Implicit acoustic-phonetic knowledge | Phone recognition | MFCCs | HMM-GMM [131] | 67.60 | TIMIT |
| | | MFCCs | CD HMM-GMM [132] | 72.70 | TIMIT |
| | | MFCCs | HMM-ANN [96] | 73.42 | TIMIT |
| | | MFCCs | HMM-SGMM [131] | 80.50 | TIMIT |
| | | MFCCs | HMM-DNN [132] | 77.00 | TIMIT |
| | | Mel filter bank coefficients | deep LSTM RNN [111] | 82.30 | TIMIT |
| | | Raw speech | CNN-CRF [106] | 69.47 | TIMIT |
| Explicit acoustic-phonetic knowledge | Broad phonetic class recognition [5] | 11 acoustic parameters | Landmark based probabilistic | 79.50 | Subset of TIMIT |
| | | 11 acoustic parameters | HMM-GMM | 73.70 | Subset of TIMIT |
| | | MFCCs | Landmark based probabilistic | 78.20 | Subset of TIMIT |
| | | MFCCs | HMM-GMM | 80.00 (baseline) | Subset of TIMIT |
| | CV unit recognition | MFCCs | VOP based 2-stage system using HMM and SVM [16] | 66.14 | Telugu broadcast news |
