1.2.3 Works in first decade of 21st century

Wong et al. [46] investigated the GMM-universal background model (GMM-UBM) for language modeling as a speed-enhanced alternative to the standard GMM system. MFCCs along with delta coefficients were used for speech parametrization. On comparing with the standard GMM based LR approach, a slight degradation was noted. In [47], the authors presented a phonotactic-acoustic LR system that combines extended phonotactic and acoustic features using a neural network classifier, together with a phonotactic background normalization method for rejecting unknown languages.
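As a rough illustration of the acoustic GMM approach discussed above, the sketch below extracts MFCCs with delta coefficients and picks the language whose GMM gives the highest average frame log-likelihood. It is a minimal sketch, assuming the librosa and scikit-learn libraries and hypothetical file names, and it trains independent per-language GMMs rather than the MAP-adapted GMM-UBM used in [46].

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_with_deltas(wav_path, sr=8000, n_mfcc=13):
    """Return a (frames, 2*n_mfcc) matrix of MFCCs plus delta coefficients."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    delta = librosa.feature.delta(mfcc)                       # first-order deltas
    return np.vstack([mfcc, delta]).T                         # (frames, 2*n_mfcc)

# Train one GMM per language on pooled training frames (hypothetical file lists).
train_lists = {"english": ["en_001.wav", "en_002.wav"],
               "hindi":   ["hi_001.wav", "hi_002.wav"]}
models = {}
for lang, files in train_lists.items():
    frames = np.vstack([mfcc_with_deltas(f) for f in files])
    models[lang] = GaussianMixture(n_components=64, covariance_type="diag").fit(frames)

# Identify the language of a test utterance by average frame log-likelihood.
test_frames = mfcc_with_deltas("unknown.wav")
scores = {lang: gmm.score(test_frames) for lang, gmm in models.items()}
print(max(scores, key=scores.get))
```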

Singer [48] has described three LR systems based on the Lincoln Laboratory implementations of phone recognition, GMM, and support vector machine (SVM) approaches. The LR performance was evaluated on the 1996 and 2003 NIST test sets. In [49, 50], an SVM classifier using the generalized linear discriminant sequence (GLDS) kernel has been used on SDC features. For contrast, the author also developed a GMM based LR system and found a slight degradation in the performance of the proposed system. Nagarajan et al. [51, 52] have proposed the syllable-like unit instead of the phoneme as the basic sound unit in a framework similar to PPR, but without using annotated corpora. In this approach, similar syllable segments are clustered and syllable models are trained incrementally. The language identity of an unknown test utterance is determined with the help of language-dependent syllable models. In [53], acoustic and phonetic speech information has been utilized for an automatic spoken LR system. A GMM-UBM based LR system employing various acoustic features has been developed, and vocal tract length normalization (VTLN) [54, 55], which is used to reduce speaker variation, is also studied for the LR task. Different aspects of phonetic speech information were also explored for the PPRLM based LR system. In [56], rhythmic units related to syllables are segmented using a vowel detection algorithm; these rhythmic units are used to extract various parameters, which are modeled using GMM. In [57], a set of Legendre polynomials is used to approximate a segment of the pitch contour, the coefficients of the polynomials form a feature vector, and the extracted feature vectors are modeled using GMM.
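To make the pitch-contour parametrization of [57] concrete, the following minimal sketch fits a low-order Legendre expansion to one voiced pitch segment with NumPy; the segment values and the polynomial order are illustrative assumptions, not the settings of the cited work.

```python
import numpy as np
from numpy.polynomial import legendre

# Hypothetical pitch (F0) values, in Hz, for one voiced segment (e.g., one syllable).
f0_segment = np.array([110.0, 112.5, 118.0, 124.0, 127.5, 126.0, 121.0, 115.0])

# Map the frame indices of the segment onto [-1, 1], the natural domain of Legendre polynomials.
x = np.linspace(-1.0, 1.0, num=len(f0_segment))

# Fit a low-order Legendre expansion; the coefficients act as the feature vector
# describing the shape of this pitch-contour segment.
order = 3
coeffs = legendre.legfit(x, f0_segment, deg=order)   # shape: (order + 1,)

# The fitted contour can be reconstructed for inspection.
approx = legendre.legval(x, coeffs)
print("feature vector:", np.round(coeffs, 2))
print("max abs error (Hz):", np.round(np.max(np.abs(approx - f0_segment)), 2))
```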

The LR work discussed in [58] uses an SVM classifier in which the Louradour sequence kernel with a background GMM is used to map the variable-length sequence of acoustic feature vectors, consisting of MFCCs along with their delta coefficients, to a fixed-dimensional space. On comparing with the GLDS kernel, the Louradour sequence kernel resulted in improved LR performance. Campbell et al. [59] proposed GMM supervectors with an SVM classifier for the speaker verification task. This approach has been further extended to the LR domain [60] along with an adaptive relevance factor to take care of the duration mismatch. A GMM supervector is a high-dimensional representation of a speech utterance, derived by concatenating the mixture means of the language-adapted GMM-UBM. In that approach, 56-dimensional SDC feature vectors (formed by concatenating 7 static MFCCs plus 49 dynamic features using the 7-1-3-7 delta-shift pattern) were used after voice activity detection (VAD). In [61], vector space modeling (VSM) was proposed for the automatic spoken LR task. This approach leads to a discriminative classifier backend, which was demonstrated to give superior performance over a likelihood-based n-gram language modeling backend for long utterances. Ma et al. [62] adopted a distributed output coding strategy in ensemble classifier design for vector space spoken LR. In this approach, the multi-class LR problem is decomposed into many binary classification tasks, and the results of the classifiers are combined to form an output code as a hypothesized solution to the overall LR problem. Matejka et al. [63] described the Brno University of Technology (BUT) LR system for the 2007 NIST Language Recognition Evaluation (LRE). A total of 13 LR systems were fused (4 acoustic and 9 phonotactic). The 4 acoustic LR systems were (i) a GMM system with 2048 Gaussians per language and eigen-channel adaptation, (ii) a discriminatively trained GMM using maximum mutual information (GMM-MMI), (iii) GMM-MMI with channel-compensated features, and (iv) an SVM classifier on GMM supervectors. The phonotactic systems were based on three phone recognizers: two ANN/HMM hybrids and one based on GMM/HMM context-dependent models. In [64], an alternative approach to the PPRLM system has been proposed, which aims at improving the acoustic diversification among its parallel subsystems by using multiple acoustic models. Mary and Yegnanarayana [65] explored vowel onset point (VOP) based prosodic features for the LR task.
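The 7-1-3-7 shifted delta cepstra construction mentioned above can be sketched as follows. This is a minimal NumPy illustration operating on a precomputed matrix of static cepstral frames; the edge padding and the random stand-in data are assumptions made for a self-contained example, not details of the cited systems.

```python
import numpy as np

def sdc(cepstra, d=1, p=3, k=7):
    """Shifted delta cepstra.

    cepstra : (num_frames, n) matrix of static cepstral coefficients (n = 7 for 7-1-3-7).
    Returns a (num_frames, n * k) matrix where, for each frame t, the k delta blocks
    c(t + i*p + d) - c(t + i*p - d), i = 0..k-1, are stacked.
    """
    num_frames, n = cepstra.shape
    # Replicate the edges so every frame has the context it needs.
    padded = np.pad(cepstra, ((d, d + (k - 1) * p), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        shift = i * p
        plus = padded[2 * d + shift : 2 * d + shift + num_frames]
        minus = padded[shift : shift + num_frames]
        blocks.append(plus - minus)
    return np.hstack(blocks)

# Example: 7 static MFCCs per frame -> 49-dimensional SDC; concatenating the statics
# as in the systems above gives the 56-dimensional feature.
static = np.random.randn(200, 7)                 # stand-in for real MFCC frames
features = np.hstack([static, sdc(static)])      # shape (200, 56)
print(features.shape)
```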

The locations of VOPs are used to obtain the syllable-like regions in continuous speech. The prosodic cues such as stress, rhythm, and intonation are characterized by the energy contour, duration, and pitch contour, respectively. The quantitative parameters of the energy contour, pitch contour, and duration are computed from the syllable-like regions and form the prosodic features. For classification, a multilayer feedforward neural network is trained using these features. In [66], an eigen-language space, in analogy to the eigenvoice modeling approach, has been proposed; the estimated language factors are applied as input features to SVM classifiers. In [67], PCA-based feature extraction for phonotactic LR has been proposed. Using PCA, the dimensionality of the fixed-length feature vectors obtained from phone recognizer lattices is reduced, and the reduced vectors are used with the SVM-based system. Stolcke et al. [68] have demonstrated the superior performance of language-independent phone recognizers in both PRLM and PPRLM based LR systems. The authors have also investigated ways to use speaker adaptation transforms estimated by maximum likelihood linear regression (MLLR) as complementary features for language characterization. Torres-Carrasquillo et al. [69] have presented a description of the MIT Lincoln Laboratory LR system submitted to the NIST 2007 LRE. In this, a fusion of four LR systems was used, namely a GMM-MMI spectral system, an SVM classifier on GMM supervectors, an SVM language classifier using the lattice output of an English tokenizer, and a language model classifier using the lattice output of a Hungarian tokenizer. Similar approaches were used on the NIST 2009 LRE database [70]. The per-system score calibration was done using a single discriminatively trained Gaussian, followed by score fusion with logistic regression. In [71], dimensionality reduction techniques such as PCA and random projection (RP) for using higher-order n-grams in SVM-based phonotactic LR have been investigated. Richardson and Campbell [72] have applied nuisance attribute projection (NAP) to a token-based LR system; the NAP formulation was done in such a way that it scales well to high-dimensional sparse feature vectors. In [73], an attempt has been made to develop a two-level LR system for Indian languages using acoustic features. In the first level, the system identifies the family of the spoken language, and the utterance is then fed to the second level, which aims at identifying the particular language within the corresponding family. In [74], a wide range of prosodic attributes has been considered for improving the LR performance; these attributes are modeled by the bag-of-n-grams approach with SVM.
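The two-level scheme of [73] is essentially a hierarchical classifier: a family-level decision followed by a language-level decision within the predicted family. The sketch below illustrates this with scikit-learn SVMs on randomly generated stand-in features and hypothetical family/language labels; it is not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical utterance-level features with language and family labels.
# In practice these would come from an acoustic front end such as SDC statistics.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 56))
languages = rng.choice(["hindi", "bengali", "tamil", "telugu"], size=400)
family_of = {"hindi": "indo-aryan", "bengali": "indo-aryan",
             "tamil": "dravidian", "telugu": "dravidian"}
families = np.array([family_of[l] for l in languages])

# Level 1: classify the language family.
family_clf = SVC(kernel="rbf").fit(X, families)

# Level 2: one language classifier per family, trained only on that family's data.
language_clf = {}
for fam in set(families):
    idx = families == fam
    language_clf[fam] = SVC(kernel="rbf").fit(X[idx], languages[idx])

def identify(x):
    """Two-level decision: first the family, then the language within it."""
    fam = family_clf.predict(x.reshape(1, -1))[0]
    return language_clf[fam].predict(x.reshape(1, -1))[0]

print(identify(X[0]))
```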

In summary, the major contributions in the first decade of the 21st century in terms of features are the SDC features, the VOP-based prosodic features, and the GMM supervector as an utterance representation. In terms of statistical modeling techniques, GMM-UBM and SVM played a major role in achieving the improved performances. VTLN and NAP were also found helpful in improving the LR performance. A good introductory tutorial on the fundamentals of the theory and the state-of-the-art solutions can be found in [1].