Taking advantage of this fact, the combination of the proposed ensemble approach with JFA is also investigated. The language-specific coefficients in each of the s-vectors are averaged to obtain multiple score vectors.
Perceptual cues
However, in the open-ended task, non-target languages are also included in the test set. To address this challenge, a number of automated multilingual services have been developed in the past.
Literature survey on language recognition
- Very early works
- Works in last decade of 20th century
- Works in first decade of 21st century
- Works in present decade
- Low-rank modeling
- Sparse domain modeling
- Deep neural network modeling
In [53], acoustic and phonetic speech information was used to build an automatic spoken LR system. In [113], the performance of feedforward neural network-based LR using line spectral frequency and MFCC front-end features is compared.
Motivation of the study
Recently, the work in [110] explored a DNN-based LR system that uses acoustic features at the input and replaces the ASR-derived output labels with labels learned from a Bayesian non-parametric model. The specifications and performance of the salient language recognition systems reported in the literature are summarized in Table 1.1.
Contributions of the thesis
Organization of the thesis
The language-specific coefficients of each of the s-vectors are averaged and used as score vectors. The following sections describe the functionality of each of the blocks used to build the language detection system.
Speech parametrization
In CMN, the average value of the cepstral coefficients, calculated over the entire utterance, is subtracted from each of the cepstral coefficients. The feature vector to be normalized is located in the center of the sliding window.
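As an illustration, a minimal numpy sketch of utterance-level CMN and its sliding-window variant might look as follows; the window length and the synthetic MFCC data are placeholders, not settings taken from this work.

```python
import numpy as np

def cmn(cepstra):
    """Utterance-level cepstral mean normalization.

    cepstra: (num_frames, num_coeffs) array of cepstral coefficients.
    The mean computed over the whole utterance is removed from every frame.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def sliding_cmn(cepstra, win=301):
    """Sliding-window CMN: the frame being normalized sits at the center
    of the window; the window is truncated near the utterance edges."""
    num_frames = cepstra.shape[0]
    half = win // 2
    out = np.empty_like(cepstra)
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        out[t] = cepstra[t] - cepstra[lo:hi].mean(axis=0)
    return out

# Example: 500 frames of 13-dimensional MFCCs (synthetic data).
mfcc = np.random.randn(500, 13) + 5.0
print(cmn(mfcc).mean(axis=0))            # ~0 for every coefficient
print(sliding_cmn(mfcc, win=101).shape)  # (500, 13)
```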
Statistical modeling techniques
- Gaussian mixture model
- GMM-universal background model
- GMM mean supervector representation
- i-vector representation
The GMM mean supervector is a high-dimensional vector derived from the adapted GMM-UBM [142] and has proved to be an effective representation in the speaker and language recognition domains [60, 143]. The mean vectors corresponding to a language-adapted GMM-UBM are concatenated to form the GMM mean supervector s.
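A minimal sketch of forming a (mean shifted) GMM mean supervector by stacking the adapted component means is given below; the UBM size and feature dimension are illustrative, not the settings used in this work.

```python
import numpy as np

def gmm_mean_supervector(adapted_means, ubm_means=None):
    """Concatenate the mean vectors of a language-adapted GMM-UBM.

    adapted_means: (num_components, feat_dim) array of adapted means.
    ubm_means:     optional UBM means; if given, the supervector is
                   mean shifted by subtracting them component-wise.
    Returns a supervector of length num_components * feat_dim.
    """
    means = adapted_means if ubm_means is None else adapted_means - ubm_means
    return means.reshape(-1)

# Illustrative sizes: 1024-component UBM, 56-dimensional features.
C, F = 1024, 56
ubm_means = np.random.randn(C, F)
adapted_means = ubm_means + 0.1 * np.random.randn(C, F)  # stand-in for MAP adaptation
s_shifted = gmm_mean_supervector(adapted_means, ubm_means)
print(s_shifted.shape)                                    # (57344,)
```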
Session/Channel compensation
Joint factor analysis
The factor Vv is referred to as the JFA-compensated supervector, or JFA supervector for short. The MAP point estimate v of the latent factor associated with the language loading matrix V is referred to in this work as the JFA latent vector.
Linear discriminant analysis
In JFA [75], the GMM mean shifted supervector s̃ for a language is represented as the sum of three factors as s̃ = Vv + Ux + Dz, where V and U are the language and channel loading matrices, D is a diagonal residual matrix, and v, x, and z are the corresponding latent factors. In factor analysis, the i-vectors have the standard normal distribution w ∼ N(0, I); therefore, the mean vector ω of the language population is equal to the null vector.
Within-class covariance normalization
Classifiers
- Cosine distance scoring
- Support vector machine
- Generative Gaussian model
- Logistic regression
- Gaussian-PLDA
For a given test i-vector wtst, the classification is performed by noting the sign of the function f(wtst). The Gaussian probabilistic linear discriminant analysis (G-PLDA) model assumes that each i-vector wk can be decomposed into a language component l and a channel component c, i.e., wk = l + c.
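As an example, cosine distance scoring of a test i-vector against per-language model i-vectors can be sketched as below; this is a generic illustration, not the exact scoring backend of this work, and the dimensions are arbitrary.

```python
import numpy as np

def cosine_score(w_model, w_test):
    """Cosine distance score between two (compensated) i-vectors."""
    return float(w_model @ w_test /
                 (np.linalg.norm(w_model) * np.linalg.norm(w_test)))

# Illustrative 400-dimensional i-vectors for 14 candidate languages.
rng = np.random.default_rng(0)
language_models = rng.standard_normal((14, 400))  # one averaged i-vector per language
w_tst = rng.standard_normal(400)

scores = np.array([cosine_score(m, w_tst) for m in language_models])
print("detected language index:", scores.argmax())
```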
Score calibration
Memory intensive and harder to tune, as there are many kernels to choose from. A generative model that places a Gaussian prior on the underlying class variable and so can model classes with very limited training data.
Databases
AP17-OLR
The AP17-OL3 database, however, was created as part of the Multilingual Minority Automatic Speech Recognition (M2ASR) project in China. According to the challenge rules, participants can use all of the mentioned databases to train the LR system, excluding the AP17-OL3 test set, and are required to report results on the development kit.
Performance evaluation metric
The geometric interpretation of the p-norm in 2-D space is shown in Figure 3.2: for p = 0.5 the unit ball is an astroid, while for p = 2 it is a circle.
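The p-norm being discussed can be illustrated with a few lines of numpy; the point below is arbitrary and only shows how the norm value changes with p.

```python
import numpy as np

def p_norm(x, p):
    """(sum_i |x_i|^p)^(1/p); for p < 1 this is not a true norm,
    but the same expression is used when discussing sparsity."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([0.6, 0.8])
for p in (0.5, 1.0, 2.0):
    print(p, p_norm(x, p))
# Points of equal p-norm trace the unit "ball": an astroid-like
# concave curve for p = 0.5, a diamond for p = 1, and a circle for p = 2.
```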
Sparse solutions of a linear system of equations
To obtain an approximate solution, the optimization problem in equation 3.7 is reformulated to tolerate a bounded reconstruction error. In the statistical community, the optimization problem related to equation 3.13 is known as the Lasso [173] and is formulated as min_x ‖y − Dx‖²₂ + λ‖x‖₁.
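A hedged sketch of solving the Lasso formulation with scikit-learn follows; the dictionary, signal, and regularization value are synthetic placeholders, and scikit-learn scales the squared-error term by 1/(2·n_samples).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
D = rng.standard_normal((100, 300))          # dictionary: m = 100, p = 300 atoms
x_true = np.zeros(300)
x_true[rng.choice(300, size=5, replace=False)] = rng.standard_normal(5)
y = D @ x_true                               # target vector

# minimize (1/(2m)) * ||y - D x||_2^2 + alpha * ||x||_1  (sklearn's scaling)
lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
lasso.fit(D, y)
x_hat = lasso.coef_
print("nonzeros in solution:", np.count_nonzero(x_hat))
```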
Sparse representation based language recognition
- Sparse coding over exemplar dictionary based LR
- Language identification
- Language detection
- Sparse coding over learned dictionary based LR
- K-SVD/OL dictionary learning with ℓ0 regularization
- K-SVD/OL dictionary learning with ℓ1 regularization
- K-SVD/OL dictionary learning with ℓ1-ℓ2 regularization
- Sparse coding over class-specific learned dictionary based LR
- Class-based K-SVD/OL dictionary learning with ℓ0 regularization
- Class-based K-SVD/OL dictionary learning with ℓ1 regularization
- Class-based K-SVD/OL dictionary learning with ℓ1-ℓ2 regularization
- Label consistent-KSVD dictionary based language recognition
The learned dictionary can be designed to have fewer bases, so finding a sparse representation of a test vector over a reduced-size learned dictionary is computationally cheaper than over an exemplar dictionary (the language-wise concatenation of training i-vectors). The dictionary learning problem can be divided into two subproblems: a) finding a sparse representation of the data vectors over the initial dictionary, or over the one obtained in the previous iteration, and b) updating the dictionary atoms while keeping the sparse codes (s-vectors) fixed. Given the training data for the lth language containing nl m-dimensional vectors Yl = {yl,i}, i = 1, …, nl, and a sparsity constraint γ′, the class-based dictionary learning problem can be defined as min over (Dl, Xl) of ‖Yl − DlXl‖²F subject to ‖xl,i‖₀ ≤ γ′ for all i, where Dl is the dictionary of the lth language and Xl collects its sparse codes.
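A minimal sketch of class-based dictionary learning, one dictionary per language, is given below using scikit-learn's DictionaryLearning. Note that this solver uses an ℓ1-penalized fit rather than the exact ℓ0-constrained K-SVD/OL formulation, and the dictionary size, sparsity, and data sizes are placeholders rather than the tuned values reported later.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_class_dictionaries(ivectors_per_language, n_atoms=28, sparsity=5):
    """Learn one dictionary per language from its training i-vectors.

    ivectors_per_language: dict mapping language id -> (n_l, m) array Y_l.
    Returns a dict mapping language id -> (n_atoms, m) dictionary D_l.
    """
    dictionaries = {}
    for lang, Y_l in ivectors_per_language.items():
        dl = DictionaryLearning(
            n_components=n_atoms,
            fit_algorithm="lars",
            transform_algorithm="omp",
            transform_n_nonzero_coefs=sparsity,  # sparsity constraint at coding time
            max_iter=50,
        )
        dl.fit(Y_l)                              # rows of Y_l are training vectors
        dictionaries[lang] = dl.components_      # rows are the learned atoms
    return dictionaries

# Synthetic example: 3 languages, 200 training i-vectors of dimension 400 each.
rng = np.random.default_rng(2)
data = {l: rng.standard_normal((200, 400)) for l in range(3)}
dicts = learn_class_dictionaries(data, n_atoms=28, sparsity=5)
print({l: D.shape for l, D in dicts.items()})
```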
Experimental setup
Dataset and system parameters
The numbers of language factors (LF) and channel factors (CF) are chosen to be 14 and 400, respectively. The performance of the language detection systems is primarily evaluated using the average detection cost function Cavg.
Results and discussion
- Contrast LR systems
- Language recognition using sparse representation over an exemplar dictionary
- Language recognition using sparse representation over class-based learned dictionary
- D-KSVD and LC-KSVD learned dictionary based language recognition
In this subsection, LR using sparse representation over K-SVD/OL learned dictionaries is reported for two utterance representations, the i-vector and the GMM mean supervector, using three sparse coding algorithms (OMP, Lasso, and ENet). The sparse coding phase is used to find the sparse representation of the training data samples given the dictionary.
Summary
Comparing the two utterance representations, the GMM mean supervector and the i-vector, it is observed that the performance of the proposed language recognition approaches is much better with the i-vector when appropriate channel compensation is applied. The main disadvantage of the i-vector approach lies in its computational complexity and storage requirements.
Introduction
The low-dimensional projections of the GMM mean supervectors, known as i-vectors, are derived using a total variability matrix. The i-vector representation followed by session/channel compensation based on LDA/WCCN forms the state-of-the-art.
Proposed ensemble of subspaces based language recognition
Algorithm
The steps of the proposed ensemble-based LR approach are as follows (a code sketch follows the list):
i) Given the language-independent GMM-UBM, generate the zero- and first-order statistics for the utterance.
ii) Using the statistics, create a GMM mean supervector and remove the bias from it by subtracting the UBM mean supervector, giving the mean shifted supervector s̃.
iii) Randomly permute the elements of the derived GMM mean shifted supervector, if required.
iv) Divide the supervector s̃ ∈ R^M into N subvectors of length K.
v) Apply session/channel compensation to each subvector; let s̃ci ∈ R^P denote the corresponding compensated subvector of length P ≤ K.
vi) Concatenate the session/channel compensated subvectors {s̃ci}, i = 1, …, N, to form a single vector x of length P × N.
vii) Given a training vector xtrn and a test vector xtst, calculate the cosine distance score as Score = (x′trn · xtst)/(‖xtrn‖ ‖xtst‖).
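A compact numpy sketch of steps iv) to vii) is given below; identity matrices stand in for the per-slice session/channel compensation, and all sizes are chosen only for illustration.

```python
import numpy as np

def ensemble_score(s_trn, s_tst, slice_len, projections):
    """Score two mean shifted supervectors with the ensemble-of-subspaces idea.

    s_trn, s_tst: supervectors of length M (already UBM-mean shifted).
    slice_len:    length K of each subvector (M must be divisible by K).
    projections:  list of per-slice compensation matrices (K x P each),
                  e.g. LDA/WCCN projections trained per subspace.
    """
    def compensate(s):
        slices = s.reshape(-1, slice_len)                  # N subvectors of length K
        comp = [sl @ P for sl, P in zip(slices, projections)]
        return np.concatenate(comp)                        # vector of length N * P

    x_trn, x_tst = compensate(s_trn), compensate(s_tst)
    return float(x_trn @ x_tst / (np.linalg.norm(x_trn) * np.linalg.norm(x_tst)))

# Illustration: M = 8192, slices of K = 1024, identity "compensation" with P = K.
M, K = 8192, 1024
rng = np.random.default_rng(3)
proj = [np.eye(K) for _ in range(M // K)]
print(ensemble_score(rng.standard_normal(M), rng.standard_normal(M), K, proj))
```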
Partitioning of supervector
Experimental setup
Results and discussion
Combination of JFA with the ensemble-based approach
Note that, for further session/channel compensation, the JFA-compensated supervectors cannot be modeled with a model of rank higher than the one used in the JFA processing. These results show that a slice length of 1024 is optimal for all combinations of session/channel compensation methods.
Computational complexity
The proposed ensemble approach combined with JFA is certainly more complex than the simplified i-vector based approach. Even when simplified factor analysis is used, however, the relative complexity advantage of the proposed approach over the i-vector based approach remains intact.
Summary
The number of such exemplar dictionaries becomes very high if the training dataset is large. Clearly, the proposed ensemble of learned-exemplar dictionaries (formed by merging class-based learned dictionaries) based LR approach can efficiently handle large training datasets while still providing the diversity gain.
Proposed ensemble exemplar dictionary based LR system
The language-specific coefficients in each of the s-vectors are averaged to produce Score-vector 1, Score-vector 2, and Score-vector 3, respectively. For each of the G exemplar dictionaries: a) find the sparse code (i.e., the s-vector) of the test i-vectors.
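A small sketch of turning an s-vector obtained over an exemplar dictionary into a score vector by averaging its language-specific coefficients is shown below; the per-language block layout of the dictionary atoms is an assumption made for illustration.

```python
import numpy as np

def score_vector(s_vec, atoms_per_language):
    """Average the coefficients of an s-vector language block by block.

    s_vec:              sparse code over an exemplar dictionary whose atoms
                        are grouped language-wise.
    atoms_per_language: list giving how many atoms belong to each language.
    Returns one averaged score per language.
    """
    scores, start = [], 0
    for n in atoms_per_language:
        scores.append(s_vec[start:start + n].mean())
        start += n
    return np.array(scores)

# Example: 14 languages, 20 exemplar atoms per language.
rng = np.random.default_rng(4)
s_vec = rng.standard_normal(14 * 20)
print(score_vector(s_vec, [20] * 14))   # one entry per language
```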
Proposed ensemble learned-exemplar dictionary based LR system
The proposed approach does not require any random sampling, and therefore there is no chance of the same i-vector appearing in different dictionaries (i.e., no repeated selection of the same feature vector). The salient steps involved in the proposed system are summarized below (a sketch of the dictionary construction follows):
i) A set of m-dimensional WCCN session/channel compensated training i-vectors corresponding to each of the task languages is given.
ii) Create an exemplar dictionary by choosing k unique i-vectors from each of the L languages.
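The dictionary construction in step ii) can be sketched as follows. The deterministic slicing strategy below is only one way to guarantee that no i-vector is reused across dictionaries; the sizes are synthetic.

```python
import numpy as np

def build_exemplar_dictionaries(ivectors_per_language, k, num_dicts):
    """Build G exemplar dictionaries, each with k unique i-vectors per language.

    ivectors_per_language: dict language -> (n_l, m) array of compensated i-vectors.
    Consecutive, non-overlapping slices ensure no i-vector appears twice.
    """
    dictionaries = []
    for g in range(num_dicts):
        blocks = []
        for lang, Y_l in ivectors_per_language.items():
            sel = Y_l[g * k:(g + 1) * k]            # k unique vectors for this dictionary
            blocks.append(sel)
        dictionaries.append(np.vstack(blocks))      # atoms grouped language-wise
    return dictionaries

# Example: 3 languages, 100 i-vectors of dimension 400 each, k = 10, G = 5.
rng = np.random.default_rng(5)
data = {l: rng.standard_normal((100, 400)) for l in range(3)}
dicts = build_exemplar_dictionaries(data, k=10, num_dicts=5)
print(len(dicts), dicts[0].shape)                   # 5 dictionaries of shape (30, 400)
```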
Experiments with 2007 NIST LRE data
Results and discussion
- Ensemble exemplar and learned dictionary based LR systems on the i-vector representation
- Ensemble exemplar and learned dictionary based LR systems on the JFA latent vector representation
In the following, we present the performance evaluation of the ensemble exemplar and ensemble learned-exemplar dictionary based LR systems. This motivated us to explore the proposed ensemble exemplar and ensemble learned-exemplar dictionary based LR approaches with the JFA latent vector as the utterance representation.
Experiments with 2009 NIST LRE data
- Database
- Shifted delta cepstral feature
- Language modeling, classifiers, and dictionary learning
- Results and discussion
We investigated the SR classifier in all the dictionary-based LR approaches, which include the single exemplar dictionary, the single learned dictionary, the ensemble of exemplar dictionaries, and the ensemble of learned-exemplar dictionaries. The performance of the proposed ensembles of exemplar and learned-exemplar dictionaries using SR for the LR task is summarized in Table 5.3.
Summary
Time delay neural network
In a TDNN unit, each of the J inputs is delayed through delays D1 to DN, and the weighted sum of the non-delayed and delayed inputs is calculated and passed through an activation function. This procedure is performed iteratively for all training patterns until the network converges to the desired output.
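A numpy sketch of a single TDNN unit computing the weighted sum over the non-delayed and delayed input frames is given below; the delays, dimensions, and the tanh nonlinearity are illustrative choices, not the configuration used in this work.

```python
import numpy as np

def tdnn_unit(inputs, weights, bias, delays=(0, 1, 2)):
    """One TDNN unit: weighted sum over the current and delayed input frames.

    inputs : (num_frames, J) sequence of J-dimensional input vectors.
    weights: (len(delays), J) one weight vector per delay tap.
    Frames that would need context before t = 0 are skipped.
    """
    num_frames, _ = inputs.shape
    max_d = max(delays)
    outputs = []
    for t in range(max_d, num_frames):
        z = bias + sum(weights[i] @ inputs[t - d] for i, d in enumerate(delays))
        outputs.append(np.tanh(z))              # activation function
    return np.array(outputs)

# Example: 100 frames of 40-dimensional filter-bank features.
rng = np.random.default_rng(6)
frames = rng.standard_normal((100, 40))
w = rng.standard_normal((3, 40)) * 0.1
print(tdnn_unit(frames, w, bias=0.0).shape)     # (98,)
```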
Long short-term memory
This is done to efficiently model a long temporal context while keeping training times comparable to those of a standard feed-forward DNN.
Bottleneck feature extraction
Extraction of robust statistics
The i-vector extraction using GMM posteriors
The i-vector extraction using DNN posteriors
Capturing of phonetic sequence information
DNN conditioning of sparse coding-based LR system
Experimental setup
- Database
- Feature extraction
- MFCC features
- Bottleneck features
- FBanks features
- PTN system
- i-vector modeling, classifiers and calibration
Thus, the 60-dimensional features obtained at the output of the bottleneck layer are referred to as BNF. The TDNN, LSTM, and PTN-based LR systems use 40-dimensional FBanks as raw spectral features, extracted from each of the speech frames.
Results and discussion
The performance of a BNF-based i-vector system using GMM/DNN posteriors with three session/channel compensation techniques is shown in Table 6.2 and Table 6.3, respectively. Table 6.4 reports the LR performance of GMM/DNN posteriors-based i-vector representation on conventional MFCCs and bottleneck features.
Summary
Compared with the ensemble of exemplar dictionaries, the ensemble of learned-exemplar dictionaries achieves improved performance with a reduced number of dictionaries. In addition, DNN posteriors are also used for extracting the i-vectors.
Salient contributions
An ensemble of exemplar and learned-exemplar dictionaries is also proposed, considering both the i-vector and the JFA latent vector representations. It should be noted that the performance of the ensemble of learned-exemplar dictionaries with the low-dimensional JFA latent vector is better than that of the contrast LR system, i.e., the i-vector with the GG classifier.
Directions for future work
The steps involved in the OMP algorithm are given in Algorithm 1.
Inputs: the dictionary D ≡ {di}, i = 1, …, p, with di ∈ R^m, the target vector y ∈ R^m, and γ, the number of nonzero elements allowed in the sparse solution.
Initialization: initialize the iteration variable k = 0, set the initial solution x0 = 0 and its support S0 = Support{x0} = ∅, the empty set.
i) Element selection: determine the dictionary column dj⋆ that maximizes the correlation with the current residual.
ii) Support update: augment the support set as Sk = Sk−1 ∪ {j⋆}.
iii) Provisional solution update: compute xk, the minimizer of ‖y − Dx‖²₂ subject to Support{x} = Sk.
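A minimal numpy implementation of the OMP steps listed above is shown below, written as a sketch (least squares over the active support) rather than an optimized solver; the synthetic recovery example is only a sanity check.

```python
import numpy as np

def omp(D, y, gamma):
    """Orthogonal matching pursuit.

    D:     (m, p) dictionary with (ideally unit-norm) columns d_i.
    y:     (m,) target vector.
    gamma: number of nonzero elements allowed in the sparse solution.
    """
    m, p = D.shape
    support = []                                  # S_k
    x = np.zeros(p)
    residual = y.copy()
    for _ in range(gamma):
        # i) element selection: column most correlated with the residual
        j_star = int(np.argmax(np.abs(D.T @ residual)))
        # ii) support update
        if j_star not in support:
            support.append(j_star)
        # iii) provisional solution: least squares restricted to the support
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x = np.zeros(p)
        x[support] = coeffs
        # iv) residual update
        residual = y - D @ x
        if np.linalg.norm(residual) < 1e-10:
            break
    return x

# Quick check on a synthetic sparse recovery problem.
rng = np.random.default_rng(7)
D = rng.standard_normal((50, 200))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(200); x_true[[3, 60, 150]] = [1.0, -2.0, 0.5]
print(np.nonzero(omp(D, D @ x_true, gamma=3))[0])   # typically recovers [3, 60, 150]
```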
Least angle regression
In the sparse coding step, the data vectors are represented as a linear combination of only a few dictionary atoms. In the dictionary update stage, the atoms of the dictionary are modified so that the total representation error for the given set of data vectors is minimized.
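The dictionary update stage of K-SVD can be sketched as follows: each atom is refreshed, together with its nonzero codes, from the rank-1 SVD of the residual restricted to the signals that use that atom. This is a compact illustration under synthetic data, not a production implementation.

```python
import numpy as np

def ksvd_atom_update(D, X, Y):
    """One K-SVD dictionary-update sweep.

    D: (m, p) dictionary, X: (p, n) sparse codes, Y: (m, n) data vectors.
    """
    for j in range(D.shape[1]):
        users = np.nonzero(X[j, :])[0]          # signals that use atom j
        if users.size == 0:
            continue
        # residual without atom j's contribution, restricted to its users
        E_j = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
        U, s, Vt = np.linalg.svd(E_j, full_matrices=False)
        D[:, j] = U[:, 0]                        # updated atom (unit norm)
        X[j, users] = s[0] * Vt[0, :]            # updated nonzero codes
    return D, X

# Synthetic example: 100 signals of dimension 30, 50 atoms, one active atom each.
rng = np.random.default_rng(8)
Y = rng.standard_normal((30, 100))
D = rng.standard_normal((30, 50)); D /= np.linalg.norm(D, axis=0)
X = np.zeros((50, 100))
X[rng.integers(0, 50, 100), np.arange(100)] = 1.0
D, X = ksvd_atom_update(D, X, Y)
print(np.linalg.norm(Y - D @ X))                 # representation error after one sweep
```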
D-KSVD dictionary learning
Label consistent-KSVD dictionary learning
The special case of the LC-KSVD problem with β = 0 is called LC-KSVD1, while with non-zero values of both regularization parameters α and β it is called LC-KSVD2. The estimate of the linear classifier Ŵ in the case of LC-KSVD1 is obtained by solving a separate multivariate ridge regression problem, min_W ‖H − WX‖² + λ‖W‖², where X contains the sparse codes of the training data and H the corresponding label matrix.
Online dictionary learning
With a fixed dictionary size of 28, the best performance in terms of Cavg for language detection is noted with dictionary sparsity λ′1 = 0.01 and decomposition sparsity λ1 = 0.001, while for language identification the optimal parameters are λ′1 = 0.02 and λ1 = 0.001. Using OL with OMP and the i-vector representation, the best language detection performance is achieved with dictionary sparsity γ′ = 15 and decomposition sparsity γ = 20, as shown in Figure C.5(a), while for the language identification task the sparsity constraints γ′ = 20 and γ = 20 give the best result, as shown in Figure C.5(b).
Parameters tuning in learned dictionary based LR with ENet