Table 3.6: Performance of LR systems using D-KSVD and LC-KSVD learned dictionaries with the best tuned parameter values. The K-SVD learned dictionary of size 400×28, employing different regularization techniques (OMP, Lasso, and ENet) on WCCN session/channel compensated i-vectors, is considered.

Dictionary/Regularization | LD: Parameter(s) | LD: 100×Cavg | LID: Parameter(s) | LID: IDR (%)
D-KSVD/OMP | γ′ = 1, γ = 22, √β = 0.3 | 2.93 | γ′ = 10, γ = 23, √β = 0.2 | 90.67
D-KSVD/Lasso | λ′1 = 0.05, λ1 = 0.01, √β = 0.2 | 2.92 | λ′1 = 0.1, λ1 = 0.01, √β = 0.4 | 90.59
D-KSVD/ENet | λ′1 = 0.05, λ′2 = 0.1, λ1 = 0.01, λ2 = 0.1, √β = 0.1 | 2.99 | λ′1 = 0.1, λ′2 = 0.9, λ1 = 0.01, λ2 = 0.8, √β = 0.1 | 90.13
LC-KSVD1/OMP | γ′ = 6, γ = 22, √α = 0.1 | 2.76 | γ′ = 1, γ = 21, √α = 0.1 | 90.78
LC-KSVD1/Lasso | λ′1 = 0.05, λ1 = 0.001, √α = 0.1 | 2.93 | λ′1 = 0.005, λ1 = 0.01, √α = 0.1 | 90.46
LC-KSVD1/ENet | λ′1 = 0.05, λ′2 = 0.2, λ1 = 0.001, λ2 = 0.4, √α = 0.1 | 2.97 | λ′1 = 0.005, λ′2 = 0, λ1 = 0.01, λ2 = 1, √α = 0.2 | 91.01
LC-KSVD2/OMP | γ′ = 6, γ = 21, √α = 0.1, √β = 0.2 | 2.71 | γ′ = 1, γ = 23, √α = 0.1, √β = 0.2 | 90.68
LC-KSVD2/Lasso | λ′1 = 0.05, λ1 = 0.001, √α = 0.1, √β = 0.1 | 2.86 | λ′1 = 0.03, λ1 = 0.015, √α = 0.1, √β = 0.2 | 90.57
LC-KSVD2/ENet | λ′1 = 0.05, λ′2 = 0, λ1 = 0.001, λ2 = 0, √α = 0.1, √β = 0.1 | 2.91 | λ′1 = 0.03, λ′2 = 0.9, λ1 = 0.015, λ2 = 0.9, √α = 0.2, √β = 0.1 | 91.05
Table 3.6 reports the performance of the D-KSVD and LC-KSVD based LR systems with the best tuning parameters. On comparing D-KSVD and LC-KSVD, a slight improvement in the LR performance is noted with LC-KSVD. The improved results are due to the incorporation of the discriminative sparse-code error term enforcing label consistency, in addition to the terms (i.e., the reconstruction and classification error terms) used in the D-KSVD formulation. In LID, ENet based sparse coding is found to give better results than OMP and Lasso, whereas in LD the OMP based LR system is slightly better than the systems employing Lasso and ENet based regularization. It is to be noted that the performance of the LC-KSVD based LR system is superior among the D-KSVD, class-wise learned dictionary, simple learned dictionary, and exemplar dictionary based LR systems. A large improvement is also observed compared to the i-vector LR systems employing the GG classifier.
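For reference, the label-consistency mechanism referred to above corresponds to the extra α-weighted term in the standard LC-KSVD2 objective of Jiang et al.; the sketch below uses that paper's notation, which may differ slightly from the exact formulation adopted in this work, and D-KSVD is recovered by dropping the α term:

```latex
% Standard LC-KSVD2 objective (notation of Jiang et al.); D-KSVD is the special case
% alpha = 0, and LC-KSVD1 the special case beta = 0.
% Y: training vectors, D: dictionary, X: sparse codes, Q: ideal discriminative codes
% enforcing label consistency, A: linear transform, H: label matrix, W: linear classifier.
\min_{D,A,W,X} \; \|Y - DX\|_F^2
  + \alpha \,\|Q - AX\|_F^2
  + \beta  \,\|H - WX\|_F^2
\quad \text{s.t.} \quad \|x_i\|_0 \le T \;\; \forall i
```

Consistent with this, the D-KSVD rows of Table 3.6 report only √β, the LC-KSVD1 rows only √α, and the LC-KSVD2 rows both.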
3.5 Summary
In this chapter, we have first analyzed the effect of three session/channel compensation techniques on i-vector utterance representation based LR employing the CDS classifier. For score calibration, GB, MLR, and GB+MLR have been investigated. The experimental results show superior performance with WCCN compensation and GB+MLR calibration. Further, various classifiers (SVM, G-PLDA, MLR, GG, and CDS) on the i-vector have been explored. Among the classifiers, the best performance is noted with the WCCN compensated i-vector and the GG classifier.
The SRC based LR system has also been explored, where an exemplar dictionary has been created from i-vector/GMM-mean supervector utterance representations with appropriate compensation techniques. The sparse coding is performed over the exemplar dictionary employing three regularization techniques (OMP, Lasso, and ENet). The best performance has been noted with the WCCN compensated i-vector using ENet based sparse coding. Compared with the i-vector system employing the GG classifier, slightly degraded performance has been noted for the SRC based LR.
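As an illustration of the SRC decision rule described above, the following minimal sketch computes the sparse code of a test i-vector over an exemplar dictionary and picks the language with the smallest class-wise reconstruction residual. It uses scikit-learn's Lasso as one of the three regularizers (OMP and ElasticNet can be substituted analogously); all function and variable names are illustrative, not taken from the thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso  # OrthogonalMatchingPursuit / ElasticNet are drop-in alternatives


def src_classify(test_vec, exemplar_dict, labels, alpha=0.01):
    """Sparse Representation Classification over an exemplar dictionary.

    test_vec:      (dim,) compensated i-vector of the test utterance
    exemplar_dict: (dim, n_exemplars) matrix of training i-vectors (columns unit-normalized)
    labels:        (n_exemplars,) NumPy array of language labels, one per column
    """
    # Sparse coding of the test i-vector over the exemplar dictionary.
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    coder.fit(exemplar_dict, test_vec)
    codes = coder.coef_

    # Class-wise reconstruction residual: keep only the coefficients of one language at a time.
    residuals = {}
    for lang in np.unique(labels):
        mask = (labels == lang)
        recon = exemplar_dict[:, mask] @ codes[mask]
        residuals[lang] = np.linalg.norm(test_vec - recon)

    # Decide in favour of the language giving the smallest residual.
    return min(residuals, key=residuals.get)
```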
With an increase in the number of training examples (e.g., i-vectors and GMM-mean supervectors), the size of the exemplar dictionary grows, and the computational burden of sparse coding over such a dictionary becomes prohibitive. To address that, various sparse representation techniques over a simple learned dictionary have also been explored for the LR task. For creating the learned dictionary, both the K-SVD and OL dictionary learning approaches employing different kinds of regularization have been investigated. Once the dictionary is learned, s-vectors corresponding to the training and test utterances are extracted, and scores are computed with the help of the CDS classifier. It is to be noted that the class labels are not retained in the case of the simple learned dictionary, and hence SRC is not feasible. To address this problem, the class-based learned dictionary has also been explored, where the language-specific learned dictionaries are concatenated to form the learned-exemplar dictionary.
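A minimal sketch of this s-vector extraction and CDS scoring pipeline is given below, with scikit-learn's MiniBatchDictionaryLearning standing in for the K-SVD/OL learning stage; the data, dictionary size, and regularization value are placeholders rather than the tuned settings used in the thesis.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Placeholder data: in the actual system these would be WCCN-compensated i-vectors and labels.
rng = np.random.default_rng(0)
train_ivectors = rng.standard_normal((280, 400))   # (n_train, i-vector dim)
train_labels = np.repeat(np.arange(28), 10)        # 28 languages, 10 utterances each
test_ivectors = rng.standard_normal((5, 400))      # (n_test, i-vector dim)

# Learn a dictionary from the training i-vectors (stand-in for the K-SVD/OL step).
dl = MiniBatchDictionaryLearning(n_components=64, alpha=0.01,
                                 transform_algorithm='lasso_lars', random_state=0)
dl.fit(train_ivectors)

# s-vectors: sparse codes of the training and test utterances over the learned dictionary.
train_svec = dl.transform(train_ivectors)
test_svec = dl.transform(test_ivectors)

# Language models: mean s-vector per language.
models = {l: train_svec[train_labels == l].mean(axis=0) for l in np.unique(train_labels)}


def cds_scores(svec, models):
    """Cosine distance scoring of one s-vector against every language model."""
    return {l: float(svec @ m / (np.linalg.norm(svec) * np.linalg.norm(m) + 1e-12))
            for l, m in models.items()}


scores = cds_scores(test_svec[0], models)
print(max(scores, key=scores.get))  # hypothesized language for the first test utterance
```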
The performance of the learned-exemplar dictionary based LR is much better than that of the simple learned dictionary, which may be attributed to the discriminative nature of the s-vectors computed using the learned-exemplar dictionary.
The D-KSVD and LC-KSVD learned discriminative dictionaries, specially designed for classification, have also been investigated. On comparing D-KSVD and LC-KSVD, a slight improvement in the LR performance is noted with LC-KSVD. The improved results are due to the incorporation of the discriminative sparse-code error term enforcing label consistency, in addition to the reconstruction and classification error terms used in the D-KSVD formulation.
On comparing the two utterance representations, the GMM mean supervector and the i-vector, it is observed that the performance of the proposed language recognition approaches is much better with the i-vector employing suitable channel compensation. The major drawback of the i-vector approach lies in its computational complexity and storage requirements. Motivated by these constraints, in the next chapter an ensemble of random subspaces of the supervector based LR approach is explored.
4
Low Complexity Language Recognition Exploiting Ensemble of Random Subspace
Contents
4.1 Motivation
4.2 Introduction
4.3 Proposed ensemble of subspaces based language recognition
4.4 Experimental setup
4.5 Results and discussion
4.6 Summary
4.1 Motivation
The current approaches for spoken LR are predominantly based on the GMM mean supervector or its variants as the representation of the utterances. It is assumed that the language information lies in a linear manifold of low dimensional spaces. Exploiting this assumption, low dimensional projections of the GMM mean supervectors, known as i-vectors, are derived using a total variability matrix. The i-vector representation followed by LDA/WCCN based session/channel compensation forms the state-of-the-art. The major drawback of the i-vector approach lies in its computational complexity and storage requirements. Let m, p, and q represent the number of Gaussian mixture components, the feature dimension, and the size of the i-vector, respectively. Using Equation 2.13, the computational complexity and memory requirement involved in i-vector estimation for an utterance are O(mpq + mq² + q³) and O(mpq + mq²), respectively. For example, with the typical parameter values m = 1024, p = 39, and q = 400, the computational complexity and memory requirement of i-vector extraction are of the order of 243,814,400 operations and 179,814,400 bits (≈ 22.477 MB), respectively (these figures are reproduced in the short check below). Motivated by these constraints, in this chapter we explore an ensemble of random subspaces of the supervector based LR approach, which includes LDA/WCCN based session/channel compensation in the subspaces. The proposed approach does not require any learning for creating the subspaces and has very low storage requirements. In typical LR applications, the number of languages required to be identified is limited to 10-30. As a result, the complexity of the JFA based system turns out to be lower than that of the i-vector based one.
Exploiting this fact, the combination of the proposed ensemble approach with JFA is also explored.
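The cost figures quoted in the motivation follow directly from the stated complexity expressions; the short check below reproduces them, assuming the memory figure is counted in bits and that 1 MB = 10^6 bytes.

```python
# Reproduce the i-vector extraction cost figures quoted above.
m, p, q = 1024, 39, 400                 # Gaussian components, feature dim, i-vector dim

ops = m * p * q + m * q**2 + q**3       # O(mpq + mq^2 + q^3) operations
mem_bits = m * p * q + m * q**2         # O(mpq + mq^2), counted here in bits

print(f"{ops:,} operations")            # 243,814,400
print(f"{mem_bits:,} bits")             # 179,814,400
print(f"{mem_bits / 8 / 1e6:.3f} MB")   # ≈ 22.477 MB
```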