EXPLORATION OF SPARSE REPRESENTATION TECHNIQUES IN LANGUAGE RECOGNITION

Taking advantage of this fact, the combination of the proposed ensemble approach with JFA is also investigated. The language-specific coefficients in each of the s-vectors are averaged to obtain multiple score vectors.

Perceptual cues

However, in the open-ended task, non-target languages are also included in the test. To address this challenge, a number of automated multilingual services have been developed in the past.

Literature survey on language recognition

  • Very early works
  • Works in last decade of 20th century
  • Works in first decade of 21st century
  • Works in present decade
    • Low-rank modeling
    • Sparse domain modeling
    • Deep neural network modeling

In [53], acoustic and phonetic speech information was used to build an automatic spoken LR system. In [113], the performance of feedforward neural network-based LR using line spectral frequency and MFCC front-end features was compared.

Motivation of the study

Recently, work [110] explored a DNN-based LR system that uses acoustic features at the input and replaces the ASR-based output labels with labels learned from a Bayesian non-parametric model. The specifications and performance of the salient language recognition systems reported in the literature are summarized in Table 1.1.

Table 1.1: Summary of specifications and performances of salient language recognition systems reported in the literature.

Contributions of the thesis

Organization of the thesis

The language-specific coefficients of each of the s-vectors are averaged and used as score vectors. The following sections describe the functionality of each of the blocks used to build the language detection system.

Figure 2.1: Block diagram of the i-vector based language detection system

Speech parametrization

In CMN, the mean of the cepstral coefficients, computed over the entire utterance, is subtracted from each cepstral coefficient. In the sliding-window variant, the feature vector to be normalized is located at the center of the window.
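As an illustration, a minimal numpy sketch of both CMN variants follows, assuming a hypothetical feature matrix `cepstra` with one row per frame; the window length is an illustrative choice, not a value from the thesis.

```python
import numpy as np

def cmn(cepstra: np.ndarray) -> np.ndarray:
    """Utterance-level CMN: subtract the per-coefficient mean of the whole utterance."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def sliding_cmn(cepstra: np.ndarray, win: int = 301) -> np.ndarray:
    """Sliding-window CMN: each frame is normalized by the mean of a window
    centered on it (truncated at the utterance boundaries)."""
    half = win // 2
    out = np.empty_like(cepstra)
    for t in range(cepstra.shape[0]):
        lo, hi = max(0, t - half), min(cepstra.shape[0], t + half + 1)
        out[t] = cepstra[t] - cepstra[lo:hi].mean(axis=0)
    return out
```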

Figure 2.3: Illustration of the extraction of the t-th SDC feature frame for the parameter set N-d-P-k = 7-1-3-7 [1]
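A minimal sketch of SDC stacking for the 7-1-3-7 configuration is given below, assuming 7-dimensional cepstral frames in the rows of `cepstra`; boundary frames are handled by clipping, which may differ from the exact convention used in the thesis.

```python
import numpy as np

def sdc(cepstra: np.ndarray, d: int = 1, P: int = 3, k: int = 7) -> np.ndarray:
    """cepstra: (T, N) static cepstral frames. Returns (T, N*k) SDC features."""
    T = cepstra.shape[0]
    def frame(idx):
        return cepstra[np.clip(idx, 0, T - 1)]          # clip indices at the boundaries
    blocks = []
    for i in range(k):
        idx = np.arange(T) + i * P                      # i-th shifted block
        blocks.append(frame(idx + d) - frame(idx - d))  # delta computed at shift i*P
    return np.hstack(blocks)                            # 7-1-3-7 gives 49-dimensional SDC
```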

Statistical modeling techniques

  • Gaussian mixture model
  • GMM-universal background model
  • GMM mean supervector representation
  • i-vector representation

The GMM mean supervector is a high-dimensional vector derived from the adapted GMM-UBM [142] and constitutes one of the best representations in the speaker and language recognition domains [60, 143]. The mean vectors of a language-adapted GMM-UBM are concatenated to form the GMM mean supervector s.
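The construction can be sketched as follows, assuming a GMM-UBM fitted with scikit-learn's GaussianMixture and a simple mean-only relevance-MAP adaptation; the relevance factor value is illustrative, not a setting from the thesis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mean_supervector(ubm: GaussianMixture, X: np.ndarray, r: float = 16.0) -> np.ndarray:
    """X: (T, D) frames of one utterance. Mean-only relevance-MAP adaptation
    of the UBM, followed by concatenation of the adapted component means."""
    post = ubm.predict_proba(X)                      # (T, C) frame posteriors
    n = post.sum(axis=0)                             # zero-order statistics
    f = post.T @ X                                   # first-order statistics, (C, D)
    alpha = (n / (n + r))[:, None]                   # data-dependent adaptation weights
    adapted = alpha * (f / np.maximum(n[:, None], 1e-8)) + (1.0 - alpha) * ubm.means_
    return adapted.reshape(-1)                       # concatenated C*D supervector
```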

Session/Channel compensation

Joint factor analysis

The factor Vv is referred to as the JFA compensated supervector, or JFA supervector for short. The MAP point estimate v corresponding to the matrix V is referred to in this work as the JFA latent vector.

Linear discriminant analysis

In JFA [75], the GMM mean shifted supervector s̃ for a language is represented as the sum of three factors, s̃ = Vv + Ux + Dz. In factor analysis, the i-vectors have a standard normal distribution, w ∼ N(0, I); therefore the mean vector ω of the language population is equal to the null vector.

Within-class covariance normalization

Classifiers

  • Cosine distance scoring
  • Support vector machine
  • Generative Gaussian model
  • Logistic regression
  • Gaussian-PLDA

For a given test i-vector wtst, classification is performed by noting the sign of the function f(wtst). The Gaussian-probabilistic linear discriminant analysis (G-PLDA) model assumes that each i-vector wk can be decomposed into a language component l and a channel component c, and expressed as wk = l + c.
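A minimal sketch of this sign-based decision with one-vs-rest linear SVMs on synthetic stand-in i-vectors is shown below; the array shapes and the number of target languages are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
train_ivectors = rng.normal(size=(200, 400))          # dummy 400-dimensional i-vectors
train_labels = rng.integers(0, 14, size=200)          # 14 hypothetical target languages
test_ivectors = rng.normal(size=(10, 400))

svm = LinearSVC(C=1.0).fit(train_ivectors, train_labels)
scores = svm.decision_function(test_ivectors)         # f(w_tst), one column per language
detections = scores > 0                               # detection: sign of f(w_tst)
identified = svm.classes_[np.argmax(scores, axis=1)]  # identification: highest score
```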

Score calibration

Memory intensive and harder to tune because of its hyperparameters; there are also many kernels to choose from. A generative model that places a Gaussian prior on the underlying class variable, and so can model classes with very limited training data.

Databases

AP17-OLR

However, the AP17-OL3 database was created as part of the Multilingual Minority Automatic Speech Recognition (M2ASR) project in China. According to the challenge rules, participants can use all the mentioned databases to train the LR system, excluding the AP17-OL3 test set, and are required to report results on the development kit.

Performance evaluation metric

The geometric interpretation of the lp norm in 2-D space for p = 0.5 is shown in Figure 3.2, which is an astroid. The geometric interpretation of the lp norm in 2-D space for p = 2, also shown in Figure 3.2, is a circle.

Figure 3.1: Behavior of the scalar function f(x) = |x|^p, the core of the lp-norm computation, for various values of p in 1-D space
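A small numeric illustration of why smaller p promotes sparsity, using the scalar cost f(x) = |x|^p summed over coordinates, is given below; the two points compared are arbitrary examples.

```python
import numpy as np

def lp_cost(x: np.ndarray, p: float) -> float:
    """Sum of |x_i|^p over the coordinates of x."""
    return float(np.sum(np.abs(x) ** p))

sparse_point = np.array([1.0, 0.0])   # all "mass" on one coordinate
dense_point = np.array([0.5, 0.5])    # same l1 mass spread over both coordinates
for p in (0.5, 1.0, 2.0):
    print(p, lp_cost(sparse_point, p), lp_cost(dense_point, p))
# For p = 0.5 the sparse point is cheaper, for p = 1 they tie,
# and for p = 2 the dense point is cheaper: small p favors sparsity.
```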

Sparse solutions of a linear system of equations

To obtain an approximate solution, the optimization problem in equation 3.7 is reformulated accordingly. In the statistical community, the optimization problem related to equation 3.13 is known as the Lasso [173] and is formulated as minimizing ‖y − Dx‖₂² + λ‖x‖₁ over x.
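A minimal sketch of solving such a Lasso problem with scikit-learn is shown below; the dictionary, target vector, and regularization value are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 300))                   # over-complete dictionary, m < p
x_true = np.zeros(300)
x_true[[5, 42, 77]] = [1.0, -0.7, 0.5]            # a 3-sparse ground truth
y = D @ x_true + 0.01 * rng.normal(size=100)

lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
x_hat = lasso.fit(D, y).coef_                     # sparse approximate solution
print(np.flatnonzero(np.abs(x_hat) > 1e-3))       # indices of the recovered support
```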

Sparse representation based language recognition

  • Sparse coding over exemplar dictionary based LR
    • Language identification
    • Language detection
  • Sparse coding over learned dictionary based LR
    • K-SVD/OL dictionary learning with l0 regularization
    • K-SVD/OL dictionary learning with l1 regularization
    • K-SVD/OL dictionary learning with l1−l2 regularizations
  • Sparse coding over class-specific learned dictionary based LR
    • Class-based K-SVD/OL dictionary learning with l0 regularization
    • Class-based K-SVD/OL dictionary learning with l1 regularization
    • Class-based K-SVD/OL dictionary learning with l1−l2 regularization
  • Label consistent-KSVD dictionary based language recognition

The learned dictionary can be designed to have fewer bases, so finding a sparse representation of a test vector over a reduced-size learned dictionary is computationally cheaper than over an exemplar dictionary (the language-wise concatenation of training i-vectors). The dictionary learning problem can be divided into two subproblems: a) finding a sparse representation of the data vectors over the initial dictionary or the one obtained in the previous iteration, and b) updating the dictionary atoms while keeping the sparse codes (s-vectors) fixed. Given the training data for the l-th language containing nl m-dimensional vectors Yl = {yl,i, i = 1, ..., nl} and a sparsity constraint γ′, the class-based dictionary learning problem can be defined as minimizing ‖Yl − DlXl‖_F² over the dictionary Dl and the sparse codes Xl, subject to ‖xl,i‖₀ ≤ γ′ for every training vector.
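A simplified sketch of this two-step alternation is given below, using OMP for the sparse-coding step and a MOD-style least-squares update in place of the K-SVD atom-by-atom update; it is an illustrative substitute for, not a reproduction of, the procedure used in the thesis.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def learn_dictionary(Y: np.ndarray, n_atoms: int = 64, sparsity: int = 5,
                     n_iter: int = 10, seed: int = 0) -> np.ndarray:
    """Y: (m, n_samples) training vectors in columns. Alternates sparse
    coding and a least-squares dictionary update."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        # a) sparse-code all training vectors over the current dictionary
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity)
        X = omp.fit(D, Y).coef_.T                    # codes, shape (n_atoms, n_samples)
        # b) update the dictionary with the codes held fixed
        D = Y @ np.linalg.pinv(X)
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    return D
```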

Experimental setup

Dataset and system parameters

The numbers of language factors (LF) and channel factors (CF) are chosen to be 14 and 400, respectively. The performance of the language detection systems is primarily evaluated using the average detection cost function Cavg.

Results and discussion

  • Contrast LR systems
  • Language recognition using sparse representation over an exemplar dictionary
  • Language recognition using sparse representation over a class-based learned dictionary
  • D-KSVD and LC-KSVD learned dictionary based language recognition

In this subsection, LR using sparse representation over K-SVD/OL dictionaries is reported for two utterance representations, the i-vector and the supervector, using three sparse coding algorithms (OMP, Lasso, and ENet). The sparse coding stage finds the sparse representation of the training data samples given the dictionary.

Table 3.1: Performances of LDA/WCCN/LDA+WCCN compensated i-vector based LR systems employing CDS classifier with different score calibration schemes.

Summary

Comparing the two utterance representations, the GMM mean supervector and the i-vector, it is observed that the proposed approaches perform much better with the i-vector when appropriate channel compensation is applied. The main disadvantage of the i-vector approach lies in its computational complexity and storage requirements.

Introduction

Exploiting low-dimensional projections of the GMM mean supervector, i-vectors are derived using a total variability matrix. The i-vector representation followed by session/channel compensation based on LDA/WCCN forms the state of the art.

Proposed ensemble of subspaces based language recognition

Algorithm

The steps of the proposed ensemble-based LR approach are:
i) Given the language-independent GMM-UBM, generate the zero- and first-order statistics for the utterance.
ii) Using the statistics, create a GMM mean supervector and remove its bias by subtracting the UBM mean supervector.
iii) Randomly permute the elements of the derived mean-shifted supervector s̃ if required.
iv) Divide the supervector s̃ ∈ R^M into N subvectors.
v) Apply session/channel compensation to each subvector; let s̃ci ∈ R^P denote the corresponding compensated subvector of length P ≤ K.
vi) Concatenate the session/channel compensated subvectors {s̃ci}, i = 1, ..., N, to form a single vector x of length P×N.
vii) Given the training vector xtrn and the test vector xtst, calculate the cosine distance score as Score = x′trn · xtst.
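A minimal sketch of steps iv)–vii) is given below; the per-slice compensation matrices are assumed to be precomputed (e.g., WCCN per slice), and all names are illustrative.

```python
import numpy as np

def slice_and_compensate(s_tilde: np.ndarray, comp_mats: list) -> np.ndarray:
    """Split the mean-shifted supervector into len(comp_mats) slices,
    compensate each slice, and re-concatenate (steps iv)-vi))."""
    slices = np.array_split(s_tilde, len(comp_mats))
    compensated = [W @ v for W, v in zip(comp_mats, slices)]  # per-slice compensation
    return np.concatenate(compensated)

def cosine_score(x_trn: np.ndarray, x_tst: np.ndarray) -> float:
    """Step vii): cosine distance score between training and test vectors."""
    return float(x_trn @ x_tst / (np.linalg.norm(x_trn) * np.linalg.norm(x_tst)))
```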

Partitioning of supervector

Experimental setup

Results and discussion

Combination of JFA with the ensemble-based approach

Note that, for further session/channel compensation, the JFA-compensated supervectors cannot be modeled with a higher-rank model than the one used in the JFA processing. These results show that a slice length of 1024 is optimal for all combinations of session/channel compensation methods.

Table 4.1: Slice length-wise performances of the proposed ensemble based LR systems on the GMM supervector for various session/channel compensation methods

Computational complexity

The proposed ensemble approach with JFA is certainly more complex than the simplified i-vector based approach. Even with the use of simplified factor analysis, the relative complexity advantage of the proposed approach over the i-vector based approach remains intact.

Table 4.3: LR performances of the proposed ensemble based approach with/without JFA preprocessing (language factors, LF = 11; channel factors, CF = 200), averaged over 10 iterations, along with those of the contrast systems.

Summary

The number of such exemplar dictionaries becomes very high if the training dataset is large. The proposed ensemble of learned-exemplar dictionaries (merging class-based learned dictionaries) based LR approach can efficiently handle large training datasets while also providing a diversity gain.

Proposed ensemble exemplar dictionary based LR system

The language-specific coefficients in each of the s-vectors are averaged to produce Score vector 1, Score vector 2, and Score vector 3, respectively. For each of the G exemplar dictionaries: a) find the sparse code (i.e., the s-vector) of the test i-vectors.

Figure 5.1: Flow diagram of the proposed ensemble exemplar dictionary based LR approach
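A minimal sketch of turning one s-vector into a language score vector is shown below; `atom_language`, which maps each dictionary atom to its language, is an assumed bookkeeping array.

```python
import numpy as np

def language_scores(s_vector: np.ndarray, atom_language: np.ndarray, n_lang: int) -> np.ndarray:
    """Average the sparse-code coefficients that belong to each language."""
    scores = np.array([s_vector[atom_language == l].mean() for l in range(n_lang)])
    return scores  # the language with the maximum averaged coefficient is chosen
```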

Proposed ensemble learned-exemplar dictionary based LR system

The proposed approach does not require any random sampling, and therefore there is no chance of obtaining the same feature vector in different dictionaries (i.e., no repeated selection of the same feature vector). The salient steps involved in the proposed system are summarized below: i) a set of m-dimensional WCCN session/channel compensated training i-vectors corresponding to each of the task languages is given; ii) create an exemplar dictionary by choosing k unique i-vectors from each of the L languages.

Experiments with 2007 NIST LRE data

Results and discussion

  • Ensemble exemplar and learned dictionary based LR systems on the
  • Ensemble exemplar and learned dictionary based LR systems on the

In the following, we present the performance evaluation of the ensemble exemplar and ensemble learned-exemplar dictionary based LR systems. This motivated us to explore the proposed ensemble exemplar and ensemble learned-exemplar dictionary based LR approaches on the JFA latent vector as the utterance representation.

Experiments with 2009 NIST LRE data

  • Database
  • Shifted delta cepstral feature
  • Language modeling, classifiers, and dictionary learning
  • Results and discussion

We investigated the SR classifier in all the dictionary-based LR approaches, which include the single exemplar dictionary, single learned dictionary, ensemble of exemplar dictionaries, and ensemble of learned-exemplar dictionaries. The performance of the proposed ensembles of exemplar and learned-exemplar dictionaries using SR for the LR task is summarized in Table 5.3.

Table 5.2: Performances of ensemble of exemplar and learned-exemplar dictionary based LR approaches employing WCCN compensation on JFA latent vector with SRC

Summary

Time delay neural network

The inputs are delayed through the delay units D1 to DN, and the weighted sum of the non-delayed and delayed inputs is calculated and passed through an activation function. This procedure is performed iteratively for all training patterns until the network converges to the desired output.

Figure 6.1: Structure of the time delay neural network basic unit.
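A minimal numpy sketch of such a unit, with the delayed copies realized as a short window over time, is given below; the tanh nonlinearity and the single output per unit are illustrative simplifications.

```python
import numpy as np

def tdnn_unit(x: np.ndarray, weights: np.ndarray, bias: float = 0.0) -> np.ndarray:
    """x: (T, d) input sequence; weights: (N+1, d) for the current frame and
    its N delayed copies. Returns one activation per usable time step."""
    n_delay = weights.shape[0] - 1
    out = np.zeros(x.shape[0])
    for t in range(n_delay, x.shape[0]):
        window = x[t - n_delay : t + 1][::-1]       # frame t and its N delayed copies
        out[t] = np.tanh(np.sum(window * weights) + bias)
    return out
```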

Long short-term memory

This is done to efficiently model the long time context, keeping training times comparable to standard feed-forward DNN.

Bottleneck feature extraction

Extraction of robust statistics

The i-vector extraction using GMM posteriors

The i-vector extraction using DNN posteriors

Capturing of phonetic sequence information

DNN conditioning of sparse coding-based LR system

Experimental setup

  • Database
  • Feature extraction
    • MFCC features
    • Bottleneck features
    • FBanks features
  • PTN system
  • i-vector modeling, classifiers and calibration

Thus, the 60-dimensional features obtained at the output of the bottleneck layer are referred to as BNF. The TDNN, LSTM, and PTN-based LR systems use 40-dimensional FBanks as raw spectral features, extracted from each speech frame.

Results and discussion

The performance of a BNF-based i-vector system using GMM/DNN posteriors with three session/channel compensation techniques is shown in Table 6.2 and Table 6.3, respectively. Table 6.4 reports the LR performance of GMM/DNN posteriors-based i-vector representation on conventional MFCCs and bottleneck features.

Table 6.1: Performance comparison between TDNN, LSTM and PTN based LR systems.

Summary

Compared with the ensemble of exemplar dictionaries, the ensemble of learned-exemplar dictionaries achieves improved performance with a reduced number of dictionaries. In addition, DNN posteriors are also used in extracting the i-vectors.

Table 6.4: Performances of the proposed LR systems employing i-vector based utterance representation derived from the bottleneck features

Salient contributions

An ensemble of exemplar and learned-exemplar dictionaries is also proposed, considering both the i-vector and the JFA latent vector. It should be noted that the performance of the learned-exemplar ensemble with the low-dimensional JFA latent vector is better than that of the contrast LR system, the i-vector with the GG classifier.

Directions for future work

The steps involved in the OMP algorithm are given in Algorithm 1.
Inputs: dictionary D ≡ {di}, i = 1, ..., p, with di ∈ R^m; the target vector y ∈ R^m; and γ, the number of nonzero elements in the sparse solution.
Initialization: set the iteration variable k = 0, the initial solution x0 = 0, and the initial solution support S0 = Support{x0} = ∅, the null set.
i) Element selection: determine the dictionary column that maximizes the correlation with the residual.
ii) Support update: add the index of the selected column to the support, Sk = Sk−1 ∪ {the selected index}.
iii) Update preliminary solution: calculate xk, the minimizer of ‖y − Dx‖₂² subject to Support{x} = Sk.
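A minimal sketch of these OMP steps for a column-normalized dictionary is given below; it follows the listed steps rather than any particular library implementation.

```python
import numpy as np

def omp(D: np.ndarray, y: np.ndarray, gamma: int) -> np.ndarray:
    """D: (m, p) column-normalized dictionary; y: target vector in R^m;
    gamma: number of nonzero elements allowed in the solution."""
    x = np.zeros(D.shape[1])
    support = []                                    # S_0 = empty set
    residual = y.copy()
    for _ in range(gamma):
        # i) element selection: column most correlated with the residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)                       # ii) support update
        # iii) preliminary solution: least squares on the current support
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = y - D @ x                        # residual for the next pass
    return x
```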

Least angle regression

In the sparse coding step, the data vectors are represented as a linear combination of only a few dictionary atoms. In the dictionary update stage, the atoms of the dictionary are modified so that the total representation error for the given set of data vectors is minimized.

D-KSVD dictionary learning

Label consistent-KSVD dictionary learning

The special case of the LC-KSVD problem with β = 0 is called LC-KSVD1, while with non-zero values of the regularization parameters α and β it is called LC-KSVD2. The estimate of the linear classifier Ŵ in the case of LC-KSVD1 is obtained by solving the corresponding regularized least-squares problem.
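A hedged sketch of the usual closed-form (ridge-regression) estimate of such a linear classifier from sparse codes and label indicators is given below; the exact equation solved in the thesis may differ.

```python
import numpy as np

def estimate_classifier(X: np.ndarray, H: np.ndarray, lam: float = 1e-2) -> np.ndarray:
    """Ridge-regression estimate W_hat = H X^T (X X^T + lam*I)^(-1),
    where X holds sparse codes (atoms x samples) and H one-hot labels
    (classes x samples); lam is an illustrative regularization value."""
    k = X.shape[0]
    return H @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(k))
```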

Online dictionary learning

With a fixed dictionary size of 28, the best language detection performance in terms of Cavg is obtained with dictionary sparsity λ′1 = 0.01 and decomposition sparsity λ1 = 0.001, while for language identification the optimal parameters are λ′1 = 0.02 and λ1 = 0.001. Using OL with OMP and the i-vector representation, the best language detection performance is achieved with dictionary sparsity γ′ = 15 and decomposition sparsity γ = 20, as shown in Figure C.5(a), while for the language identification task the sparsity constraints γ′ = 20 and γ = 20 give the best result, as shown in Figure C.5(b).

Figure C.1: Performance of the LR system using the sparse representation of the WCCN compensated i-vector over a K-SVD learned dictionary with tuning of the dictionary size, dictionary sparsity (γ′) and decomposition sparsity (γ). The sparse coding is performed

Parameters tuning in learned dictionary based LR with ENet


Figure C.10: Performance of the LR system using the sparse representation of the JFA supervector over a K-SVD learned dictionary (28 dictionary atoms) with tuning of the regularization coefficient (λ′2) in dictionary learning and the regularization coefficient (λ2)

