
In factor analysis, the i-vectors are assumed to follow a standard normal distribution, w ∼ N(0, I); hence the language population mean vector ω is equal to the null vector.

2.4.3 Within-class covariance normalization

WCCN is another linear transformation method widely used to mitigate session/channel effects in the Gaussian mean supervector and the i-vector [78, 145]. In the WCCN method, the data vectors are transformed by a matrix that minimizes an upper bound on the classification error metric and hence reduces the classification error. The transformation matrix B is obtained by Cholesky decomposition of the inverse of the within-class covariance matrix W as,

W^{-1} = B B^{T} \qquad (2.19)
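A minimal NumPy sketch of how such a projection could be estimated from labelled training i-vectors is given below; the function name wccn_projection and the per-class averaging of covariances are illustrative assumptions rather than details taken from [78, 145].

```python
import numpy as np

def wccn_projection(ivectors, labels):
    """Estimate the WCCN projection B satisfying W^{-1} = B B^T (Eq. 2.19).

    ivectors : (N, d) array of training i-vectors
    labels   : length-N array of class (language) labels
    """
    d = ivectors.shape[1]
    classes = np.unique(labels)
    W = np.zeros((d, d))
    for c in classes:
        X = ivectors[labels == c]
        Xc = X - X.mean(axis=0)          # centre within each class
        W += Xc.T @ Xc / len(X)          # per-class covariance
    W /= len(classes)                    # average over classes
    # Cholesky factor of the inverse within-class covariance
    return np.linalg.cholesky(np.linalg.inv(W))

# Transformed i-vectors: w' = B^T w, e.g. ivectors @ wccn_projection(ivectors, labels)
```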

2.5 Classifiers

In this section, five popularly used classifiers in the speaker/language recognition domain are briefly discussed for comparison. The strengths and weaknesses of each classifier are summarized in Table 2.1. The algorithms are explained in terms of the i-vector based representation of the speech utterances.

2.5.1 Cosine distance scoring

The CDS provides the similarity between two i-vectors. Given the training and test i-vectors (wtrn and wtst), the score is computed as,

\mathrm{Score}(w_{trn}, w_{tst}) = \frac{w_{trn}^{T} w_{tst}}{\|w_{trn}\|_2 \, \|w_{tst}\|_2} \qquad (2.20)

Using Equation 2.20, the scores against all training i-vectors are computed for a particular test i-vector. Finally, the class-wise means of the scores are computed for the final decision.
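The scoring rule in Equation 2.20, together with the class-wise mean pooling described above, can be sketched in a few lines of NumPy; the function name cds_scores and the dictionary output are illustrative assumptions.

```python
import numpy as np

def cds_scores(train_ivectors, train_labels, test_ivector):
    """Cosine distance scoring (Eq. 2.20) followed by class-wise mean pooling."""
    # Length-normalise the training and test i-vectors
    trn = train_ivectors / np.linalg.norm(train_ivectors, axis=1, keepdims=True)
    tst = test_ivector / np.linalg.norm(test_ivector)
    scores = trn @ tst          # cosine similarity to every training i-vector
    # Mean score per class (language) is used for the final decision
    return {c: scores[train_labels == c].mean() for c in np.unique(train_labels)}
```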

2.5.2 Support vector machine

The SVM [139] is a supervised binary classifier. In the training phase, a set of training instance-class label pairs (w_i, c_i), i = 1, 2, ..., N, with w_i ∈ R^m and c_i ∈ {−1, +1}, is used to determine the optimal linear hyperplane H that maximizes the margin between the two classes. The notations w_i and c_i represent the ith i-vector and its class label, respectively. The classification function f associated with

the optimal hyperplane H is given by

f(w) = \sum_{i=1}^{N} \alpha_i c_i \, k(w, w_i) + b \qquad (2.21)

where α_i and b are the Lagrange multipliers and the bias, respectively, which are the SVM parameters obtained in the training step.

Considering the linear kernel function k(w, w_i) = w_i^{T} w, Equation 2.21 can be rewritten as

f(w) = \left( \sum_{i=1}^{N} \alpha_i c_i w_i \right)^{T} w + b = a^{T} w + b \qquad (2.22)

where a = \sum_{i=1}^{N} \alpha_i c_i w_i is the weight vector. For a given test i-vector w_{tst}, the classification is performed by noting the sign of f(w_{tst}). The one-vs-all strategy is used to obtain the scores for all languages. The other three popular kernels used in SVM are as follows:

• radial basis function (RBF): k(w, w_i) = \exp(-\gamma \|w - w_i\|^2), \; \gamma > 0

• polynomial: k(w, w_i) = (\gamma \, w_i^{T} w + r)^{d}, \; \gamma > 0

• sigmoid: k(w, w_i) = \tanh(\gamma \, w_i^{T} w + r), \; \gamma > 0

where \gamma, r, and d are the parameters of the kernels.
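As an illustration, a one-vs-all SVM over i-vectors might be set up with scikit-learn as sketched below; the variable names (train_ivectors, train_labels, test_ivectors) and the choice C = 1.0 are assumptions made for the example, and the kernel argument can be switched to "rbf", "poly", or "sigmoid" to use the kernels listed above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# One-vs-all SVM on i-vectors; with kernel="linear" each per-class score is a^T w + b (Eq. 2.22)
svm = OneVsRestClassifier(SVC(kernel="linear", C=1.0))
svm.fit(train_ivectors, train_labels)            # (N, d) i-vectors and length-N labels
scores = svm.decision_function(test_ivectors)    # one column of scores per language
predicted = svm.classes_[np.argmax(scores, axis=1)]
```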

2.5.3 Generative Gaussian model

In the linear generative Gaussian (GG) model [79], the class-specific i-vectors are used to train a class-dependent multivariate normal distribution N(μ_l, Σ), where the full covariance matrix Σ is shared across all classes. Given the test i-vector w, the log-likelihood score for each class l is computed as

\ln p(w \mid l) \simeq w^{T} \Sigma^{-1} \mu_l - \tfrac{1}{2} \, \mu_l^{T} \Sigma^{-1} \mu_l \qquad (2.23)

where μ_l is the mean vector of class l and Σ is the common covariance matrix. Equation 2.23 is linear in w and hence leads to a linear classifier.
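Equation 2.23 reduces classification to a matrix-vector product, as the short NumPy sketch below illustrates; the function name gg_scores and the argument layout are assumptions made for the example.

```python
import numpy as np

def gg_scores(class_means, shared_cov, w):
    """Linear generative Gaussian scores of Eq. 2.23 for a single test i-vector w.

    class_means : (L, d) array whose rows are the class means mu_l
    shared_cov  : (d, d) covariance matrix Sigma shared by all classes
    """
    P = np.linalg.inv(shared_cov)        # shared precision matrix Sigma^{-1}
    linear_term = class_means @ P @ w    # w^T Sigma^{-1} mu_l for every class l
    quad_term = np.einsum('ld,dk,lk->l', class_means, P, class_means)
    return linear_term - 0.5 * quad_term  # the highest score gives the predicted class
```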

2.5.4 Logistic regression

Given a set of training instance-class label pairs (w_i, c_i), i = 1, 2, ..., N, where w_i ∈ R^m and c_i ∈ {0, 1} are the ith i-vector and its class label, respectively, the probability distribution of the class label c given an i-vector w can be modeled using logistic regression [146] and is given by

p(c = 1 \mid w; \theta) = \sigma(\theta^{T} w) = \frac{1}{1 + \exp(-\theta^{T} w)} \qquad (2.24)


where θ ∈ R^m are the parameters of the logistic regression model and σ(·) is the sigmoid function. The regularized logistic regression model can be described as

\hat{\theta} = \arg\max_{\theta} \; \sum_{i=1}^{N} \log p(c_i \mid w_i; \theta) - \lambda R(\theta) \qquad (2.25)

where R(θ) is the regularization term, and the parameter θ is estimated using Equation 2.25. The choices R(θ) = ‖θ‖_2 and R(θ) = ‖θ‖_1 correspond to l2- and l1-regularized logistic regression, respectively. Once θ is estimated, the decision boundary for classification is the hyperplane where

p(c = 1 \mid w; \theta) = 0.5 \;\Longrightarrow\; \theta^{T} w = 0 \qquad (2.26)

For the multi-class problem, the one-vs-all strategy is used to obtain the scores for all languages.
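For illustration, an l2-regularized one-vs-all logistic regression over i-vectors could be trained with scikit-learn as sketched below; the regularization strength C = 1.0 (which plays the role of 1/λ in Equation 2.25) and the variable names are assumptions made for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One-vs-all l2-regularised logistic regression over i-vectors (Eqs. 2.24-2.26)
mlr = OneVsRestClassifier(LogisticRegression(penalty="l2", C=1.0, max_iter=1000))
mlr.fit(train_ivectors, train_labels)                 # (N, d) i-vectors and length-N labels
language_scores = mlr.predict_proba(test_ivectors)    # one posterior score per language
```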

2.5.5 Gaussian-PLDA

Assume w_k represents the kth i-vector of a language, where k = 1, 2, ..., N. The Gaussian probabilistic linear discriminant analysis (G-PLDA) [147, 148] model assumes that each i-vector w_k can be decomposed into a language component l and a channel component c, expressed as

w_k = l + c = (m + \Phi\phi) + (\Psi\psi_k + \eta_k) \qquad (2.27)

The language and channel components describe the between-language and within-language variability, respectively. The former does not depend on the particular utterance, while the latter is utterance dependent. The columns of Φ and Ψ provide bases for the language subspace (eigenvoices) and the channel subspace (eigenchannels), respectively. The vectors φ and ψ_k are the corresponding latent identity vectors, which have standard normal distributions. The residual term η_k is assumed to be Gaussian with zero mean and diagonal covariance Σ. The global offset is represented by m. In this work, the modified G-PLDA model [149] is used owing to the low dimensionality of the i-vectors. Here, the eigenchannels are removed and Σ is taken to be a full covariance matrix. The modified G-PLDA model is given by

w_k = m + \Phi\phi + \eta_k \qquad (2.28)

The EM algorithm [141] is applied to a large development dataset to obtain ML point estimates of the model parameters {m, Φ, Σ}.
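One way the EM updates for the simplified model of Equation 2.28 can be organised is sketched below in NumPy; the function train_gplda, the random initialisation of Φ, and the fixed number of iterations are illustrative assumptions, not the exact recipe of [141, 149].

```python
import numpy as np

def train_gplda(ivectors, labels, latent_dim, n_iter=20, seed=0):
    """EM sketch for the simplified G-PLDA model w = m + Phi*phi + eta (Eq. 2.28)."""
    rng = np.random.default_rng(seed)
    N, d = ivectors.shape
    m = ivectors.mean(axis=0)
    X = ivectors - m                                  # centred i-vectors
    classes = np.unique(labels)
    Phi = 0.1 * rng.standard_normal((d, latent_dim))  # eigenvoice matrix (random init)
    Sigma = np.cov(X, rowvar=False)                   # full residual covariance
    S = X.T @ X                                       # global second-order statistic

    for _ in range(n_iter):
        P = np.linalg.inv(Sigma)
        A = np.zeros((d, latent_dim))                 # sum_l f_l E[phi_l]^T
        B = np.zeros((latent_dim, latent_dim))        # sum_l n_l E[phi_l phi_l^T]
        for c in classes:
            Xc = X[labels == c]
            n_c = len(Xc)
            f_c = Xc.sum(axis=0)                      # first-order statistic of class c
            # E-step: posterior of the class latent variable phi_l
            L = np.eye(latent_dim) + n_c * Phi.T @ P @ Phi
            Linv = np.linalg.inv(L)
            Ephi = Linv @ Phi.T @ P @ f_c
            A += np.outer(f_c, Ephi)
            B += n_c * (Linv + np.outer(Ephi, Ephi))
        # M-step: update the eigenvoice matrix and the residual covariance
        Phi = A @ np.linalg.inv(B)
        Sigma = (S - Phi @ A.T) / N
    return m, Phi, Sigma
```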

Table 2.1: Comparison among CDS, SVM, MLR, GG, and G-PLDA classifiers

1. CDS
   Pros: Simple and easy to implement.
   Cons: 1. Less optimal in lower-dimensional spaces. 2. Does not work well for correlated classes.

2. SVM
   Pros: 1. SVMs can model non-linear decision boundaries, and there are many kernels to choose from. 2. Fairly robust against over-fitting, especially in high-dimensional spaces.
   Cons: 1. Memory intensive and trickier to tune, owing to the importance of picking the right kernel. 2. Does not scale well to larger datasets.

3. MLR
   Pros: 1. The algorithm can be regularized to avoid over-fitting. 2. Logistic models can be updated easily with new data using stochastic gradient descent.
   Cons: 1. Unstable with well-separated classes. 2. Unstable with few examples. 3. Tends to under-perform when there are multiple or non-linear decision boundaries.

4. GG
   Pros: A good algorithm for the classification of static postures and non-temporal pattern recognition.
   Cons: 1. For computational reasons, it can fail to work if the dimensionality of the problem is too high. 2. The user must set the number of mixture models that the algorithm will try to fit to the training dataset.

5. G-PLDA
   Pros: 1. A probabilistic version of LDA, and so inherits LDA's discriminative nature. 2. A generative model that places a Gaussian prior on the underlying class variable, and so can model classes with very limited training data.
   Cons: Lack of robustness to outliers in the language and channel subspaces.