
Cholesky decomposition of the inverse of the within-class covariance matrix W as,

W^{-1} = B B^T \qquad (2.19)
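As a rough illustration of Eq. (2.19), the Python sketch below computes B as the Cholesky factor of W^{-1} and uses it to decorrelate a set of vectors. The function name, the random data, and the regularization term are hypothetical and only for demonstration.

```python
import numpy as np

def cholesky_of_inverse(W):
    """Return a lower-triangular B such that W^{-1} = B B^T (Eq. 2.19)."""
    return np.linalg.cholesky(np.linalg.inv(W))

# Illustrative use: decorrelate vectors with x' = B^T x (applied row-wise below).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))               # stand-in feature/i-vectors
W = np.cov(X, rowvar=False) + 1e-6 * np.eye(10)  # stand-in within-class covariance
B = cholesky_of_inverse(W)
X_transformed = X @ B                            # each row becomes (B^T x_n)^T
```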

2.3.4 Nuisance attribute projection

In the nuisance attribute projection (NAP) method, the objective is to find the nuisance subspace that represents the channel/session variabilities [56]. The data is projected onto the complementary space of the nuisance subspace for compensation. The projection matrix P is given as,

P = I - R R^T \qquad (2.20)

where R is a rectangular matrix containing the eigenvectors corresponding to a few of the top eigenvalues of the within-class covariance matrix.
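A minimal sketch of Eq. (2.20), assuming the within-class covariance matrix S_w has already been estimated; the helper name and the choice of k are hypothetical.

```python
import numpy as np

def nap_projection(S_w, k):
    """Projection P = I - R R^T, where R holds the k leading eigenvectors of S_w (Eq. 2.20)."""
    eigvals, eigvecs = np.linalg.eigh(S_w)   # eigenvalues returned in ascending order
    R = eigvecs[:, -k:]                      # eigenvectors of the k largest eigenvalues
    return np.eye(S_w.shape[0]) - R @ R.T

# A supervector or i-vector w is compensated as w_comp = nap_projection(S_w, k) @ w.
```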

2.4 Pattern matching and score normalization

The pattern matching module chooses the speaker model according to the input identity claim and computes the verification score of the test data on the selected model. This verification score is a measure of the system's confidence in the hypothesis that the input speech utterance was generated by the claimed speaker. The method of computing the verification score depends on the kind of speaker modeling technique used. The score computation methods commonly used with the various speaker modeling approaches discussed in Section 2.2 are described in the following.

2.4.1 Likelihood ratio scoring with GMM modeling

In SV systems using parametric speaker modeling approaches such as GMM or GMM-UBM, the likelihood of the test data vectors on the claimed speaker's model is used as the verification score.

Usually, the feature vectors of a test utterance X are assumed to be independent and identically distributed (IID), so the likelihood of a claimed speaker model θ_c for a sequence of feature vectors X ≡ {x_1, x_2, . . . , x_N} is computed as,

p(X \mid \theta_c) = \prod_{n=1}^{N} p(x_n \mid \theta_c) \qquad (2.21)

The likelihood on the claimed speaker's model is often normalized by the likelihood on the UBM to obtain the log-likelihood ratio (LLR) score as,

LLR Score = \log \frac{p(X \mid \theta_c)}{p(X \mid \theta_{UBM})}

= \log \frac{\prod_{n=1}^{N} p(x_n \mid \theta_c)}{\prod_{n=1}^{N} p(x_n \mid \theta_{UBM})}

= \sum_{n=1}^{N} \log p(x_n \mid \theta_c) \;-\; \sum_{n=1}^{N} \log p(x_n \mid \theta_{UBM}) \qquad (2.22)
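As a hedged sketch of Eqs. (2.21)-(2.22), the snippet below sums per-frame log-likelihood differences using scikit-learn's GaussianMixture. The two models are simply fitted on random data here, whereas a real system would train the UBM on background data and derive the speaker model, e.g., by MAP adaptation; all names and data are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def llr_score(X_test, gmm_speaker, gmm_ubm):
    """Sum of per-frame log-likelihood ratios, as in Eq. (2.22)."""
    ll_spk = gmm_speaker.score_samples(X_test)   # log p(x_n | theta_c) per frame
    ll_ubm = gmm_ubm.score_samples(X_test)       # log p(x_n | theta_UBM) per frame
    return float(np.sum(ll_spk - ll_ubm))

# Purely illustrative usage with random "features".
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=4, random_state=0).fit(rng.standard_normal((500, 20)))
spk = GaussianMixture(n_components=4, random_state=0).fit(rng.standard_normal((200, 20)) + 0.5)
score = llr_score(rng.standard_normal((100, 20)) + 0.5, spk, ubm)
# The score is often divided by the number of frames to reduce duration effects.
```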

2.4.2 Cosine distance scoring

With GMM mean supervectors or i-vectors as the speaker utterance representations, an SVM can be used for pattern matching [17, 56]. The representations of both the training and test speech utterances are derived, and the SVM classifier is used to compute the verification score. It was later shown in [59] that a simple cosine distance measure between the training and test data representations provides better performance than the SVM classifier for SV tasks. This is because, in most publicly available SV databases such as the NIST SRE datasets, only a few utterances are available to train each speaker. As a result, the training of the SVM becomes suboptimal and it loses its effectiveness in such conditions. The cosine distance score is computed as,

Score = \frac{\langle \hat{w}_{clm}, \hat{w}_{tst} \rangle}{\|\hat{w}_{clm}\|_2 \, \|\hat{w}_{tst}\|_2} \qquad (2.23)

where \hat{w}_{clm} and \hat{w}_{tst} denote the representations of the training and the test utterances, respectively, and the operators have the usual meaning.
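A direct transcription of Eq. (2.23) in Python, assuming the two representations are already channel-compensated vectors:

```python
import numpy as np

def cosine_score(w_clm, w_tst):
    """Cosine similarity between the training and test representations (Eq. 2.23)."""
    return float(np.dot(w_clm, w_tst) /
                 (np.linalg.norm(w_clm) * np.linalg.norm(w_tst)))
```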

2.4.3 Bayesian classifier with probabilistic linear discriminant analysis

Probabilistic linear discriminant analysis (PLDA) [60] is a stochastic approach that has recently been explored for performing the SV task in the i-vector domain [18, 19]. In PLDA, an i-vector w corresponding to a speaker utterance is represented using a generative model as,

w = \rho + Hh + Gg + \varepsilon \qquad (2.24)

where ρ is the global mean of the i-vector population, H and G are the matrices representing the speaker and channel subspaces, respectively, h and g are the speaker and channel factors, respectively, each having a standard normal prior distribution, and ε is the residual factor having a Gaussian prior with zero mean and diagonal covariance. In [18], heavy-tailed priors are assumed for the latent variables to handle the effect of outliers in the data, and the model is known as heavy-tailed PLDA (HPLDA). Later, in [19], a simplified version of PLDA was proposed which ignores the term Gg and assigns standard normal priors to the latent variables. The residual term is modeled with a Gaussian distribution with zero mean and non-diagonal covariance denoted by S. A Gaussianization process consisting of whitening followed by length normalization is performed on the i-vectors prior to modeling and testing. This approach is commonly referred to as Gaussian PLDA (GPLDA) and is reported to provide performance similar to that of HPLDA with much lower complexity.
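A minimal sketch of that Gaussianization step, assuming the whitening mean and covariance are estimated on a development set (both inputs, and the function name, are hypothetical here):

```python
import numpy as np

def gaussianize(w, mu_dev, Sigma_dev):
    """Whiten an i-vector with development-set statistics, then length-normalize it."""
    eigvals, eigvecs = np.linalg.eigh(Sigma_dev)
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T  # Sigma_dev^{-1/2}
    w_white = whitener @ (w - mu_dev)
    return w_white / np.linalg.norm(w_white)                          # unit length
```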

The parameters of the GPLDA model {ρ, H, S} are learned with the EM algorithm on a development dataset. Given the length-normalized i-vectors w_clm and w_tst corresponding to the claimed speaker's training utterance and the test utterance, respectively, the log-likelihood ratio score using the GPLDA model is computed through a hypothesis test as,

LLR Score = \log \frac{p(w_{clm}, w_{tst} \mid \Gamma_s)}{p(w_{clm} \mid \Gamma_d)\, p(w_{tst} \mid \Gamma_d)}

= \log \frac{\int p(w_{clm}, w_{tst}, h_{clm} \mid \rho, S)\, dh_{clm}}{\int p(w_{clm}, h_{clm} \mid \rho, S)\, dh_{clm}\; \int p(w_{tst}, h_{tst} \mid \rho, S)\, dh_{tst}}

= \log \frac{\int p(w_{clm}, w_{tst} \mid h_{clm}, \rho, S)\, p(h_{clm})\, dh_{clm}}{\int p(w_{clm} \mid h_{clm}, \rho, S)\, p(h_{clm})\, dh_{clm}\; \int p(w_{tst} \mid h_{tst}, \rho, S)\, p(h_{tst})\, dh_{tst}}

= \log \frac{\mathcal{N}\left( [w_{clm}^T\; w_{tst}^T]^T \mid [\rho^T\; \rho^T]^T,\; \tilde{H}\tilde{H}^T + \tilde{S} \right)}{\mathcal{N}\left( [w_{clm}^T\; w_{tst}^T]^T \mid [\rho^T\; \rho^T]^T,\; \mathrm{diag}\{HH^T + S,\, HH^T + S\} \right)} \qquad (2.25)

where Γ_s is the hypothesis that both w_clm and w_tst share the same speaker identity latent variable h, while Γ_d is the hypothesis that w_clm and w_tst are generated by different speaker identity latent variables h_1 and h_2. \tilde{H} = [H^T\; H^T]^T and \tilde{S} = \mathrm{diag}\{S, S\}, where the operator \mathrm{diag}\{\cdot,\cdot\} creates a block diagonal matrix by placing its arguments along the diagonal. The final closed-form solution for the LLR score is derived as,

LLR Score = C + w_{clm}^T Q\, w_{clm} + w_{tst}^T Q\, w_{tst} + 2\, w_{clm}^T P\, w_{tst} \qquad (2.26)

where

C = \tfrac{1}{2}\ln|D_0| - \tfrac{1}{2}\ln|D_1| + \rho^T (P + Q)\, \rho

P = \Delta^{-1}\Phi\,(\Delta - \Phi\Delta^{-1}\Phi)^{-1}

Q = \Delta^{-1} - (\Delta - \Phi\Delta^{-1}\Phi)^{-1}

\Delta = HH^T + S, \qquad \Phi = HH^T

D_0 = \begin{bmatrix} \Delta & \Phi \\ \Phi & \Delta \end{bmatrix}, \qquad D_1 = \begin{bmatrix} \Delta & 0 \\ 0 & \Delta \end{bmatrix}
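The closed-form score translates almost directly into code. The sketch below follows Eq. (2.26) and the definitions of C, P, Q, Δ, Φ, D_0 and D_1 above, assuming trained GPLDA parameters ρ, H, S and Gaussianized i-vectors; it is an illustration rather than a reference implementation, and the log-determinants are computed with slogdet for numerical stability.

```python
import numpy as np

def gplda_llr_score(w_clm, w_tst, rho, H, S):
    """Closed-form GPLDA log-likelihood ratio, following Eq. (2.26)."""
    Phi = H @ H.T                                    # speaker (across-class) covariance
    Delta = Phi + S                                  # total covariance
    Delta_inv = np.linalg.inv(Delta)
    M = np.linalg.inv(Delta - Phi @ Delta_inv @ Phi)
    P = Delta_inv @ Phi @ M
    Q = Delta_inv - M
    D0 = np.block([[Delta, Phi], [Phi, Delta]])
    D1 = np.block([[Delta, np.zeros_like(Delta)],
                   [np.zeros_like(Delta), Delta]])
    C = (0.5 * np.linalg.slogdet(D0)[1]
         - 0.5 * np.linalg.slogdet(D1)[1]
         + rho @ (P + Q) @ rho)
    return C + w_clm @ Q @ w_clm + w_tst @ Q @ w_tst + 2.0 * w_clm @ P @ w_tst
```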
