Typical SV systems use short-term features for the parameterization of speech and employ either a generative or a discriminative model for classification. The most successful short-term features are mel frequency cepstral coefficients (MFCC) [2] and perceptual linear predictive cepstral coefficients (PLPCC) [3]. For modeling the speakers, a Gaussian mixture model (GMM) [4] is most commonly used. In this approach, typically a few thousand Gaussian components are used to capture the distribution of the feature vectors of a given speaker, and its parameters are estimated from the available training data. For verifying a claim, the likelihood of the test feature vectors is computed over the GMM that corresponds to the claimed speaker. The speaker models are usually derived by adapting the mean parameters of a universal background model (UBM) [5] with the speaker training data. The UBM is a GMM trained on a sufficient amount of speech data from a large number of speakers following the maximum likelihood (ML) approach; it thus accounts for the speaker-independent distribution of speech feature vectors. The UBM mean parameters are usually adapted to the speaker-dependent data using the maximum a posteriori (MAP) approach. This speaker modeling approach
is generally termed GMM-UBM for short.
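A minimal sketch of this recipe, assuming diagonal-covariance components, is given below: the UBM mean of each component is interpolated with the maximum-likelihood estimate obtained from the speaker data using a data-dependent adaptation coefficient controlled by a relevance factor, and the verification score is the average per-frame log-likelihood ratio between the adapted model and the UBM. The function names and the relevance-factor value are illustrative, not taken from [4, 5].

```python
import numpy as np

def component_logliks(X, weights, means, covs):
    """log(w_c * N(x | mean_c, diag(cov_c))) for every frame in X and every component c."""
    out = np.empty((X.shape[0], len(weights)))
    for c, (w, m, v) in enumerate(zip(weights, means, covs)):
        diff = X - m
        out[:, c] = (np.log(w)
                     - 0.5 * np.sum(np.log(2.0 * np.pi * v))
                     - 0.5 * np.sum(diff ** 2 / v, axis=1))
    return out

def map_adapt_means(X, weights, means, covs, relevance=16.0):
    """Relevance-MAP adaptation of the UBM mean vectors using speaker data X (frames x dim)."""
    ll = component_logliks(X, weights, means, covs)
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)              # component responsibilities
    n_c = post.sum(axis=0)                                # zeroth-order statistics
    f_c = post.T @ X                                      # first-order statistics
    alpha = n_c / (n_c + relevance)                       # data-dependent adaptation coefficients
    ml_means = f_c / np.maximum(n_c, 1e-10)[:, None]      # per-component ML mean estimates
    return alpha[:, None] * ml_means + (1.0 - alpha)[:, None] * means

def llr_score(X_test, weights, ubm_means, spk_means, covs):
    """Average per-frame log-likelihood ratio of the adapted speaker model against the UBM."""
    def avg_loglik(means):
        ll = component_logliks(X_test, weights, means, covs)
        m = ll.max(axis=1, keepdims=True)
        return float((m + np.log(np.exp(ll - m).sum(axis=1, keepdims=True))).mean())
    return avg_loglik(spk_means) - avg_loglik(ubm_means)
```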
The support vector machine (SVM) is a discriminative classifier that is also widely used for speaker verification [6, 7]. The SVM is a binary classifier which uses a separating hyperplane as the decision boundary between the target speaker and the imposter (non-target speaker) population.
The separating hyperplane that maximizes the margin of separation between these two classes is learned in the training phase. In the testing phase, the decision is taken based on the distance of the test vector from the hyperplane that corresponds to the claimed speaker. For developing an SVM based SV system, an appropriate kernel function is used to map speech utterances of arbitrary duration to a fixed, high-dimensional kernel space in which the two classes are easier to separate with a hyperplane. In fact, the linear hyperplane in the high-dimensional kernel space corresponds to a nonlinear decision boundary in the input feature vector space. There exist a number of kernels [1], which can be grouped into parametric and derivative ones. The Fisher kernel [8], the GMM-supervector kernel [9], the maximum likelihood linear regression (MLLR) kernel [10]
and the cluster adaptive training (CAT) based kernel [11] are a few examples, among which the GMM-supervector kernel [12] is the most commonly used. A GMM supervector is derived by concatenating the mean parameters of all components of the speaker-adapted GMM-UBM model to form a high-dimensional vector. Apart from individual kernel based approaches, the combination of multiple kernel functions has also been explored to improve the performance [13].
SVM based SV systems have generally been found to outperform the traditional GMM-UBM likelihood based approaches.
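As an illustration of the supervector pipeline described above, the following sketch stacks the MAP-adapted means into a GMM supervector (here scaled by the component weights and covariances, one common normalization associated with the KL-divergence supervector kernel) and trains a per-speaker linear SVM against a background (impostor) set. The scaling choice and the scikit-learn based training are assumptions rather than the exact recipes of [9, 12].

```python
import numpy as np
from sklearn.svm import SVC

def gmm_supervector(adapted_means, weights, covs):
    """Stack the speaker-adapted means into one high-dimensional vector.
    The sqrt(w_c) / sqrt(cov_c) scaling is one common (KL-divergence) normalization."""
    scaled = np.sqrt(weights)[:, None] * adapted_means / np.sqrt(covs)
    return scaled.ravel()

def train_speaker_svm(target_svs, background_svs):
    """Train a linear SVM separating the target speaker from a background (impostor) set."""
    X = np.vstack([target_svs, background_svs])
    y = np.hstack([np.ones(len(target_svs)), -np.ones(len(background_svs))])
    return SVC(kernel="linear").fit(X, y)

# Verification score: signed distance of the test supervector from the hyperplane, e.g.
#   score = svm.decision_function(test_sv.reshape(1, -1))[0]
```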
Exploiting the fixed-dimensional representation of utterances provided by GMM supervectors, a number of session/channel variability compensation methods such as nuisance attribute projection (NAP) [14], linear discriminant analysis (LDA), within-class covariance normalization (WCCN) [15], and joint factor analysis (JFA) [16] have also been developed. The majority of current SV systems use a low-dimensional representation of GMM supervectors called i-vectors [17], derived using factor analysis, for representing the speaker utterances. The earlier i-vector based SV systems [17] used a simple cosine distance based classifier along with NAP/WCCN and LDA for channel compensation. Later, in [18], a Bayesian approach referred to as probabilistic linear discriminant analysis (PLDA), which performs simultaneous channel compensation and classification, was proposed. The PLDA based SV system uses a generative approach to model the speaker and channel subspaces in the i-vector domain. The hyper-parameter matrices representing the
speaker and channel subspaces are learned offline using labeled development data. Given an i-vector representing the training utterance of a target speaker, the latent variables corresponding to the speaker and channel subspaces are estimated. The likelihood of the test data i-vector is computed from the marginal density function obtained by integrating out the channel factors. In [18], heavy-tailed priors are assumed for the latent variables to handle the non-Gaussian nature of the i-vectors, and the resultant model is known as heavy-tailed PLDA (HPLDA). In the simplified Gaussian PLDA (GPLDA) model proposed in [19], a radial Gaussianization (whitening) followed by length normalization is applied to the i-vectors, and Gaussian priors are assumed for the latent variables. The GPLDA approach is reported to be much faster than HPLDA without any degradation in performance and is currently the most commonly pursued approach.
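The preprocessing step of the GPLDA recipe can be sketched as follows: the i-vectors are whitened using the mean and covariance estimated on a development set (radial Gaussianization) and then projected onto the unit sphere (length normalization). The eigendecomposition-based whitening and the variable names are illustrative choices, not necessarily those of [19].

```python
import numpy as np

def whiten_and_length_normalize(ivectors, dev_ivectors):
    """Radial Gaussianization (whitening) with development-set statistics,
    followed by length normalization of the i-vectors (rows of `ivectors`)."""
    mu = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors - mu, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)                   # covariance is symmetric PSD
    W = eigvec @ np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-10))) @ eigvec.T
    x = (ivectors - mu) @ W                                # whitened i-vectors
    return x / np.linalg.norm(x, axis=1, keepdims=True)    # project onto the unit sphere
```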
In the last few years, considerable interest has been generated in the sparse representation (SR) of signals, which has provided new directions to signal processing research [20]. Sparse representation denotes the process of computing the sparsest solution among the infinitely many solutions available for a redundant system of linear equations. In the sparse representation literature, the coefficient matrix of the linear system of equations is generally termed the dictionary, and the columns of the dictionary are called atoms. SR has been exploited successfully for various signal processing tasks such as de-noising and compression. Recently, the discriminative abilities of SR have also been exploited in various areas of applied pattern recognition. Signal classification using SR was first proposed by Huang et al. [21] and later exploited by Wright et al. [22] for the face recognition task. In SR based classification, an exemplar dictionary is created by arranging the training examples corresponding to all classes in the task as atoms. A target vector is then approximated as a sparse linear combination of the atoms of the dictionary. The class assignment is done by comparing the absolute sums of the sparse coefficients corresponding to the atoms belonging to different classes. Motivated by the excellent performance reported for the face recognition task, SR based classification is explored for the speaker identification task in [23] using GMM supervectors as speaker representations. Later, in [24], the same approach is extended to the SV task using a set of background (non-target) speakers. For the verification of each claim, a dictionary is created using the examples of the claimed speaker along with those of the background speakers, and the test vector is sparse coded with that dictionary. In the sparse code, the coefficients corresponding to the claimed speaker atoms are compared to those corresponding to the imposter speaker atoms with a suitable metric [25]. In [26, 27], the same idea is explored with i-vectors as the speaker