

An ensemble learning framework based on random sampling of the PCA subspace has been proposed for face recognition [185] and speaker verification [186–188]. The idea was to employ LDA or Fishervoice classifiers in randomly sampled PCA subspaces of the high dimensional vector representation employed. In practice, the use of PCA is restricted by the high computational complexity involved, whereas with random sampling there is always a chance of losing some useful information. It has also been shown that the constructed LDA classifier is often biased and unstable when the PCA subspace dimension is relatively high [185].

In this work, we propose a sequential partitioning of the randomly permuted supervector to derive low dimensional subvectors (slices). The subspaces created in this manner are expected to be more balanced than those obtained by sequential partitioning of the supervector without randomization. Note that, in the proposed approach, the information in the supervector simply gets distributed into the subvectors without any loss, unlike in the random sampling approach. Also, the proposed approach does not involve any learning for the creation of the subspaces, unlike the PCA or i-vector approaches. LDA followed by WCCN channel compensation is then applied to each low dimensional random subspace to mitigate the session/channel variability. The compensated, reduced dimensional subvectors are concatenated to form a single vector and CDS is used to find the similarity score.
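
To make the lossless redistribution concrete, the following NumPy sketch (illustrative only; the toy dimensions and variable names are our own) contrasts random permutation followed by sequential partitioning with random subspace sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 12, 3                       # toy supervector length and number of slices
s = rng.standard_normal(M)         # stands in for a centered GMM mean supervector

# Proposed scheme: random permutation followed by sequential partitioning.
perm = rng.permutation(M)
slices = np.split(s[perm], N)      # every coordinate of s lands in exactly one slice

# Losslessness: undoing the permutation recovers the supervector exactly.
recovered = np.empty(M)
recovered[perm] = np.concatenate(slices)
assert np.allclose(recovered, s)

# Random subspace sampling, in contrast, retains only a subset of the coordinates.
sampled = s[rng.choice(M, size=M // N, replace=False)]
```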

It is to be noted that, though a large number of languages are spoken across the world, the currently available public language databases cover only a small set of them. As a result, the language-specific information in JFA can be captured without restricting the language space.

Thus, we argue that in the LR task there will be an insignificant loss of language-specific information with JFA modeling, unlike that reported for the speaker verification task. Owing to the small number of language factors involved, the complexity of the JFA based LR system happens to be quite low compared to the i-vector based one. Exploiting this fact, we have also explored the combination of the proposed ensemble approach with JFA.

4.3 Proposed ensemble of subspaces based language recognition

The main motivation of the work reported in this chapter lies in the development of an LR system having low computational complexity and storage requirements. To achieve this, we have explored an ensemble of subvectors based LR approach. It is assumed that the GMM mean shifted supervector based representation of the utterances is available. Each of the given supervectors is then sliced into subvectors of suitable length. Prior to slicing, the supervectors can also be randomly permuted.

Figure 4.1: Flow diagram of the proposed ensemble of subspaces based LR system ($\tilde{s}$: centered GMM mean supervector, $\tilde{s}_N$: $N$th partitioned subvector, CC: session/channel compensation using LDA/WCCN transformation, CDS: cosine distance scoring).

For session/channel compensation, linear transformations such as LDA and WCCN are applied to the obtained subvectors. The compensated subvectors are then concatenated to form a single vector. Cosine kernel scoring is used for the recognition purpose. Figure 4.1 outlines the proposed ensemble of subspaces based LR system. The algorithm of the proposed LR approach and a brief description of the partitioning stage are presented in the following subsections.
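
As an illustration of the per-subvector compensation (the CC blocks in Figure 4.1), the minimal sketch below fits LDA followed by WCCN on one subvector stream, assuming labelled development subvectors are available; all function and variable names are ours rather than part of the described system.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_compensation(X, y, lda_dim):
    """Fit LDA followed by WCCN on one subvector stream.

    X: (n_utterances, K) array of subvectors, y: array of language labels."""
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    Z = lda.fit_transform(X, y)                      # project K -> lda_dim dimensions

    # WCCN: whiten by the inverse of the averaged within-class covariance of Z.
    labels = np.unique(y)
    W = np.zeros((lda_dim, lda_dim))
    for lang in labels:
        Zc = Z[y == lang] - Z[y == lang].mean(axis=0)
        W += Zc.T @ Zc / len(Zc)
    W /= len(labels)
    B = np.linalg.cholesky(np.linalg.inv(W))         # W^{-1} = B B^T
    return lda, B

def compensate(x, lda, B):
    """Apply the learned LDA/WCCN compensation to a single K-dimensional subvector."""
    return lda.transform(x[None, :])[0] @ B          # equals B^T (LDA-projected x)
```

In practice, such a compensation would be trained separately for each of the N subvector streams using the development data.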

4.3.1 Algorithm

The steps in the proposed ensemble based LR approach are as follows:

(i) Given the language-independent GMM-UBM, generate the zeroth- and first-order statistics for the utterance.

(ii) Using the statistics, create a GMM mean supervector and remove the bias from it by subtracting the UBM mean supervector.

(iii) Randomly permute the derived GMM mean shifted supervectors, if required.

(iv) Partition the supervector $\tilde{s} \in \mathbb{R}^{M}$ into $N$ subvectors. The derived subvectors $\tilde{s}_i,\ i = 1, 2, \ldots, N$ are of length $K$ each, such that $M = K \times N$.


(v) Apply suitable session/channel compensation on each of the subvectors $\tilde{s}_i$. Let $\tilde{s}_i^{c} \in \mathbb{R}^{P}$ denote the corresponding compensated subvector of length $P \leq K$.

(vi) Concatenate the session/channel compensated subvectors $\{\tilde{s}_i^{c}\}_{i=1}^{N}$ to form a single vector $x$ of length $P \times N$.

(vii) Given the training vector $x_{\mathrm{trn}}$ and the test vector $x_{\mathrm{tst}}$, compute the cosine distance score as
\[
\mathrm{Score} = \frac{x_{\mathrm{trn}}^{T}\, x_{\mathrm{tst}}}{\lVert x_{\mathrm{trn}} \rVert\, \lVert x_{\mathrm{tst}} \rVert} \qquad (4.1)
\]
where $x_{\mathrm{trn}}^{T}$ denotes the transpose of $x_{\mathrm{trn}}$.

(viii) Repeat step (vii) for each of the training vectors keeping the test vector fixed. Finally, the language-specific scores are obtained by computing the mean of the scores belonging to each particular language class, as sketched in the example below.
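
The following is a minimal sketch of the scoring stage, steps (vi)–(viii), assuming the compensated subvectors are already available; all names below are illustrative rather than part of the described system.

```python
import numpy as np

def concatenate_subvectors(compensated_subvectors):
    """Step (vi): stack the N compensated P-dimensional subvectors into one vector."""
    return np.concatenate(compensated_subvectors)

def cosine_distance_score(x_trn, x_tst):
    """Step (vii), Eq. (4.1): cosine distance between training and test vectors."""
    return float(x_trn @ x_tst) / (np.linalg.norm(x_trn) * np.linalg.norm(x_tst))

def language_scores(x_tst, train_vectors, train_labels):
    """Step (viii): average the scores of the test vector over each language class."""
    scores = {}
    for lang in set(train_labels):
        per_class = [cosine_distance_score(x, x_tst)
                     for x, lab in zip(train_vectors, train_labels) if lab == lang]
        scores[lang] = sum(per_class) / len(per_class)
    return scores
```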

4.3.2 Partitioning of supervector

Consider a GMM supervector $\grave{s}$ of length $CD$, where $C$ is the number of Gaussian mixtures and $D$ is the dimensionality of the feature vector. The mean supervector is created by stacking the mean vectors of the MAP adapted GMM-UBM in a row-wise fashion, such that the first $D$ entries correspond to the 1st mixture and so on up to the $C$th mixture. Furthermore, in order to reduce the bias due to the UBM, the language-independent UBM mean supervector $\mu$ is subtracted from the GMM supervector $\grave{s}$, thus creating the centered GMM mean supervector $\tilde{s}$.
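
As a concrete illustration of this construction, the short sketch below assumes the MAP-adapted and UBM mean vectors are available as $C \times D$ arrays; the function and variable names are ours.

```python
import numpy as np

def centered_supervector(adapted_means, ubm_means):
    """Stack the C MAP-adapted mean vectors (each of dimension D) row-wise and
    subtract the UBM mean supervector, yielding the centered supervector of length C*D.

    adapted_means, ubm_means: arrays of shape (C, D)."""
    s = adapted_means.reshape(-1)    # first D entries belong to mixture 1, and so on
    mu = ubm_means.reshape(-1)
    return s - mu                    # centered GMM mean supervector
```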

For slicing the GMM mean shifted supervector, we can follow two schemes as described below:

• Sequential slicing: The given supervector is partitioned sequentially to create the slices of appropriate length. As the entries in the supervector are structured, the resulting slices would exhibit some finite structures depending on the chosen length of the slices.

• Sequential slicing of randomized supervector: The supervector elements are first permuted randomly prior to partitioning sequentially to create the slices. The randomization of supervector helps in reducing any definite structure embedded in the resulting slices.

Out of these schemes, the sequential slicing of the randomized supervector would result in a more diverse feature selection in the subvectors and hence is expected to outperform the other. These two partitioning methods are illustrated in Figure 4.2.
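
The two schemes can also be reproduced on a toy supervector of the size used in Figure 4.2 ($C = 5$, $D = 3$, hence $CD = 15$, sliced into subvectors of length $K = 5$); the code below is only an illustration with assumed names.

```python
import numpy as np

C, D, K = 5, 3, 5                       # GMM-UBM size, feature dimension, slice length
s = np.arange(C * D, dtype=float)       # stands in for a 15-dimensional supervector
N = (C * D) // K                        # number of slices, here N = 3

# Sequential slicing: each slice inherits the mixture-wise ordering of the supervector.
sequential_slices = np.split(s, N)

# Sequential slicing of the randomly permuted supervector: the ordering is broken up,
# so each slice draws its entries from across the whole supervector.
rng = np.random.default_rng(0)
perm = rng.permutation(C * D)
randomized_slices = np.split(s[perm], N)
```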


Figure 4.2: Illustration of two different ways of partitioning the supervector. (a) Given 15-dimensional GMM mean supervector corresponding to GMM-UBM size $C = 5$ and feature vector dimension $D = 3$, which is to be partitioned into slices of length $K = 5$, (b) sequential slicing, (c) sequential slicing of the randomly permuted supervector.