
7.2 SR System using Excitation Information

7.2.1 Block Diagram of the Proposed Speaker Recognition System


Figure 7.1: Block diagram of the proposed speaker recognition system using excitation information.

If the speech signals available for training and testing are collected in clean conditions, the cepstral mean subtraction (CMS) may not be useful.

The RMFCC is concatenated with its ∆ and ∆∆ values to incorporate the temporal dynamics of the segmental excitation energy. In this work, the resulting feature is referred to as the RMFCC + ∆ + ∆∆ feature.

The MPDSS and RMFCC + ∆ + ∆∆ features are used independently for building the speaker models. Their individual evidences are added to represent improved segmental excitation information. In this work, the combination of the RMFCC + ∆ + ∆∆ and MPDSS features to represent the segmental level information is abbreviated as Src6. For building the speaker models from the suprasegmental pitch and epoch strength contour excitation information, the T0 and A0 vectors are computed as described in Section 6.2.1. For building the speaker models from the suprasegmental level cepstral trajectory excitation information, the ct1, ct2, ct4 and ct6 trajectory vectors are extracted from the RMFCC as described in Section 6.3. The pitch, epoch strength and cepstral trajectory vectors are used independently for building the speaker models. The complete suprasegmental excitation information is represented by combining their individual evidences.


Src5 is abbreviated as Src. The Src represents the excitation information for the proposed speaker verification system.
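As a rough illustration of how the RMFCC + ∆ + ∆∆ feature described above can be assembled, the following sketch applies the standard delta-regression formula to a frame-wise RMFCC matrix; the matrix shape, window size and function name are illustrative assumptions rather than the exact settings used in the thesis.

```python
import numpy as np

def deltas(feat, N=2):
    """Delta regression over a (frames x coefficients) feature matrix."""
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n:N + n + len(feat)] - padded[N - n:N - n + len(feat)])
               for n in range(1, N + 1)) / denom

# rmfcc: hypothetical (frames x coefficients) matrix of residual MFCCs
rmfcc = np.random.randn(200, 13)
d1 = deltas(rmfcc)                       # delta
d2 = deltas(d1)                          # delta-delta
rmfcc_d_dd = np.hstack([rmfcc, d1, d2])  # RMFCC + delta + delta-delta feature
```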

For building the speaker models, the GMM-UBM modeling technique is employed. The UBM is a GMM built on a large amount of data collected from several speakers, mostly not included in the target set [35]. The UBM represents the speaker-independent distribution of the feature vectors; it serves as the average speaker model and can also be used as an imposter model for the speaker verification task [29]. In [35], it was shown to be advantageous to train two separate background models, one for male and one for female speakers, and then build a single UBM by pooling the two models. The target speaker GMM is then adapted from this single UBM. In practice, adapting only the means is found to work better than adapting all three parameters [35, 36, 65]. Therefore, in most existing speaker recognition systems, only the means are adapted, and the weights and variances of the speaker model remain unaltered. The adaptation technique is described in Appendix C. For each target speaker, separate models are built using the different features; for example, in the proposed system, each target speaker has nine separate target and background models.
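The following is a minimal sketch of this GMM-UBM scheme with means-only adaptation, using scikit-learn's GaussianMixture. The mixture order, relevance factor and data shapes are illustrative assumptions, and the relevance-MAP update follows the usual formulation rather than the exact recipe of Appendix C.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feats, n_mix=256):
    """Fit the UBM on background data pooled from many non-target speakers."""
    ubm = GaussianMixture(n_components=n_mix, covariance_type="diag", max_iter=100)
    ubm.fit(background_feats)
    return ubm

def adapt_means(ubm, target_feats, relevance=16.0):
    """MAP-adapt only the means of the UBM towards one target speaker."""
    post = ubm.predict_proba(target_feats)        # responsibilities, (frames x mixtures)
    n_k = post.sum(axis=0)                        # soft counts per mixture
    e_k = post.T @ target_feats                   # first-order statistics, (mixtures x dim)
    alpha = (n_k / (n_k + relevance))[:, None]    # data-dependent adaptation coefficients
    new_means = alpha * (e_k / np.maximum(n_k, 1e-10)[:, None]) + (1.0 - alpha) * ubm.means_

    speaker = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    speaker.weights_ = ubm.weights_               # weights and variances stay unaltered
    speaker.covariances_ = ubm.covariances_
    speaker.precisions_cholesky_ = ubm.precisions_cholesky_
    speaker.means_ = new_means                    # only the means are speaker specific
    return speaker
```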

At the time of testing, the features are extracted from the input test speech signal in the same way as in the training phase and compared with the corresponding target model for each feature. The decision is taken based on the LLR scores assigned to the corresponding target models. The LLR, which depends on both the target model and the background model, is given by

LLR = log P(λc) − log P(λu)    (7.1)

where P(λc) and P(λu) are the likelihoods given by the claimed speaker model and the UBM, respectively.
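Continuing the hypothetical models from the sketch above, the verification decision of Eq. (7.1) reduces to a difference of average per-frame log-likelihoods; the threshold value is an illustrative assumption.

```python
def verify(test_feats, speaker_model, ubm, threshold=0.0):
    """Accept the claim when the log-likelihood ratio exceeds the threshold."""
    # GaussianMixture.score returns the average log-likelihood per frame
    llr = speaker_model.score(test_feats) - ubm.score(test_feats)
    return llr, llr > threshold
```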

In the combination stage, the speaker-specific evidences from the different excitation features are combined to achieve improved recognition performance. Schemes for combining evidences from multiple levels may be broadly grouped into three categories, namely abstract-level, rank-level and measurement-level combination [74]. The majority of speaker recognition systems that use information from multiple levels employ measurement-level combination [74, 95].



This is because, among the three combination schemes, the measurement level contains the maximum amount of information and the abstract level the least [95]. Further, whenever the recognition performances of the different levels are not at par with each other, the abstract-level and rank-level combination schemes give relatively poor performance [95]. This is indeed the case for the different excitation features, as observed from the results presented in the previous chapters. Thus, in this work we prefer the measurement-level combination scheme for combining the speaker-specific evidences from the different excitation features. Of course, the measurement level also has a limitation when some of the systems perform poorly; this is handled as described next.

As mentioned in the introduction section, the excitation information is extracted at the subsegmental, segmental and suprasegmental levels. At some levels, namely the segmental and suprasegmental levels, different features are combined to represent the complete information of that level.

So, to model the complete excitation information, there are two approaches that may be used for combining the evidences from all three levels. In one approach, the speaker-specific evidences from all features of all levels are combined directly. Alternatively, the features of each individual level are combined first, and these combined evidences are then further combined to obtain the overall excitation information evidence. The latter approach seems to give better performance when some of the systems perform poorly. From the experimental results presented in the previous chapters, we can observe that the performances of the individual features differ significantly. For example, the verification performances of the MPDSS and A0 feature vectors on the whole NIST-03 database with a GMM-based speaker recognition system are 32.25% and 49.25%, respectively. Due to this large difference in performance, weighting the LLR of the poorer feature may further reduce its significance. As a result, the effectiveness of the poorly performing feature may be diminished by the well performing features. On the other hand, under the same conditions, the best speaker verification performances achieved by the subsegmental, segmental and suprasegmental level excitation information are 39.79%, 32.33% and 39.47%, respectively. These are almost at par with each other, so their respective weighting factors may be proportional. As a result, the effectiveness of the speaker-specific information from each level of the excitation can be properly manifested in the combined representation. We therefore suggest that the evidences be combined first within the segmental and suprasegmental levels; the combined evidence from each of these levels may then be further combined with the subsegmental level evidence to represent improved excitation information.

With this combination scheme, it is expected that the proposed speaker recognition system may achieve better recognition performance.
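A minimal sketch of this hierarchical measurement-level combination is shown below; the LLR values and the equal weights are purely illustrative assumptions, not results or settings from the thesis.

```python
import numpy as np

def combine(scores, weights=None):
    """Weighted sum of LLR scores from individual systems (equal weights by default)."""
    scores = np.asarray(scores, dtype=float)
    w = np.full(len(scores), 1.0 / len(scores)) if weights is None else np.asarray(weights, dtype=float)
    return float(np.dot(w, scores))

# hypothetical per-feature LLRs for one verification trial
llr_subseg = 0.8                                         # subsegmental evidence
llr_seg = combine([1.1, 0.4])                            # RMFCC+delta+delta-delta and MPDSS (Src6)
llr_supra = combine([0.2, 0.5, 0.3, 0.6, 0.4, 0.1])      # T0, A0 and cepstral trajectory evidences
excitation_llr = combine([llr_subseg, llr_seg, llr_supra])  # overall excitation evidence
accept = excitation_llr > 0.0
```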
