
Limited Data Speaker Recognition


Academic year: 2023


3.7 Performance of the speaker recognition system based on SFSR and MFSR analysis techniques for test data of different sizes for the first 30 speakers from the YOHO database. 3.9 Performance of the speaker recognition system based on SFSR and MFSR analysis techniques for test data of different sizes for the first 30 speakers from the YOHO database.

Objective of the Thesis

Importance of Speaker Recognition

Nature of Speech for Speaker Recognition

Classifications of Speaker Recognition

  • Speaker Identification
    • Closed-set
    • Open-set
  • Speaker Verification
  • Text-dependent
  • Text-independent

In this case, the closed-set speaker identification system first identifies the speaker whose model is closest to the test speech data. If the distance from the test speech data to that target model is below a threshold, the speaker is accepted as the true speaker.
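The decision rule described above can be sketched as follows. This is a minimal illustration, assuming codebook-style speaker models and Euclidean distance; the function name and threshold handling are illustrative, not taken from the thesis.

```python
import numpy as np

def identify_speaker(test_features, speaker_models, threshold=None):
    """Closed-set identification: pick the speaker model closest to the
    test features. If a threshold is given, the best match is accepted
    only when its distance falls below the threshold."""
    distances = {}
    for name, codebook in speaker_models.items():
        # average distance from each test vector to its nearest codevector
        d = np.linalg.norm(test_features[:, None, :] - codebook[None, :, :],
                           axis=2)
        distances[name] = d.min(axis=1).mean()
    best = min(distances, key=distances.get)
    if threshold is not None and distances[best] > threshold:
        return None  # rejected as an impostor
    return best
```

With no threshold the function always returns the closest speaker (closed-set identification); supplying a threshold adds the accept/reject step.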

Figure 1.1: Block diagram of closed-set speaker identification system.

Components of Speaker Recognition

  • Analysis
  • Feature extraction
  • Modelling
  • Testing

The relevance of these analysis techniques for speaker recognition under data-limited conditions needs to be tested. The quality of the speaker model depends on the quality and quantity of the characteristic vectors.

Issues in Speaker Recognition

Motivation for the Present Work

Applications of Limited Data Speaker Recognition

In such cases, the amount of speech data available for training and testing a speaker can be as small as a few seconds. Moreover, sending large amounts of speech data over the transmission media is not feasible and may increase communication costs.

Organization of the Thesis

In Chapter 6, integrated systems for speaker recognition under limited data conditions are proposed and demonstrated. The overall performance of the speaker recognition system depends on the techniques used in the analysis, feature extraction, modeling and testing stages.

Speech Analysis Techniques

Therefore, this chapter will provide an overview of some of the approaches developed in the field of speaker recognition. Studies have been done on segmental analysis used to extract vocal tract information for speaker recognition.

Feature Extraction Techniques

This study shows that the reflection coefficients are very informative and effective for speaker recognition. In [71], amplitude modulation (AM)-frequency modulation (FM) based speech parameterization is proposed for speaker recognition.

Speaker Modelling Techniques

The speaker model built with the adaptive technique is shown to give better performance than the BPC and GMM for cross-channel conditions. In this study, it is shown that the hybrid model gives better recognition performance than the MLP.

Speaker Testing and Decision Logic

A study done in [17] reported that by combining evidence from the source, suprasegmental and spectral features, it is indeed possible to improve the performance of the speaker recognition system. In [94], it is reported that speaker recognition performance can be improved by combining evidence from SVM and GMM classifiers.

Summary and Scope for Present Work

The effectiveness of other ranging techniques needs to be investigated under limited data conditions. Therefore, studies need to be done using the different combination schemes to improve the performance.

Organization of the Work

Then, the main contributions of the work in the development of some approaches for speaker recognition under conditions of limited data are mentioned. In this chapter, as part of the analysis, we demonstrate the use of multi-frame size (MFS), multi-frame rate (MFR), and multi-frame size and rate (MFSR) analysis techniques for speaker recognition under data-limited conditions. In existing speaker recognition systems, the analysis phase uses a frame size and displacement in the range of 10-30 ms.

MFSR Analysis of Speech

MFS Analysis

In conventional speech processing systems using SFSR, feature vectors are extracted by analyzing the speech signal with a frame size (S) of 20 ms and a shift or hop (H) of 10 ms. Because of VAD, the actual number of feature vectors or frames (Nact) varies from speaker to speaker for the same amount of speech data D. From these two figures, it can be observed that MFS analysis results in multiple numbers of feature vectors for the same speech data.
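As a rough illustration of how the frame size and shift determine the number of feature vectors, the standard sliding-window count can be sketched as below; this ignores VAD, which further reduces the actual count Nact, and the function name is illustrative.

```python
def num_frames(duration_ms, frame_size_ms, frame_shift_ms):
    """Number of full analysis frames for a signal of the given duration
    (standard sliding-window count; VAD would further reduce this)."""
    if duration_ms < frame_size_ms:
        return 0
    return (duration_ms - frame_size_ms) // frame_shift_ms + 1

# 3 sec of speech with the conventional 20 ms frame size and 10 ms shift
print(num_frames(3000, 20, 10))  # 299
```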

Figure 3.1: MFCC feature extraction process

MFR Analysis

Comparing Figure 3.2(d) with Figure 3.2(a), it can be said that MFR analysis yields a significantly larger number of feature vectors. Furthermore, the presence of feature vectors in places other than SFSR demonstrates the manifestation of different spectral information. Furthermore, by comparing Figure 3.2(d) with Figure 3.2(b), it can be observed that the distribution of feature vectors in the feature space is different for MFR and MFS.

MFSR Analysis

As a result, as shown in Figure 3.2(b) and Figure 3.2(d), the distribution of feature vectors is different in each case. As can be observed, even when using a frame size of 20 ms and a frame shift of 1 ms, the number of feature vectors obtained is smaller than with MFSR. Another important observation from Figure 3.2(e) and Figure 3.2(f) is that the number of feature vectors obtained with a frame size of 20 ms and a frame shift of 0.125 ms is greater than with MFSR.
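The multiplication of feature vectors under MFSR can be sketched by summing the frame counts over every (size, shift) pair. The particular frame sizes and shifts below are illustrative assumptions, not the exact configuration used in the thesis.

```python
# Hypothetical MFSR configuration (illustrative values only)
frame_sizes = [16, 20, 24, 28, 32]   # ms
frame_shifts = [4, 6, 8, 10]         # ms

def mfsr_frame_count(duration_ms, sizes, shifts):
    """Total feature vectors from analyzing the same speech data at
    every (frame size, frame shift) pair, as in MFSR analysis."""
    total = 0
    for s in sizes:
        for h in shifts:
            if duration_ms >= s:
                total += (duration_ms - s) // h + 1
    return total

# Far more vectors than the single 20 ms / 10 ms configuration yields
print(mfsr_frame_count(3000, frame_sizes, frame_shifts))
```

The point of the sketch is that, for the same 3 sec of data, MFSR produces an order of magnitude more feature vectors than a single fixed frame size and rate, which is what makes it attractive under limited data conditions.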

Table 3.1: Comparison of average number of frames (µ) using different analysis techniques for the first 30 speakers taken from the YOHO database, each having 3 sec training data

Limited Data Speaker Recognition using MFSR Analysis

Speech Database

These studies were later extended to the data of all 138 speakers from the YOHO database and to the data of the first 30 speakers and the first 138 speakers from the test set of the TIMIT database.

Speaker Modelling and Testing

In the first approach, the minimum distance to each codebook is recorded for each frame, and the speaker whose codebook gives the minimum distance wins that frame. The speaker with the maximum number of frame wins (votes) is recognized as the speaker of the test speech data. In the second approach, the minimum distance to each codebook is recorded for each frame, and the speaker whose codebook gives the minimum average distance over all frames is recognized as the speaker of the test speech data.

Speaker Recognition using SFSR, MFS, MFR and MFSR Analysis

Experiments are therefore conducted with different codebook sizes N to verify recognition performance for different amounts of data. In both strategies, each feature vector of the test speech data is compared with the codebook vectors of each speaker using the Euclidean distance. The MFSR-based speaker recognition system for the limited data condition is shown as a block diagram in Figure 3.4.
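The two VQ decision strategies described above can be sketched as follows, assuming each speaker is represented by a codebook array; the function and argument names are illustrative, not taken from the thesis.

```python
import numpy as np

def decide_speaker(test_vecs, codebooks, strategy="voting"):
    """Two VQ decision rules: per-frame voting, and minimum average
    distance over all frames. `codebooks` maps speaker name to an
    (N, dim) array of codevectors."""
    names = list(codebooks)
    # min Euclidean distance of every test frame to every speaker's codebook
    dmin = np.stack([
        np.linalg.norm(test_vecs[:, None, :] - codebooks[n][None, :, :],
                       axis=2).min(axis=1)
        for n in names
    ])  # shape: (speakers, frames)
    if strategy == "voting":
        winners = dmin.argmin(axis=0)            # winning speaker per frame
        counts = np.bincount(winners, minlength=len(names))
        return names[counts.argmax()]            # most frame wins
    return names[dmin.mean(axis=1).argmin()]     # lowest average distance
```

The two rules can disagree when a few frames match an impostor very strongly; voting is robust to such outlier frames, while the average-distance rule weights every frame equally.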

Experimental Results and Discussions

Limited Training and Sufficient Testing Data

Therefore, an attempt is made here to increase speaker recognition performance for the same amount of training data by using MFS, MFR and MFSR. In Figure 3.5, MFSR provides a significantly larger improvement for 3 sec than for 6 sec of training data. A performance difference of approximately 10% can be seen from SFSR to MFSR at 3 sec and 6 sec of training data.

Sufficient Training and Limited Testing Data

Further, we observed that there is a significant improvement in performance for 3 sec compared to 6 sec of test data. A recognition performance almost 15% higher than SFSR is achieved for test data of less than 12 seconds. Therefore, MFSR methods can also be used for large databases to improve speaker recognition performance when test data is small.

Figure 3.5: Performance of the speaker recognition system based on SFSR and MFSR analysis techniques for different sizes of training data for the first 30 speakers taken from the YOHO database.

Limited Training and Test Data

The performance is higher than that of SFSR, which gives 70% for a codebook of size 64. The performance is also higher than that of SFSR, which gives 67% for a codebook of size 64. A recognition performance of 90% is achieved for 3-second training and test data using MFSR for codebook size 128.

Figure 3.8: Performance of the speaker recognition system based on SFSR and MFSR analysis techniques for 138 speakers taken from the YOHO database for different sizes of testing data

Summary

This chapter presents an experimental evaluation of the different feature extraction techniques for speaker recognition under limited data condition. Advanced speaker recognition systems therefore use speaker-specific information such as vocal tract [23], excitation source [22] and suprasegmental features such as intonation, duration and accent [121] for speaker recognition. Therefore, extraction of different levels of information is especially important in speaker recognition under limited data condition to obtain reliable performance.

Limited Data Speaker Recognition using Different Features

Vocal Tract Features for Speaker Recognition

  • Speaker Recognition using MFCC
  • Speaker Recognition using LPCC

Experimental results using the ∆ and ∆∆MFCC features for 30 speakers of the YOHO database, using one speech file (3 seconds) for training and testing data and different codebook sizes, are also given in Table 4.1. From this study, we can conclude that under data-limited conditions, the MFCC feature performs better than the LPCC feature. Therefore, we suggest that the MFCC feature can be used for speaker recognition under data-limited conditions.
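The ∆MFCC features mentioned above are typically computed with a regression window over the static MFCC trajectory; applying the same operation twice gives ∆∆MFCC. The sketch below uses the common regression formula with an assumed window parameter k, which is not taken from the thesis.

```python
import numpy as np

def delta(features, k=2):
    """Regression-based time derivative (∆) of a (frames x coeffs)
    feature matrix; apply twice for ∆∆. Edge frames are padded by
    repetition. The window size k is an assumption."""
    padded = np.pad(features, ((k, k), (0, 0)), mode="edge")
    num = sum(i * (padded[k + i:len(features) + k + i] -
                   padded[k - i:len(features) + k - i])
              for i in range(1, k + 1))
    return num / (2 * sum(i * i for i in range(1, k + 1)))
```

For a linearly increasing coefficient trajectory the interior delta values come out to the slope, which is a quick sanity check on the implementation.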

Table 4.1: Speaker recognition performance (%) for the first 30 speakers of the YOHO database using 3 sec training and testing data for MFCC feature and its derivatives.

Excitation Source Features for Speaker Recognition

  • Speaker Recognition using LPR
  • Speaker Recognition using LPRP

Limited Data Speaker Recognition using Combination of Features

Experimental results for MFCC using SFSR analysis for 30 speakers of the YOHO database, using one speech file (3 seconds) for training and testing data and different codebook sizes, are given in Section 3.4 of Chapter 3 (Table 3.2). The magnitude of the analytic signal ra(n) is the Hilbert envelope he(n), given by he(n) = |ra(n)| = √(r²(n) + r²h(n)), where rh(n) is the Hilbert transform of the LP residual r(n). Also, the experimental results for the TIMIT database are similar to those for the YOHO database, regardless of the speaker population and the amount of data.
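The Hilbert envelope computation can be sketched directly with SciPy's analytic-signal routine; this is a minimal illustration, assuming the LP residual r(n) is already available as an array.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(residual):
    """Hilbert envelope of the LP residual: the magnitude of the
    analytic signal r_a(n), i.e. h_e(n) = sqrt(r(n)^2 + r_h(n)^2),
    where r_h is the Hilbert transform of r."""
    return np.abs(hilbert(residual))
```

As a check, the envelope of a pure cosine is the constant 1, since its analytic signal is a complex exponential of unit magnitude.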

Figure 4.1: Block diagram of combination of features for speaker recognition under limited data condition.

Summary

Limited Data Speaker Recognition using Different Modelling Techniques

Speaker modelling by Direct Template Matching (DTM)

Speaker Modelling using CVQ

Speaker Modelling using FVQ

Speaker Modelling using SOM

Speaker Modelling using LVQ

Speaker Modelling using GMM

Speaker Modelling using GMM-UBM

Limited Data Speaker Modelling using Combined Modelling Techniques

Performance for the 30 speakers of the YOHO database, using one speech file (3 seconds) for training and testing data, is reported for different values of TH. The performance for the 30 speakers of the YOHO database using one speech file (3 seconds) for training and testing data, η, and different iterations is given in Table 5.4. Experimental results for the 30 speakers of the YOHO database using one speech file (3 seconds) for training and testing data and different numbers of Gaussian mixtures are given in Table 5.6.

Figure 5.1: Block diagram of proposed combined modelling technique for speaker recognition under limited data condition.

Summary

The combination of features such as MFCC, its time derivatives (∆MFCC, ∆∆MFCC), linear prediction residual (LPR) and linear prediction residual phase (LPRP) provides improved performance in the feature extraction phase. The combination of Learning Vector Quantization (LVQ) and Gaussian Mixture Model - Universal Background Model (GMM-UBM) provides improved performance in the modeling phase. In [99], fixed frame size and rate analysis are used to extract MFCC, and GMM is used as a modeling technique.
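One common way to combine two modelling techniques at the score level is to normalize each system's scores and take a weighted sum; the sketch below illustrates this for LVQ distances (lower is better) and GMM-UBM log-likelihood ratios (higher is better). The normalization and equal weighting are illustrative assumptions, not the thesis's exact combination rule.

```python
import numpy as np

def combine_scores(lvq_dist, gmm_llr, weight=0.5):
    """Score-level fusion sketch: LVQ distances are negated so that
    higher is better, both score sets are z-normalized, and a weighted
    sum picks the recognized speaker."""
    def znorm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-12)
    combined = (weight * znorm(-np.asarray(lvq_dist))
                + (1 - weight) * znorm(gmm_llr))
    return int(np.argmax(combined))  # index of the recognized speaker
```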

Integrated Systems for Limited Data Speaker Recognition

  • Analysis Stage
  • Feature Extraction Stage
  • Modelling Stage
  • Testing Stage

In this phase, features such as MFCC, ∆MFCC, ∆∆MFCC, LPR and LPRP are extracted from the speech signal. Since residual samples in LPR and LPRP cases are compared in the time domain, the MFSR analysis may not be efficient. In this phase, both LVQ and GMM-UBM modeling techniques are used for speaker modeling.

Figure 6.1: Block diagram of integrated systems based speaker recognition for limited data condition.

Limited Data Speaker Recognition using Integrated Systems

The performance of system S3 for the 30 speakers of the YOHO database using 3-second training and test data for different values of the LVQ parameters is given in Table 6.3. This system gives a recognition performance of 27% for the 30 speakers of the YOHO database using 3 seconds of training and testing data. The performance of system S8 for the 30 speakers of the YOHO database using 3-second training and test data for different values of the LVQ parameters is given in Table 6.6.

Table 6.2: Speaker recognition performance (%) for the first 30 speakers taken from the YOHO database using 3 sec training and testing data for integrated system S 1 i.e., MFSR-MFCC-LVQ technique.

Summary

Combination Techniques for Limited Data Speaker Recognition

  • Abstract Level Combination
    • Voting
    • Strength Voting (SV)
  • Rank Level Combination
    • Borda count (BC)
    • Weighted Ranking (WR)
  • Measurement Level Combination
    • Linear Combination of Frame Ratio (LCFR)
    • Weighted LCFR (WLCFR)
    • Supporting Systems (SS)
  • Hierarchical Combination (HC)

The performance of the integrated systems is also used to derive the rank of each speaker in the WR technique. Alternatively, we can benefit from taking the performance of the integrated systems into account. The combined performance is higher than that of the individual integrated systems and the combination techniques in Table 7.3.
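Of the rank-level schemes listed above, the Borda count is the simplest to illustrate: each system ranks all speakers, and a speaker receives points inversely proportional to its position in each ranking. The sketch below is a minimal implementation of this general technique, with illustrative names.

```python
def borda_count(rankings, num_speakers):
    """Rank-level Borda count combination: each system's ranking list
    (best speaker first) awards num_speakers - position points to each
    speaker; the speaker with the highest total wins."""
    scores = [0] * num_speakers
    for ranking in rankings:
        for position, speaker in enumerate(ranking):
            scores[speaker] += num_speakers - position
    return max(range(num_speakers), key=scores.__getitem__)
```

Weighted ranking (WR) differs only in multiplying each system's points by a weight reflecting that system's individual recognition performance.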

Figure 7.1: Block diagram of combining evidences from the integrated systems for speaker recognition under limited data condition.

Summary

Therefore, we suggest that the combined LVQ-GMM-UBM can be used as a modeling technique under limited data. Therefore, we suggest that integrated systems can be used to improve performance under data-limited conditions. Therefore, we suggest that HC can be used to improve speaker recognition performance under data-limited conditions.

Tables 8.1 and 8.2 show the summary of the results obtained using the different approaches for the YOHO and TIMIT databases, respectively.

Contributions of the Work

Scope for the Future Work

For example, the performance of kernel eigenspace based maximum likelihood linear regression (KEMLLR) should be verified. The vibration of the vocal cords, driven by air coming from the lungs during exhalation, is the source of sound for speech. The vocal tract is replaced by a filter, and the filter coefficients depend on the physical dimensions of the vocal tract.

Figure A.1: The representation of the speech production mechanism.

Database Description

MFSR Analysis of Speech for Limited Data Speaker Recognition

Combination of Features for Limited Data Speaker Recognition

Combined Modelling Techniques for Limited Data Speaker Recognition

This study therefore demonstrates that the combination of features can be used to improve speaker recognition under limited data conditions, regardless of the data quality. The study in Chapter 5 demonstrates that the combined LVQ and GMM-UBM modelling provided better speaker recognition performance than the individual and other combined modelling techniques on the YOHO and TIMIT databases. The experimental results for the NIST-SRE-2003 database also agree with those for the YOHO and TIMIT databases.

Figure D.1: Performance of the SFSR and MFSR analysis techniques speaker recognition system for (a) the first 30 speakers (b) the 138 speakers taken from the NIST-SRE-2003 database.

Integrated Systems for Limited Data Speaker Recognition

Combining Evidences for Limited Data Speaker Recognition

  • Block diagram of closed-set speaker identification system
  • Block diagram of open-set speaker identification system
  • Block diagram of speaker verification system
  • Basic block diagram of a speaker recognition system
  • MFCC feature extraction process
  • Features of a speaker for 100 ms speech data: (a) Features extracted for 20 ms
  • Features of another speaker for 100 ms speech data: (a) Features extracted for 20
  • The SFSR and proposed MFSR based speaker recognition system for limited

To verify the effectiveness of the combination techniques, we conducted the experiments on the NIST-SRE-2003 database. The experimental results for the NIST-SRE-2003 database are also similar to those for the YOHO and TIMIT databases. Therefore, we suggest that the combination techniques can be used to improve performance under limited data conditions, regardless of data quality.

Figure D.4: Performance of the integrated speaker recognition systems for (a) the first 30 speakers (b) the 138 speakers taken from the NIST-SRE-2003 database.

Figure 3.2: Features of a speaker for 100 ms speech data: (a) Features extracted for 20 ms frame size and 10 ms frame shift (b) MFS based feature vectors (c) Features extracted for 20 ms frame size and 1 ms frame shift (d) MFR based feature vectors
Figure 3.3: Features of another speaker for 100 ms speech data: (a) Features extracted for 20 ms frame size and 10 ms frame shift (b) MFS based feature vectors (c) Features extracted for 20 ms frame size and 1 ms frame shift (d) MFR based feature vectors
