3.7 Performance of the speaker recognition system based on SFSR and MFSR analysis techniques for test data of different sizes for the first 30 speakers from the YOHO database.
3.9 Performance of the speaker recognition system based on SFSR and MFSR analysis techniques for test data of different sizes for the first 30 speakers from the YOHO database.
Objective of the Thesis
Importance of Speaker Recognition
Nature of Speech for Speaker Recognition
Classifications of Speaker Recognition
- Speaker Identification
- Closed-set
- Open-set
- Speaker Verification
- Text-dependent
- Text-independent
In the open-set case, the speaker identification system first identifies the speaker whose model is closest to the test speech data. If the distance from the test speech data to that model is below a threshold, the speaker is accepted as a true speaker; otherwise, the test data is rejected as coming from an impostor.
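The open-set decision described above can be sketched as follows (a minimal illustration; the function name and score convention are assumptions, not from the thesis):

```python
def open_set_identify(distances, threshold):
    """Open-set speaker identification.

    distances: dict mapping speaker id -> distance between the test
               speech data and that speaker's model (lower = closer).
    Returns the closest speaker if within the threshold, else None
    (i.e. the test data is rejected as an impostor).
    """
    best = min(distances, key=distances.get)
    return best if distances[best] < threshold else None
```

Closed-set identification is the same computation without the threshold check: the closest speaker is always returned.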
Components of Speaker Recognition
- Analysis
- Feature extraction
- Modelling
- Testing
The relevance of these analysis techniques for speaker recognition under limited data conditions needs to be investigated. The quality of the speaker model depends on the quality and quantity of the feature vectors.
Issues in Speaker Recognition
Motivation for the Present Work
Applications of Limited Data Speaker Recognition
In such cases, the amount of speech data available for training and testing a speaker can be as small as a few seconds. Moreover, sending large amounts of speech data over the transmission medium is not feasible and may increase communication costs.
Organization of the Thesis
In Chapter 6, integrated systems for speaker recognition under limited data conditions are proposed and demonstrated. The overall performance of the speaker recognition system depends on the techniques used in the analysis, feature extraction, modeling and testing stages.
Speech Analysis Techniques
Therefore, this chapter provides an overview of some of the approaches developed in the field of speaker recognition. Studies have been carried out on segmental analysis, which is used to extract vocal tract information for speaker recognition.
Feature Extraction Techniques
This study shows that the reflection coefficients are very informative and effective for speaker recognition. In [71], amplitude modulation (AM)-frequency modulation (FM) based speech parameterization is proposed for speaker recognition.
Speaker Modelling Techniques
The speaker model built with the adaptive technique is shown to give better performance than the BPC and GMM for cross-channel conditions. In this study, it is shown that the hybrid model gives better recognition performance than the MLP.
Speaker Testing and Decision Logic
A study done in [17] reported that by combining evidence from the source, suprasegmental and spectral features, it is indeed possible to improve the performance of the speaker recognition system. In [94], it is reported that speaker recognition performance can be improved by combining evidence from SVM and GMM classifiers.
Summary and Scope for Present Work
The effectiveness of other existing techniques needs to be investigated under limited data conditions. Therefore, studies need to be done using different combination schemes to improve performance.
Organization of the Work
Then, the main contributions of the work towards developing approaches for speaker recognition under limited data conditions are mentioned. In this chapter, as part of the analysis stage, we demonstrate the use of multi-frame size (MFS), multi-frame rate (MFR), and multi-frame size and rate (MFSR) analysis techniques for speaker recognition under limited data conditions. In existing speaker recognition systems, the analysis stage uses a frame size and shift in the range of 10-30 ms.
MFSR Analysis of Speech
MFS Analysis
In conventional speech processing systems using SFSR, feature vectors are extracted by analyzing the speech signal with a frame size (S) of 20 ms and a shift or hop (H) of 10 ms, which for speech duration D yields N = (D - S)/H + 1 frames. Because of voice activity detection (VAD), the actual number of feature vectors or frames (Nact) varies from speaker to speaker for the same amount of speech data D. From these two figures, it can be observed that MFS analysis results in a multiple number of feature vectors for the same speech data.
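The effect of varying the frame size and shift on the number of feature vectors can be sketched as follows (the specific size and shift sets are illustrative, not necessarily those used in the thesis):

```python
def num_frames(duration_ms, size_ms, shift_ms):
    """Number of full analysis frames, before VAD, for a signal of
    the given duration: N = (D - S)/H + 1."""
    if duration_ms < size_ms:
        return 0
    return (duration_ms - size_ms) // shift_ms + 1

# Conventional SFSR: a single frame size (20 ms) and shift (10 ms).
sfsr = num_frames(3000, 20, 10)

# MFS analysis: several frame sizes with a fixed 10 ms shift.
mfs = sum(num_frames(3000, s, 10) for s in (16, 20, 24, 28, 32))

# MFR analysis: a fixed 20 ms frame size with several shift values.
mfr = sum(num_frames(3000, 20, h) for h in (4, 6, 8, 10, 12))
```

For 3 sec of data, both MFS and MFR produce several times as many feature vectors as SFSR, which is the point of the multi-frame techniques under limited data.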
MFR Analysis
Comparing Figure 3.2(d) with Figure 3.2(a), it can be said that MFR analysis yields a significantly larger number of feature vectors. Furthermore, the presence of feature vectors in places other than SFSR demonstrates the manifestation of different spectral information. Furthermore, by comparing Figure 3.2(d) with Figure 3.2(b), it can be observed that the distribution of feature vectors in the feature space is different for MFR and MFS.
MFSR Analysis
As a result, as shown in Figure 3.2(b) and Figure 3.2(d), the distribution of feature vectors is different in each case. As can be observed, even when using a frame size of 20 ms and a frame shift of 1 ms, the number of feature vectors obtained is smaller than with MFSR. Another important observation from Figure 3.2(e) and Figure 3.2(f) is that the number of feature vectors obtained with a frame size of 20 ms and a frame shift of 0.125 ms is greater than with MFSR.
Limited Data Speaker Recognition using MFSR Analysis
Speech Database
These studies were later extended to the data of all 138 speakers from the YOHO database and to the data of the first 30 speakers and the first 138 speakers from the test set of the TIMIT database.
Speaker Modelling and Testing
In the first approach, the minimum distance to each codebook is recorded for each frame; the speaker whose codebook gives the minimum distance wins that frame, and the frame is assigned to that speaker. The speaker with the maximum number of votes is recognized as the speaker of the test speech data. In the second approach, the minimum distance to each codebook is collected for each frame, and the speaker whose codebook gives the minimum average distance over all frames is recognized as the speaker of the test speech data.
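The two decision strategies can be sketched as follows (a minimal NumPy illustration; the function name and data layout are assumptions):

```python
import numpy as np

def identify(test_frames, codebooks):
    """Two VQ decision strategies over per-frame minimum distances.

    test_frames: (num_frames, dim) array of test feature vectors.
    codebooks:   dict mapping speaker id -> (codebook_size, dim) array.
    Returns (winner_by_votes, winner_by_average_distance).
    """
    speakers = list(codebooks)
    # dmin[s, i] = Euclidean distance from frame i to its nearest
    # codevector in speaker s's codebook.
    dmin = np.array([
        np.min(np.linalg.norm(test_frames[:, None, :] - cb[None, :, :],
                              axis=2), axis=1)
        for cb in (codebooks[s] for s in speakers)
    ])  # shape: (num_speakers, num_frames)

    # Strategy 1: each frame votes for its closest speaker.
    votes = np.bincount(np.argmin(dmin, axis=0), minlength=len(speakers))
    by_votes = speakers[int(np.argmax(votes))]

    # Strategy 2: minimum average distance over all frames.
    by_avg = speakers[int(np.argmin(dmin.mean(axis=1)))]
    return by_votes, by_avg
```

Both strategies use the same per-frame distances; they differ only in how the frame-level evidence is pooled into a final decision.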
Speaker Recognition using SFSR, MFS, MFR and MFSR Analysis
Experiments are therefore conducted with different codebook sizes N to verify recognition performance for different amounts of data. In both strategies, each feature vector of the test speech data is compared to the codebook vectors of each speaker using the Euclidean distance. The MFSR based speaker recognition system for the limited data condition is shown as a block diagram in Figure 3.4.
Experimental Results and Discussions
Limited Training and Sufficient Testing Data
Therefore, an attempt is made here to increase speaker recognition performance for the same amount of training data by using MFS, MFR and MFSR. From Figure 3.5, we observe that the improvement provided by MFSR is more significant for 3 sec than for 6 sec of training data. A performance difference of approximately 10% can be seen from SFSR to MFSR for both 3 sec and 6 sec of training data.
Sufficient Training and Limited Testing Data
Further, we observed that the improvement in performance is more significant for 3 sec than for 6 sec of test data. Recognition performance almost 15% higher than SFSR is achieved for test data of less than 12 seconds. Therefore, MFSR methods can also be used for large databases to improve speaker recognition performance when the test data is small.
Limited Training and Test Data
This performance is higher than that of SFSR, which gives 70% for a codebook of size 64. This performance is higher than that of SFSR, which gives 67% for a codebook of size 64. A recognition performance of 90% is achieved for 3-second training and test data using MFSR with codebook size 128.
Summary
This chapter presents an experimental evaluation of different feature extraction techniques for speaker recognition under the limited data condition. Advanced speaker recognition systems therefore use speaker-specific information such as vocal tract [23] and excitation source [22] features, and suprasegmental features such as intonation, duration and accent [121], for speaker recognition. Therefore, extraction of different levels of information is especially important for obtaining reliable speaker recognition performance under the limited data condition.
Limited Data Speaker Recognition using Different Features
Vocal Tract Features for Speaker Recognition
- Speaker Recognition using MFCC
- Speaker Recognition using LPCC
Experimental results using the ∆MFCC and ∆∆MFCC features for 30 speakers of the YOHO database, using one speech file (3 seconds) for training and testing data and different codebook sizes, are also given in Table 4.1. From this study, we can conclude that under limited data conditions, the MFCC feature performs better than the LPCC feature. Therefore, we suggest that the MFCC feature can be used for speaker recognition under limited data conditions.
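For reference, the standard MFCC computation (framing, windowing, power spectrum, mel filterbank, log compression, DCT) can be sketched as follows; parameter values are illustrative defaults, not necessarily those used in the experiments:

```python
import numpy as np

def mfcc(signal, fs=8000, size_ms=20, shift_ms=10, n_filters=24, n_ceps=13):
    """Minimal MFCC sketch: framing, FFT power, mel filterbank, log, DCT."""
    size = int(fs * size_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_fft = 512

    # Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(signal) - size) // shift
    idx = np.arange(size)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(size)

    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular mel filterbank between 0 Hz and fs/2.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    # Log filterbank energies followed by a DCT-II give the cepstrum.
    log_energy = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1)
                 / (2 * n_filters))
    return log_energy @ dct.T  # shape: (n_frames, n_ceps)
```

The ∆ and ∆∆ features are first and second time derivatives of these coefficients, usually computed by regression over a few neighbouring frames.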
Excitation Source Features for Speaker Recognition
- Speaker Recognition using LPR
- Speaker Recognition using LPRP
Limited Data Speaker Recognition using Combination of Features
The experimental results for MFCC with SFSR analysis for 30 speakers of the YOHO database, using one speech file (3 seconds) for training and testing data and different codebook sizes, are given in Section 3.4 of Chapter 3 (Table 3.2). The magnitude of the analytic signal ra(n) is the Hilbert envelope he(n), given by he(n) = |ra(n)| = √(r²(n) + r²h(n)), where rh(n) is the Hilbert transform of the LP residual r(n). Also, the experimental results for the TIMIT database are similar to those for the YOHO database, regardless of the speaker population and the amount of data.
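The Hilbert envelope computation can be illustrated with SciPy's `hilbert`, which returns the analytic signal directly (the synthetic input here stands in for an actual LP residual):

```python
import numpy as np
from scipy.signal import hilbert

# The LP residual r(n) would come from inverse-filtering the speech
# signal; a synthetic cosine stands in for illustration.
r = np.cos(2 * np.pi * 500 * np.arange(400) / 8000)

ra = hilbert(r)                          # analytic signal: r(n) + j*rh(n)
he = np.abs(ra)                          # Hilbert envelope: sqrt(r^2 + rh^2)
cos_phase = np.real(ra) / (he + 1e-12)   # residual phase (cosine) component
```

The envelope captures the amplitude structure of the residual, while the phase component retains the fine structure used by the LPRP feature.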
Summary
Limited Data Speaker Recognition using Different Modelling Techniques
Speaker modelling by Direct Template Matching (DTM)
Speaker Modelling using CVQ
Speaker Modelling using FVQ
Speaker Modelling using SOM
Speaker Modelling using LVQ
Speaker Modelling using GMM
Speaker Modelling using GMM-UBM
Limited Data Speaker Modelling using Combined Modelling Techniques
The performance for the 30 speakers of the YOHO database using one speech file (3 seconds) for training and testing data and different threshold (TH) values is reported. The performance for the 30 speakers of the YOHO database using one speech file (3 seconds) for training and testing data and different learning rates (η) and numbers of iterations is given in Table 5.4. Experimental results for the 30 speakers of the YOHO database using one speech file (3 seconds) for training and testing data and different numbers of Gaussian mixtures are given in Table 5.6.
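The role of the η and iteration parameters can be seen from the LVQ update rule, sketched below (a minimal LVQ1 illustration; the function name and parameter defaults are assumptions, not from the thesis):

```python
import numpy as np

def lvq1_train(X, y, codebook, labels, eta=0.08, iters=20):
    """LVQ1: pull the winning codevector toward same-class samples and
    push it away from different-class samples, with a decaying learning
    rate eta over the given number of iterations."""
    cb = codebook.copy()
    for it in range(iters):
        lr = eta * (1 - it / iters)  # learning rate decays each iteration
        for x, cls in zip(X, y):
            w = np.argmin(np.linalg.norm(cb - x, axis=1))  # winning vector
            sign = 1.0 if labels[w] == cls else -1.0
            cb[w] += sign * lr * (x - cb[w])
    return cb
```

Larger η moves the codevectors faster but risks oscillation; more iterations with a decaying rate let the codebook settle, which is why both parameters are swept in the experiments.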
Summary
The combination of features such as MFCC, its time derivatives (∆MFCC, ∆∆MFCC), linear prediction residual (LPR) and linear prediction residual phase (LPRP) provides improved performance in the feature extraction stage. The combination of Learning Vector Quantization (LVQ) and Gaussian Mixture Model - Universal Background Model (GMM-UBM) provides improved performance in the modelling stage. In [99], fixed frame size and rate analysis is used to extract MFCC, and GMM is used as the modelling technique.
Integrated Systems for Limited Data Speaker Recognition
- Analysis Stage
- Feature Extraction Stage
- Modelling Stage
- Testing Stage
In this phase, features such as MFCC, ∆MFCC, ∆∆MFCC, LPR and LPRP are extracted from the speech signal. Since residual samples in LPR and LPRP cases are compared in the time domain, the MFSR analysis may not be efficient. In this phase, both LVQ and GMM-UBM modeling techniques are used for speaker modeling.
Limited Data Speaker Recognition using Integrated Systems
The performance of system S3 for the 30 speakers of the YOHO database using 3-second training and test data for different values of the LVQ parameters is given in Table 6.3. This system gives a recognition performance of 27% for the 30 speakers of the YOHO database using 3 seconds of training and testing data. The performance of system S8 for the 30 speakers of the YOHO database using 3-second training and test data for different values of the LVQ parameters is given in Table 6.6.
Summary
Combination Techniques for Limited Data Speaker Recognition
- Abstract Level Combination
- Voting
- Strength Voting (SV)
- Rank Level Combination
- Borda count (BC)
- Weighted Ranking (WR)
- Measurement Level Combination
- Linear Combination of Frame Ratio (LCFR)
- Weighted LCFR (WLCFR)
- Supporting Systems (SS)
- Hierarchical Combination (HC)
In the WR technique, the performance of the integrated systems is also used to obtain the rank of each speaker. Alternatively, we can benefit from taking the performance of the integrated systems into account. This performance is higher than that of the integrated systems and the other combination techniques in Table 7.3.
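The Borda count rank-level combination can be sketched as follows (speaker labels and rankings are illustrative only):

```python
def borda_count(rankings):
    """Borda count rank-level combination: each system ranks the
    speakers best-first, and a speaker at position p in a ranking of
    n speakers earns (n - p) points. The highest total wins."""
    speakers = rankings[0]
    n = len(speakers)
    scores = {s: 0 for s in speakers}
    for ranking in rankings:
        for pos, s in enumerate(ranking):
            scores[s] += n - pos
    return max(scores, key=scores.get)

# Three systems rank four speakers best-first (illustrative).
winner = borda_count([
    ["A", "B", "C", "D"],
    ["B", "A", "C", "D"],
    ["A", "C", "B", "D"],
])
```

Weighted ranking (WR) is the same idea with each system's points scaled by that system's recognition performance, so more reliable systems contribute more to the final rank.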
Summary
Therefore, we suggest that the combined LVQ and GMM-UBM can be used as a modelling technique under limited data conditions. Therefore, we suggest that integrated systems can be used to improve performance under limited data conditions. Therefore, we suggest that HC can be used to improve speaker recognition performance under limited data conditions.
Contributions of the Work
Scope for the Future Work
For example, the performance of kernel eigenspace based maximum likelihood linear regression (KEMLLR) should be verified. The vibration of the vocal cords, driven by air coming from the lungs during exhalation, is the sound source for speech. The vocal tract is modelled as a filter, whose coefficients depend on the physical dimensions of the vocal tract.
Database Description
MFSR Analysis of Speech for Limited Data Speaker Recognition
Combination of Features for Limited Data Speaker Recognition
Combined Modelling Techniques for Limited Data Speaker Recognition
This study therefore demonstrates that the combined features can be used to improve speaker recognition under the limited data condition, regardless of the data quality. The study in Chapter 5 demonstrates that the combined LVQ and GMM-UBM modelling provides better speaker recognition performance than the individual and other combined modelling techniques on the YOHO and TIMIT databases. The experimental results for the NIST-SRE-2003 database also agree with those for the YOHO and TIMIT databases.
Integrated Systems for Limited Data Speaker Recognition
Combining Evidences for Limited Data Speaker Recognition
- Block diagram of closed-set speaker identification system
- Block diagram of open-set speaker identification system
- Block diagram of speaker verification system
- Basic block diagram of a speaker recognition system
- MFCC feature extraction process
- Features of a speaker for 100 ms speech data: (a) Features extracted for 20 ms
- Features of another speaker for 100 ms speech data: (a) Features extracted for 20
- The SFSR and proposed MFSR based speaker recognition system for limited
To verify the effectiveness of the combination techniques, we conducted the experiments on the NIST-SRE-2003 database. The experimental results for the NIST-SRE-2003 database are also similar to those for the YOHO and TIMIT databases. Therefore, we suggest that the combination techniques can be used to improve performance under limited data conditions, regardless of data quality.