Speaker Recognition Terminologies

1. Introduction

level processing of LP residual. The typical approach for suprasegmental processing is to first process the LP residual at the segmental level to extract relevant information that can then be viewed at the suprasegmental level. For instance, segmental processing is first performed on the LP residual to extract the pitch, and the pitch values of successive segmental blocks are then plotted to obtain the pitch contour. The pitch contour is viewed as the speaker-specific excitation information at the suprasegmental level [8]. In a similar way, several other kinds of information can be obtained to view the speaker information at the suprasegmental level [19].
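The pitch-contour procedure described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the method of the cited works: the LP order, frame length, hop and pitch search range are arbitrary choices, the pitch is picked as the autocorrelation peak of the LP residual, and all function names (including the synthetic test signal generator) are hypothetical.

```python
import numpy as np

def lp_coefficients(frame, order=10):
    """Autocorrelation-method LP analysis: a[k] such that s[n] ~ sum_k a[k] s[n-k]."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += (1e-9 * r[0] + 1e-12) * np.eye(order)  # small ridge for numerical safety
    return np.linalg.solve(R, r[1:order + 1])

def lp_residual(frame, order=10):
    """Prediction error obtained by inverse filtering the frame."""
    a = lp_coefficients(frame, order)
    pred = np.zeros_like(frame)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * frame[:-k]
    return frame - pred

def frame_pitch(residual, fs, fmin=60.0, fmax=400.0):
    """Segmental step: pitch (Hz) from the autocorrelation peak of the residual."""
    r = np.correlate(residual, residual, mode="full")[len(residual) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return fs / lag

def pitch_contour(signal, fs, frame_len=0.03, hop=0.01, order=10):
    """Suprasegmental view: pitch values of successive segmental blocks."""
    n, h = int(frame_len * fs), int(hop * fs)
    return np.array([frame_pitch(lp_residual(signal[i:i + n], order), fs)
                     for i in range(0, len(signal) - n + 1, h)])

def synthetic_voiced(fs=8000, f0=100.0, duration=0.2):
    """Hypothetical test signal: impulse train through a stable two-pole filter."""
    n = int(fs * duration)
    e = np.zeros(n)
    e[::int(fs / f0)] = 1.0
    s = np.zeros(n)
    for i in range(n):
        s[i] = e[i]
        if i >= 1:
            s[i] += 1.3 * s[i - 1]
        if i >= 2:
            s[i] -= 0.8 * s[i - 2]
    return s
```

Calling `pitch_contour(synthetic_voiced(), 8000)` returns one pitch value per 10 ms hop. Real speech would additionally need a voiced/unvoiced decision, which this sketch omits.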

The observations from the existing attempts at suprasegmental processing indicate that this level also contains a good amount of speaker-specific excitation information. However, suprasegmental processing suffers from large intra-speaker variability.

As briefly reviewed above, the subsegmental, segmental and suprasegmental levels of processing each model the LP residual independently and demonstrate that each level has a significant amount of speaker-specific excitation information. However, to the best of our knowledge, there are no systematic efforts in the literature to explore whether each level carries different information. If the information is different, then each of these attempts models only one component of the speaker-specific excitation information. In that case, can the levels be combined to achieve further improvement in modeling the speaker-specific excitation information for speaker recognition? Hence the need for such an attempt.

1.4 Speaker Recognition Terminologies

models of all the users having access to that system are available. The test speaker is therefore guaranteed to be one of the given reference models, and the system gives its decision accordingly. In the open-set case, there is a possibility that the test speaker is not among the reference models available in the system. In such a case the system is designed to perform a second level of test, that is, how close the test speaker is to the selected reference model, which may give an additional decision such as no match [31]. Speaker identification is a process of one-to-many comparisons.
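The difference between the closed-set and open-set decisions can be illustrated with a short sketch. It assumes that each reference model has already produced a similarity score for the test utterance (higher meaning closer); the function name `identify` and the threshold values are hypothetical and serve only as an illustration.

```python
def identify(scores, threshold=None):
    """One-to-many comparison over reference-model scores.

    scores: dict mapping speaker label -> similarity score (higher = closer).
    threshold: None for the closed-set case; a number for the open-set case.
    """
    best = max(scores, key=scores.get)
    # Open-set second-level test: is the best match close enough?
    if threshold is not None and scores[best] < threshold:
        return None  # additional decision: no match
    return best

scores = {"spk1": -4.2, "spk2": -1.3, "spk3": -2.7}
closed_set_decision = identify(scores)               # always one of the enrolled speakers
open_set_decision = identify(scores, threshold=-1.0)  # may be None (no match)
```

In the closed-set case the best-scoring model is always returned, whereas in the open-set case the same best score must also pass the threshold test.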

Depending on the mode of operation, an SR system is also classified as text-dependent or text-independent [30]. If speech of the same text is used for building the speaker model and later for comparison, then it is called a text-dependent SR system. In the text-independent case there is no such constraint. Generally, text-independent speaker recognition is more difficult than the text-dependent task, because one has to cope with additional variability due to differences in the texts of the unknown and reference utterances [23]. In this thesis all our studies are on text-independent, closed-set speaker identification and verification tasks.

1.4.2 Block Diagram of SR system

The basic block diagram of an SR system is shown in Figure 1.2. The function of the block diagram may be divided into two phases of operation: the training phase and the testing phase. The training phase includes the feature extraction and modeling blocks. The feature extraction block extracts the speaker-specific features from the speech signal(s) available for training. The modeling block uses these extracted features for building the speaker models. The testing phase includes feature extraction, comparison and decision. In this phase, the feature extraction block uses the same procedure as in the training phase to extract features from the test speech signal. Based on the modeling approach and mode of task, a decision is made by comparing the test speaker features with the reference models in the comparison and decision blocks. Thus, the main functions in an SR system may be listed as feature extraction, modeling and testing.
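The two phases can be sketched as a small pipeline. This is a toy illustration of the block structure only: the placeholder features (frame log-energy and zero-crossing rate), the mean-vector "model" and the function names are assumptions made for the sketch and are not the features or models studied in this thesis.

```python
import numpy as np

def extract_features(signal, frame_len=200, hop=100):
    """Feature extraction block (placeholder: log-energy and zero-crossing rate)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([[np.log(np.sum(f ** 2) + 1e-12),
                      np.mean(np.abs(np.diff(np.sign(f))) > 0)] for f in frames])

def train(train_speech):
    """Training phase: feature extraction followed by modeling (here a mean vector)."""
    return {spk: extract_features(sig).mean(axis=0)
            for spk, sig in train_speech.items()}

def recognize(test_signal, models):
    """Testing phase: same feature extraction, then comparison and decision."""
    feats = extract_features(test_signal)
    scores = {spk: -np.mean(np.linalg.norm(feats - m, axis=1))
              for spk, m in models.items()}
    return max(scores, key=scores.get), scores
```

Note that `extract_features` is shared by both phases, mirroring the block diagram in which the same feature extraction procedure is applied to training and test speech.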

[Figure 1.2 (block diagram). Training phase: train speech → feature extraction → modeling. Testing phase: test speech → feature extraction → comparison → decision → recognized speaker.]

Figure 1.2: Basic block diagram of automatic speaker recognition system.

The objective of the feature extraction stage is to extract sufficient and robust speaker information at a reduced data rate for effective modeling and later for comparison [32]. The features ultimately determine the separability of the speakers, and the recognition performance mostly depends upon the discriminating ability of the features [7, 33]. Moreover, the feature extraction stage is common to both the training and testing phases. Thus, feature extraction is an important stage in the SR system [31]. The selection of features capable of effectively representing the speaker information, their robustness against unfavorable condition(s) and their accurate measurement play an important role in speaker recognition [7].

The features produced from an individual speaker are represented by vectors in the feature space. A good feature should vary little within a speaker and vary more across speakers [2, 23, 31]. In the feature space, however, the features of different speakers overlap with each other. So, a second level of compression among the features of a speaker is made in the modeling stage. A large set of feature vectors of a speaker is grouped into its representative vectors by one of several modeling techniques. The most widely used modeling techniques include vector quantization (VQ), AANN, Gaussian mixture models (GMM) and GMM-universal background model (GMM-UBM) [4, 5, 27, 34–37]. State-of-the-art speaker recognition systems mostly use the GMM-UBM modeling technique.
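As an illustration of this second level of compression, the sketch below builds a VQ codebook with plain k-means, so that a large set of feature vectors is reduced to a few representative vectors. The codebook size, iteration count and function names are assumptions chosen for the sketch; practical systems often use other codebook design procedures.

```python
import numpy as np

def train_codebook(features, size=4, iters=20, seed=0):
    """Group feature vectors (N x D) into `size` representative code vectors."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), size, replace=False)].copy()
    for _ in range(iters):
        # Distance of every feature vector to every code vector: shape (N, size)
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        for k in range(size):
            members = features[nearest == k]
            if len(members):  # keep the old code vector if a cell is empty
                codebook[k] = members.mean(axis=0)
    return codebook

def average_distortion(features, codebook):
    """Mean Euclidean distance from each feature vector to its nearest code vector."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

A test utterance would then be expected to give a lower average distortion against the codebook of the true speaker than against the codebooks of other speakers.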

The comparison is made at the frame level and a score is assigned to each of the reference models.
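For a GMM-based system, frame-level comparison can be sketched as a per-frame log likelihood ratio between a speaker model and a background model, averaged over the utterance. This is a minimal sketch under stated assumptions: diagonal covariances, hand-set model parameters passed as `(weights, means, variances)` tuples, no adaptation of the speaker model from the background model, and hypothetical function names.

```python
import numpy as np

def frame_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (M,); means, variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                       # (T, M, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))  # (T, M)
    a = log_comp + np.log(weights)
    m = a.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True)))[:, 0]

def utterance_llr(frames, speaker_gmm, background_gmm):
    """Frame-level log likelihood ratios accumulated and normalized by frame count."""
    llr = frame_loglik(frames, *speaker_gmm) - frame_loglik(frames, *background_gmm)
    return float(np.mean(llr))
```

Dividing by the number of frames keeps scores of utterances with different durations comparable.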

The assigned score depends upon the modeling techniques. For example, Euclidean distance and log likelihood ratio (LLR) are used as the scores for VQ and GMM-UBM modeling techniques, respectively [38, 39]. The frame level scores are accumulated and normalized over the whole utterance of the test speaker. The decision is taken based on the scores assigned to the
