Classifications of Speaker Recognition - Limited Data Speaker Recognition

Depending on the task objective, speaker recognition can be classified as speaker identification and speaker verification.

1.4.1 Speaker Identification

In this case there is no identity claim, and the system identifies who the speaker is [1,6,11,12]. The identification process compares the test speech data with all the speaker models to decide the speaker of the test speech data. The computational complexity of speaker identification increases with the population size and, at the same time, performance degrades [6]. Improving speaker recognition performance for limited data and a large database is an interesting and challenging task. Speaker identification can be further classified as closed-set and open-set [11, 13].

1.4.1.1 Closed-set

In this case, the test speaker is known a priori to be a member of a set of N speakers [11, 13]. The steps involved in closed-set speaker identification are shown as a block diagram in Figure 1.1.

In closed-set speaker identification, the speaker whose model best matches the test speech data is declared the speaker of the test data. The system outputs the identity of a speaker. It provides high security in a closed environment. Its disadvantage is that only speakers enrolled a priori can use the system. This type of system may be useful where high security is needed, for example in defense and private organizations.
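The best-match decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: each speaker model is assumed to be a single reference feature vector, similarity is assumed to be the negative mean Euclidean distance, and the names `score` and `identify_closed_set` and all numeric values are hypothetical.

```python
import math

def score(frames, model):
    """Similarity of test frames to a model: negative mean Euclidean distance.

    Assumption: a model is one reference feature vector, standing in for a
    real speaker model such as a VQ codebook or a GMM.
    """
    total = sum(math.dist(f, model) for f in frames)
    return -total / len(frames)

def identify_closed_set(frames, models):
    """Return the enrolled speaker whose model best matches the test frames."""
    return max(models, key=lambda spk: score(frames, models[spk]))

# Toy 2-D "feature vectors" for three enrolled speakers (made-up values).
models = {"S1": (0.0, 0.0), "S2": (5.0, 5.0), "S3": (10.0, 0.0)}
test_frames = [(4.8, 5.1), (5.3, 4.7), (4.9, 5.4)]

identified = identify_closed_set(test_frames, models)
print(identified)
```

Every enrolled model is scored once per test utterance, which is why the cost of identification grows with the population size N. The rule always returns one of the enrolled speakers, even for speech from someone outside the set.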

Figure 1.1: Block diagram of closed-set speaker identification system. (Blocks: speech data; framing and windowing; feature extraction; similarity against the reference models of speakers #1 to #N; selection criteria; identified speaker.)

1.4.1.2 Open-set

A speaker identification system that can also handle a test speaker from outside the set of N speakers is known as open-set speaker identification [11,13]. The steps involved in open-set speaker identification are shown as a block diagram in Figure 1.2. In this case, the closed-set speaker identification system first identifies the speaker closest to the test speech data. Then, the verification system compares the distance of this speaker with a threshold to reach a decision. If the speaker is accepted, that speaker is the identified speaker for the test data. Otherwise, the system has to report that the speaker is unknown. The open-set speaker identification system does not impose the usage restriction of the closed-set system. Its disadvantages are that the design is more complex and that finding the optimum threshold to verify the speaker is a tedious task. This system can be used in public services such as banking, airports and railways.
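The two-stage open-set procedure (closed-set search, then a threshold check on the best match) can be sketched as below. This is an illustrative sketch only: the single-vector speaker models, the mean Euclidean distance, the threshold value of 1.0 and the function names are all assumptions made for the example.

```python
import math

def mean_distance(frames, model):
    """Mean Euclidean distance from the test frames to one reference vector."""
    return sum(math.dist(f, model) for f in frames) / len(frames)

def identify_open_set(frames, models, threshold):
    """Closed-set search followed by a threshold check on the best match.

    Returns the speaker label, or None when even the closest enrolled
    speaker is farther away than `threshold` (unknown speaker).
    """
    best = min(models, key=lambda spk: mean_distance(frames, models[spk]))
    if mean_distance(frames, models[best]) <= threshold:
        return best
    return None

# Toy enrolled models and two test utterances (made-up values).
models = {"S1": (0.0, 0.0), "S2": (5.0, 5.0), "S3": (10.0, 0.0)}
known = identify_open_set([(9.7, 0.2), (10.4, -0.1)], models, threshold=1.0)
unknown = identify_open_set([(20.0, 20.0), (21.0, 19.0)], models, threshold=1.0)
print(known, unknown)
```

The threshold is the only extra parameter compared with the closed-set case. Setting it too low rejects enrolled speakers, and setting it too high accepts outsiders, which is why tuning it is the difficult part.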

Figure 1.2: Block diagram of open-set speaker identification system. (Blocks: input speech; framing and windowing; feature extraction; closed-set speaker identification over the speaker models S₁, S₂, S₃, …, S_N of the N-speaker set; identified speaker, S₃ in the example; speaker verification; accept (speaker S₃) or reject (not any of the N speakers).)

1.4.2 Speaker Verification

Speaker verification is used to verify the claimed identity of a person from his/her speech [1, 6, 11, 14]. The steps involved in speaker verification are shown as a block diagram in Figure 1.3. When an identity claim is made by a speaker, the speech data is compared with the model of the speaker whose identity is claimed. A threshold is used to reach the decision. If the distance of the test speech data to the target model is below the threshold, the speaker is accepted as the genuine speaker; otherwise, the claim is rejected. Unlike identification, this process involves a binary decision (accept/reject) about the claimed identity regardless of the population size. Hence, the performance of the verification system does not depend on the size of the population.
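A minimal sketch of the accept/reject decision follows. It assumes, purely for illustration, that each speaker model is one reference feature vector and that the distance is the mean Euclidean distance; the function name `verify`, the threshold and the data are hypothetical.

```python
import math

def verify(frames, claimed_id, models, threshold):
    """Accept the identity claim if the mean distance to the claimed
    speaker's model is below the threshold; reject otherwise."""
    model = models[claimed_id]  # only the claimed model is consulted
    distance = sum(math.dist(f, model) for f in frames) / len(frames)
    return distance < threshold

# Toy enrolled models and one test utterance (made-up values).
models = {"S1": (0.0, 0.0), "S2": (5.0, 5.0)}
frames = [(0.2, -0.1), (-0.3, 0.4)]

genuine = verify(frames, "S1", models, threshold=1.0)   # claim matches the speech
impostor = verify(frames, "S2", models, threshold=1.0)  # claim does not match
print(genuine, impostor)
```

Only one model is scored per claim, so the work done does not change when more speakers are enrolled.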

Figure 1.3: Block diagram of speaker verification system. (Blocks: speech data; framing and windowing; feature extraction; similarity against the reference model of the claimed speaker #M, selected by the speaker ID; threshold; decision: accept or reject.)

Depending on the mode of operation, speaker recognition can be classified as text-dependent and text-independent [1].

1.4.3 Text-dependent

In this mode, speakers are required to provide the same speech for both training and testing. Moreover, this mode of recognition needs cooperative users speaking fixed digit-string passwords or repeating prompted phrases from a small vocabulary [1, 8, 15]. Because of these constraints on operation, such a system can greatly improve recognition accuracy. To recognize the speaker, template matching is done between the training and testing speech data, using a technique such as Dynamic Time Warping (DTW).
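Template matching with DTW can be sketched as below. This is the textbook recurrence with a Euclidean local cost and no slope or window constraints, which may differ from the variant used in the cited systems. The function name `dtw_distance` and the one-dimensional toy sequences are assumptions for the example.

```python
import math

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences of feature vectors.

    Classic recurrence with steps (i-1, j), (i, j-1), (i-1, j-1) and a
    Euclidean local cost; no slope or window constraint is applied.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Training template, the same pattern spoken more slowly, and a different pattern.
template = [(0.0,), (1.0,), (2.0,), (1.0,)]
same_slower = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,)]
different = [(3.0,), (3.0,), (3.0,), (3.0,)]

d_same = dtw_distance(template, same_slower)
d_diff = dtw_distance(template, different)
print(d_same, d_diff)
```

Because the warping path may repeat frames, the slower rendition of the same pattern aligns with the template at zero cost, while the different pattern does not. This tolerance to speaking-rate variation is what makes DTW suitable when the training and testing text are the same.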

1.4.4 Text-independent

In this mode, the training and testing speech need not be the same [1, 8, 15]. As we have seen, the text-dependent mode imposes constraints on the speaker. However, there are cases where such constraints are difficult to impose, including background verification, forensics and remote biometric person authentication. For such cases, a more flexible recognition system that can operate without explicit user cooperation and independently of the spoken text is preferable. Since different speech data is used for training and testing, the performance is not as good as in the text-dependent mode. Text-independent recognition is more difficult but also more flexible [1, 8]. A modelling technique such as Vector Quantization (VQ) or Gaussian Mixture Models (GMM) is used in this case. Since GMM captures statistical variations better than VQ, GMM is used as the state-of-the-art modelling technique [16]. Text-independent speaker recognition is useful if the speaker is non-cooperative [8]. In this thesis, we focus on the closed-set, text-independent speaker identification task.
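GMM-based scoring for closed-set, text-independent identification can be sketched as follows. This is a sketch under stated assumptions rather than the system built in this thesis: the GMMs use diagonal covariances, their parameters are set by hand instead of being trained (EM training is omitted), and the function name `gmm_avg_log_likelihood` and all numeric values are hypothetical.

```python
import math

def gmm_avg_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_p = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                # log of a 1-D Gaussian density, summed over dimensions
                log_p += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
            comp_logs.append(log_p)
        # log-sum-exp over mixture components for numerical stability
        peak = max(comp_logs)
        total += peak + math.log(sum(math.exp(c - peak) for c in comp_logs))
    return total / len(frames)

# Hand-set two-component models for two speakers (assumed, not trained).
speakers = {
    "S1": ([0.5, 0.5], [(0.0, 0.0), (1.0, 1.0)], [(1.0, 1.0), (1.0, 1.0)]),
    "S2": ([0.5, 0.5], [(5.0, 5.0), (6.0, 4.0)], [(1.0, 1.0), (1.0, 1.0)]),
}
test_frames = [(5.2, 4.9), (5.8, 4.3)]

scores = {spk: gmm_avg_log_likelihood(test_frames, *params)
          for spk, params in speakers.items()}
identified = max(scores, key=scores.get)
print(identified)
```

Frames are scored independently of their order, so no alignment between the training and testing text is needed. The speaker whose model gives the highest average log-likelihood is selected, as in the closed-set decision rule.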
