Depending on the task objective, speaker recognition can be classified as speaker identification and speaker verification.
1.4.1 Speaker Identification
In this case there is no identity claim, and the system identifies who the speaker is [1,6,11,12]. The identification process compares the test speech data with all the speaker models to decide the speaker of the test speech data. The computational complexity of speaker identification increases with the population size, and at the same time performance degrades [6]. Improving speaker recognition performance for limited data and a large database is an interesting and challenging task. Speaker identification can be further classified as closed-set and open-set [11, 13].
1.4.1.1 Closed-set
In this case, the test speaker is known a priori to be a member of a set of N speakers [11, 13].
The steps involved in closed-set speaker identification are shown as a block diagram in Figure 1.1.
In closed-set speaker identification, the speaker whose model best matches the test speech data is declared as the speaker of the test data. The output of the closed-set speaker identification system is the identity of a speaker. This system provides high security in a closed environment. Its disadvantage is that only speakers enrolled a priori can use the system. This type of system may be useful where high security is needed, for example in defense and private organizations.
1.4 Classifications of Speaker Recognition
[Block diagram. Blocks: speech data; framing and windowing; feature extraction; similarity with the reference models of speakers #1, #2, ..., #N; selection criteria; identified speaker.]
Figure 1.1: Block diagram of closed-set speaker identification system.
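The closed-set decision rule described above can be sketched as follows. This is a minimal illustration and not the system developed in this thesis: the representation of a model as a small set of reference vectors, the Euclidean distance, and the names `avg_distance` and `identify_closed_set` are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def avg_distance(frames, model):
    """Mean distance from each test frame to its nearest reference vector."""
    return sum(min(euclidean(f, m) for m in model) for f in frames) / len(frames)

def identify_closed_set(frames, models):
    """Return the enrolled speaker whose model best matches the test frames."""
    return min(models, key=lambda spk: avg_distance(frames, models[spk]))

# Hypothetical toy models: each speaker is a handful of 2-D reference vectors.
models = {
    "speaker_1": [(0.0, 0.0), (1.0, 0.0)],
    "speaker_2": [(5.0, 5.0), (6.0, 5.0)],
}
test_frames = [(0.9, 0.1), (0.1, -0.1)]
best = identify_closed_set(test_frames, models)
```

Because every enrolled model is scored, the cost of one identification grows with the number of speakers N, and the system always returns one of the enrolled identities.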
1.4.1.2 Open-set
A speaker identification system in which the test speaker may come from outside the set of N enrolled speakers is known as open-set speaker identification [11,13]. The steps involved in open-set speaker identification are shown as a block diagram in Figure 1.2. In this case, the closed-set speaker identification system first identifies the enrolled speaker closest to the test speech data.
Then, the verification system compares the distance to this speaker's model with a threshold to come up with a decision. If the speaker is accepted, then this speaker is the identified speaker for the test data. Otherwise, the system has to generate an error message stating that the speaker is unknown. Unlike the closed-set system, the open-set speaker identification system does not restrict who may use it. The disadvantages are that the design is more complex and that finding the optimum threshold for verifying the speaker is a tedious task. This system can be used in public-facing services such as banking, airports and railways.
TH-797_05610204
1. Introduction
[Block diagram. Blocks: input speech; framing and windowing; feature extraction; closed-set speaker identification against the speaker models S1, S2, S3, ..., SN of the N-speaker set; identified speaker (S3 in the illustration); speaker verification; accept (speaker S3) or reject (not any of the N speakers).]
Figure 1.2: Block diagram of open-set speaker identification system.
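The two-stage open-set procedure (closed-set identification followed by a threshold test) can be sketched as below. The distance measure, the toy models, the threshold value of 1.0 and the function name `identify_open_set` are assumptions chosen for illustration; in practice the threshold has to be tuned, which is the tedious step noted above.

```python
import math

def avg_distance(frames, model):
    """Mean Euclidean distance from each frame to its nearest reference vector."""
    return sum(min(math.dist(f, m) for m in model) for f in frames) / len(frames)

def identify_open_set(frames, models, threshold):
    """Closed-set identification followed by a threshold check.

    Returns the closest enrolled speaker if the distance is within the
    threshold, otherwise None to signal an unknown speaker.
    """
    scores = {spk: avg_distance(frames, model) for spk, model in models.items()}
    closest = min(scores, key=scores.get)          # stage 1: closed-set decision
    return closest if scores[closest] <= threshold else None  # stage 2: verify

# Hypothetical toy models and threshold.
models = {
    "S1": [(0.0, 0.0), (1.0, 0.0)],
    "S2": [(5.0, 5.0), (6.0, 5.0)],
}
known = identify_open_set([(0.9, 0.1), (0.1, -0.1)], models, threshold=1.0)
unknown = identify_open_set([(20.0, 20.0)], models, threshold=1.0)
```

Here `known` is an enrolled identity, while `unknown` is `None`, which corresponds to the "speaker is unknown" error message.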
1.4.2 Speaker Verification
Speaker verification is used to verify the claimed identity of a person from his/her speech [1, 6, 11, 14].
The steps involved in speaker verification are shown as a block diagram in Figure 1.3. In this case, when a speaker makes an identity claim, the speech data is compared with the model of the speaker whose identity is claimed. A threshold is used to come up with the decision. If the distance of the test speech data to the target model is below the threshold, then the speaker is accepted as the genuine speaker; otherwise the claim is rejected. Unlike identification, this process involves a binary decision (accept/reject) about the claimed identity regardless of the population size. Hence, the performance of the verification system does not depend on the size of the population.
[Block diagram. Blocks: speech data; framing and windowing; feature extraction; similarity with the reference model of speaker #M, selected by the claimed speaker ID (#M); threshold; decision (accept or reject).]
Figure 1.3: Block diagram of speaker verification system.
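The verification decision can be sketched as a single comparison against the claimed speaker's model. As before, the distance measure, the toy models, the threshold and the name `verify` are illustrative assumptions and not the thesis's implementation.

```python
import math

def avg_distance(frames, model):
    """Mean Euclidean distance from each frame to its nearest reference vector."""
    return sum(min(math.dist(f, m) for m in model) for f in frames) / len(frames)

def verify(frames, claimed_id, models, threshold):
    """Accept the identity claim if the distance to the claimed model is below the threshold."""
    return avg_distance(frames, models[claimed_id]) < threshold

# Hypothetical toy models; only the claimed speaker's model is ever scored.
models = {
    "speaker_M": [(0.0, 0.0), (1.0, 0.0)],
    "speaker_K": [(5.0, 5.0), (6.0, 5.0)],
}
frames = [(0.9, 0.1), (0.1, -0.1)]
genuine = verify(frames, "speaker_M", models, threshold=1.0)
impostor = verify(frames, "speaker_K", models, threshold=1.0)
```

Only one model is evaluated per claim, so the work done per decision stays the same however many speakers are enrolled.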
Depending on the mode of operation, speaker recognition can be classified as text-dependent and text-independent [1].
1.4.3 Text-dependent
In this mode, speakers are required to provide the same speech for both training and testing. Moreover, this mode of recognition needs cooperative users who speak fixed digit-string passwords or repeat prompted phrases from a small vocabulary [1, 8, 15]. Due to these constraints on operation, such a system can achieve greatly improved recognition accuracy. To recognize the speaker, template matching is done between the training and testing speech data, using a technique such as Dynamic Time Warping (DTW).
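The template matching step can be illustrated with the classic dynamic-programming form of DTW. This is a minimal sketch under simplifying assumptions: the sequences hold scalar values standing in for per-frame feature vectors, the local cost is the absolute difference, and no path constraints or length normalisation are applied.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Accumulated cost of the best time alignment between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical training template and two test utterances.
template = [1, 2, 3, 4]
same_phrase_slower = [1, 1, 2, 2, 3, 3, 4, 4]   # same content, spoken at half speed
different_phrase = [4, 3, 2, 1]

d_same = dtw_distance(template, same_phrase_slower)
d_diff = dtw_distance(template, different_phrase)
```

The warping absorbs the difference in speaking rate, so `d_same` is smaller than `d_diff` even though the two matching sequences differ in length.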
1.4.4 Text-independent
In this mode, the training and testing speech need not be the same [1, 8, 15]. As we have seen, the text-dependent mode imposes constraints on the speaker. However, there are cases in which such constraints are difficult to impose. These include background verification, forensics and remote biometric person authentication. For such cases, a more flexible recognition system that can operate without explicit user cooperation and independently of the spoken text is preferable. Since different speech data is used for training and testing, the performance is not as good as in text-dependent mode. Text-independent recognition is more difficult but also more flexible [1, 8]. A modelling technique such as Vector Quantization (VQ) or Gaussian Mixture Models (GMM) is used in this case. Since GMM captures statistical variations better than VQ, GMM is used as the state-of-the-art modelling technique [16]. Text-independent speaker recognition is useful if the speaker is non-cooperative [8]. In this thesis, we focus on the closed-set, text-independent speaker identification task.
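To make the GMM-based, text-independent case concrete, the sketch below scores test frames against one GMM per speaker and picks the speaker with the highest average log-likelihood. It is an assumption-laden illustration: the mixtures use diagonal covariances with hand-set parameters (no training step is shown), and the names `log_gaussian_diag` and `gmm_avg_log_likelihood` are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))

def gmm_avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood under a GMM given as (weight, mean, var) tuples."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gaussian_diag(x, mu, var) for w, mu, var in gmm]
        peak = max(logs)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total / len(frames)

# Hypothetical two-component models; the weights of each model sum to 1.
speaker_gmms = {
    "speaker_1": [(0.5, (0.0, 0.0), (1.0, 1.0)), (0.5, (1.0, 1.0), (1.0, 1.0))],
    "speaker_2": [(0.5, (5.0, 5.0), (1.0, 1.0)), (0.5, (6.0, 6.0), (1.0, 1.0))],
}
test_frames = [(0.2, -0.1), (0.8, 1.1), (0.5, 0.4)]
scores = {spk: gmm_avg_log_likelihood(test_frames, g) for spk, g in speaker_gmms.items()}
best = max(scores, key=scores.get)
```

The frames are scored independently of their order, which is why this kind of model does not require the training and testing text to be the same.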