

the dimension of the features to $L$ ($L \leq C$) as follows,

$$
\mathbf{z}_t^{C} \;=\; \mathbf{L}\,\mathbf{x}_t^{C}
\;=\; \begin{bmatrix} \mathbf{L}_{L}\,\mathbf{x}_t^{C} \\ \mathbf{L}_{C-L}\,\mathbf{x}_t^{C} \end{bmatrix}
\;=\; \begin{bmatrix} \mathbf{z}_t^{L} \\ \mathbf{z}_t^{C-L} \end{bmatrix}
\tag{4.4}
$$

where $\mathbf{L}_{L}$ and $\mathbf{L}_{C-L}$ contain the first $L$ and the remaining $(C-L)$ rows of the $C \times C$ projection matrix $\mathbf{L}$, respectively. The feature vectors $\mathbf{z}_t^{L}$ and $\mathbf{z}_t^{C-L}$ comprise the first $L$ and the remaining $(C-L)$ elements of the linearly transformed feature vector $\mathbf{z}_t^{C}$, respectively. The vector $\mathbf{z}_t^{L}$ forms the $L$-dimensional decorrelated feature vector. Using $\mathbf{L}_{L}$, both the synthesized neutral and the synthesized stressed speech features are projected onto a decorrelated feature subspace whose dimension is lower than the default one. Moreover, the MLLT-based semi-tied adaptation technique is employed to further decorrelate the resulting features and to develop the decorrelated model-space. In general, the final feature dimension $L$ is selected as 39 or 40 for the ASR system. As discussed earlier, stressed speech exhibits a greater variance than neutral speech, leading to degraded performance. Consequently, learning the subspace transformation matrix $\mathbf{L}_{L}$ on the synthesized neutral speech to reduce the feature dimension below 40 helps alleviate this variance mismatch. During training, the synthesized neutral speech features are projected onto the lower-dimensional subspace before learning the parameters of the acoustic model, as depicted in Figure 4.1. For decoding, the synthesized stressed speech features are also projected onto the decorrelated feature subspace using the same projection matrix used in training.
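To make the projection step concrete, the following is a minimal NumPy sketch of Eq. (4.4): spliced features are multiplied by the first $L$ rows of the LDA matrix to obtain the decorrelated low-dimensional features. The function names, the random stand-ins for the MFCC features and the LDA matrix, and the choice $L = 32$ are purely illustrative assumptions; in practice the projection matrix would be estimated on the synthesized neutral speech.

```python
import numpy as np

def splice_frames(feats, context=3):
    """Stack each frame with its +/- `context` neighbours (edge frames repeated),
    turning T x d features into T x ((2*context + 1) * d) spliced features."""
    T, d = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

def project_to_subspace(spliced, lda_full, L):
    """Keep only the first L rows (L_L) of the C x C LDA matrix and project the
    C-dimensional spliced features onto the L-dimensional decorrelated subspace,
    i.e. z_t^L = L_L x_t^C from Eq. (4.4)."""
    L_L = lda_full[:L, :]        # L x C
    return spliced @ L_L.T       # T x L

# Illustrative usage with random stand-ins for real MFCC features and a
# pre-estimated LDA matrix (both would come from the actual training data).
mfcc = np.random.randn(200, 13)                       # 200 frames of 13-dim MFCCs
spliced = splice_frames(mfcc, context=3)              # 200 x 91 (C = 91)
lda_full = np.linalg.qr(np.random.randn(91, 91))[0]   # placeholder C x C matrix
z_L = project_to_subspace(spliced, lda_full, L=32)    # dimension below 40
```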

4.4 Normalization of Speaker Variability

The speaker changes the movement of the anatomical structures of the speech production system to express the condition of a stressful environment while preserving the acoustic information of the speech signal.

As discussed in Chapter 1, speakers of varying age and gender produce the same speech utterances under a similar stress condition with differing stress parameters. Numerous works in the literature have shown that speaker-specific variability appears pronouncedly in stressed speech [4–8]. This speaker variability causes a high overlap between the different speech units of neutral and stressed speech. Consequently, the feature- and model-space of a speaker-independent (SI) automatic speech recognition (ASR) system developed using neutral speech data exhibit high variance for the recognition of stressed speech. Therefore, normalizing the speaker information in both the neutral and the stressed speech will reduce the variance mismatch resulting from speakers under stress conditions. Adapting the ASR system with the speaker-normalized speech features will significantly diminish the acoustic mismatch between the training and the test environments and lead to an ASR system that is robust for users under stress conditions. The following paragraph briefly summarizes the method used for normalizing the speaker variability, followed by the evaluation of speaker normalization for the stressed speech.

Method of Speaker Normalization: In order to address the speaker variability, we have employed the feature-space maximum-likelihood linear regression (fMLLR)-based speaker normalization technique [144, 145, 161]. The decorrelated feature vector $\mathbf{z}_{t,L}^{\phi}$ of a particular speaker $\phi$, obtained from the LDA+MLLT processing discussed in Section 4.3, is adapted to $\hat{\mathbf{z}}_{t,L}^{\phi}$ through the fMLLR transformation matrix $\mathbf{W}^{\phi}$ as follows,

$$
\hat{\mathbf{z}}_{t,L}^{\phi} \;=\; \mathbf{W}^{\phi}\,\boldsymbol{\xi}_{t,L}^{\phi}
\tag{4.5}
$$

where $\boldsymbol{\xi}_{t,L}^{\phi} = \begin{bmatrix} 1 & (\mathbf{z}_{t,L}^{\phi})^{T} \end{bmatrix}^{T}$ constitutes the feature vector $\mathbf{z}_{t,L}^{\phi}$ along with the unity element. The employed fMLLR transformation $\mathbf{W}^{\phi}$, having dimension $(L \times (L+1))$, is learned in the speaker adaptive training (SAT) framework [95, 97, 146, 162] using the decorrelated features of the synthesized neutral speech utterances. Speaker adaptive training adapts the speaker-independent speech recognition system into a speaker-dependent (SD) speech recognition system using a small amount of adaptation data. In the works reported in [94–99], speaker-dependent speech recognition systems are found to be very effective, with improved recognition performance for unknown test speakers. In addition, the features obtained by the concatenation of the discussed transformations (LDA + MLLT + fMLLR) are reported to be very effective for DNN-based ASR systems [99, 164].
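As an illustrative sketch of Eq. (4.5), the short Python fragment below applies a per-speaker fMLLR transform of dimension $(L \times (L+1))$ to the decorrelated features after prepending the unity element. The function name `apply_fmllr`, the identity-transform check and the random features are hypothetical stand-ins; an actual $\mathbf{W}^{\phi}$ would be estimated per speaker under SAT.

```python
import numpy as np

def apply_fmllr(z, W):
    """Apply a per-speaker fMLLR transform W (L x (L+1)) to L-dimensional
    decorrelated features z (T x L), following Eq. (4.5):
    z_hat = W @ [1 ; z] for every frame."""
    T, L = z.shape
    assert W.shape == (L, L + 1)
    xi = np.hstack([np.ones((T, 1)), z])   # prepend the unity element
    return xi @ W.T                        # T x L adapted features

# Illustrative usage: an identity transform (bias b = 0, A = I) leaves the
# features unchanged; a real W would be learned from adaptation data.
L = 32
z = np.random.randn(100, L)
W_identity = np.hstack([np.zeros((L, 1)), np.eye(L)])   # [b | A] with b = 0
assert np.allclose(apply_fmllr(z, W_identity), z)
```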

Evaluation of Speaker Normalization: The effectiveness of speaker normalization for reducing the acoustic mismatch between the neutral and the stressed speech is measured by evaluating the performance of stressed speech recognition tasks using the SUSSC database [118], discussed in Section 2.1 of Chapter 2. All the experimental results are obtained using an experimental setup similar to the one described in Subsection 2.3.2 of Chapter 2.


Table 4.1: The recognition performances for stressed speech (WER in %) obtained by employing the speaker normalization method over the MFCC and the TEO-CB-Auto-Env features with respect to the GMM-HMM, SGMM-HMM, DNN-HMM and DNN-SGMM systems.

| Acoustic modelling approach | Stress class | Conventional WER (%): MFCC | Conventional WER (%): TEO-CB-Auto-Env | Speaker adaptation WER (%): MFCC | Speaker adaptation WER (%): TEO-CB-Auto-Env | Relative reduction (%): MFCC | Relative reduction (%): TEO-CB-Auto-Env |
|---|---|---|---|---|---|---|---|
| GMM-HMM | Neutral | 0.71 | 0 | 0 | 0 | 100 | NI |
| GMM-HMM | Angry | 25.71 | 41.71 | 7.43 | 7.71 | 71.10 | 81.51 |
| GMM-HMM | Sad | 7.41 | 31.31 | 3.03 | 6.40 | 59.11 | 79.56 |
| GMM-HMM | Lombard | 6.73 | 24.07 | 1.52 | 3.20 | 77.41 | 86.70 |
| GMM-HMM | Happy | 7.86 | 28.29 | 3.14 | 6.00 | 60.05 | 78.79 |
| SGMM-HMM | Neutral | 0.14 | 0 | 0 | 0 | 100 | NI |
| SGMM-HMM | Angry | 48.71 | 44.43 | 15.71 | 5.00 | 67.75 | 88.75 |
| SGMM-HMM | Sad | 24.07 | 36.36 | 10.77 | 3.70 | 55.25 | 89.82 |
| SGMM-HMM | Lombard | 26.26 | 28.79 | 6.06 | 1.35 | 76.92 | 95.31 |
| SGMM-HMM | Happy | 28.14 | 33.00 | 9.86 | 3.57 | 64.96 | 89.18 |
| DNN-HMM | Neutral | 1.29 | 0.71 | 0.29 | 0.14 | 77.52 | 80.28 |
| DNN-HMM | Angry | 18.86 | 11.29 | 7.00 | 3.71 | 62.88 | 67.14 |
| DNN-HMM | Sad | 9.09 | 12.96 | 1.35 | 3.37 | 85.15 | 74.00 |
| DNN-HMM | Lombard | 8.42 | 7.41 | 2.53 | 2.02 | 69.95 | 72.74 |
| DNN-HMM | Happy | 8.57 | 7.29 | 2.14 | 3.29 | 75.03 | 54.87 |
| DNN-SGMM | Neutral | 1.71 | 0.86 | 1.14 | 0.43 | 33.33 | 50.00 |
| DNN-SGMM | Angry | 24.29 | 12.57 | 21.43 | 12.57 | 11.77 | 0 |
| DNN-SGMM | Sad | 16.33 | 13.13 | 15.15 | 9.93 | 7.23 | 24.37 |
| DNN-SGMM | Lombard | 16.16 | 7.74 | 12.12 | 4.71 | 25.00 | 39.15 |
| DNN-SGMM | Happy | 13.71 | 7.57 | 11.29 | 6.71 | 17.65 | 11.36 |

The recognition performances for the stressed speech after normalizing the speaker variability using the 39-dimensional TEO-CB-Auto-Env and the 39-dimensional MFCC features with respect to the GMM-HMM, the SGMM-HMM, the DNN-HMM and the DNN-SGMM systems are summarized in Table 4.1. The DNN-HMM and the DNN-SGMM systems employ a DNN comprising 3 hidden layers to estimate the posterior probabilities over their states. The WERs reported in this table for the conventional case correspond to the baseline ASR system developed on the features of the original raw neutral speech data and tested using the original raw neutral and stressed speech utterances. The WER values for the speaker adaptation case are obtained from the 40-dimensional features of the original raw speech signals. These 40-dimensional features are derived by time splicing with a context size of 7, i.e., ±3 frames around the current frame of speech. After that, the LDA-based low-rank subspace projection is employed in the MLLT-based semi-tied adaptation framework to reduce the dimension to 40 and to decorrelate the feature- and the model-space. This is followed by the fMLLR-based linear regression transformation in the SAT framework for the normalization of speaker variability. In this table, bold-face WER values denote the best speech recognition performances, and the acronym ‘NI’ denotes no relative improvement in comparison to the conventional case.

The WERs in this table show that speaker normalization significantly improves the recognition performances in comparison to the conventional cases for all the explored stress classes. This improvement is consistent across all the feature types as well as the acoustic modelling approaches. After normalizing the speaker information, the SGMM-HMM system developed on the TEO-CB-Auto-Env features results in the best recognition performances, with maximum relative reductions in WER of 88.75%, 89.82%, 95.31% and 89.18% over the conventional cases under the angry, sad, Lombard and happy conditions, respectively. These experimental results indicate a similar phonetic structure for the training and the testing environments of the speech recognition system when it is trained and tested using the speaker-normalized features of neutral and stressed speech, respectively. Speaker normalization reduces the variance mismatch between the speech units of neutral and stressed speech and helps create similar characteristics for the training and the test environments of the speech recognizer. The resulting ASR system leads to improved accuracy in recognizing the stressed speech.
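For clarity, the relative reduction values in Table 4.1 follow the usual formula $100 \times (\mathrm{WER}_{conv} - \mathrm{WER}_{adapt})/\mathrm{WER}_{conv}$; the short sketch below (with a hypothetical function name) reproduces, for instance, the 88.75% figure for the SGMM-HMM system with TEO-CB-Auto-Env features under the angry condition.

```python
def relative_wer_reduction(wer_conventional, wer_adapted):
    """Relative reduction (%) in WER of the speaker-adapted system over the
    conventional baseline: 100 * (baseline - adapted) / baseline."""
    return 100.0 * (wer_conventional - wer_adapted) / wer_conventional

# SGMM-HMM with TEO-CB-Auto-Env features under the angry condition
# in Table 4.1: 44.43% WER -> 5.00% WER.
print(round(relative_wer_reduction(44.43, 5.00), 2))  # ~88.75
```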

These observations motivate us to normalize the speaker variability from the synthesized speech signals. The normalization of speaker information from both the synthesized neutral and the synthesized stressed speech will additionally reduce the variance mismatch resulting from different speak-