Subspace Modeling Approaches for the Stressed Speech

Stress induces variations in the acoustic parameters of the speech production system, and these variations modify the properties of the speech produced under stress. Investigations into the changes in the characteristics of stressed speech have applied various signal processing algorithms in the acoustic space, the feature space and the model space. The development of robust and computationally efficient algorithms is the foremost requirement for implementing practical applications in real-life scenarios. Among the numerous techniques explored in the literature, subspace projection is one of the most popular and widely accepted methods across many fields of research. The subspace projection method supports an effective investigation of the subject within a framework that offers appealing realization capacity, efficient algorithms and simple computation. Over the past several years, subspace projection-based approaches have received a great deal of attention in the field of stressed speech processing.

In the area of stressed speech, subspace projections are derived mainly by exploring the linear and the non-linear relationships between the speech and the stress information. Hansen et al. [69] proposed a subspace projection onto the source generator framework. The source generator framework was created from spectral features in an iterative manner, and this iterative approach was found to be very effective for enhancement and equalization of the source generator.

The source generator optimization method yielded a prominent technique for capturing the variation in speech production under emotional and noisy conditions. In another work [40], an orthogonal relationship between the speech-specific and the stress-specific attributes is assumed in order to separate these two kinds of information in speech produced under stress. The stressed speech is orthogonally projected onto the speech subspace, and this projection was reported to be very effective for normalizing the stress information. In that work, the speech subspace was created by performing K-means clustering [1, 2] over the neutral speech features. The experimental observations of that work highlight the importance of estimating an effective speech subspace. In the work reported in [34], an orthogonal decomposition of the noise signal within a rank-one projection framework was proposed for the normalization of multi-microphone noise. The multi-microphone noise is effectively suppressed by maximizing the total variance of the coherent noise.
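The orthogonal projection onto a K-means-derived speech subspace can be illustrated with a minimal sketch. It is not the implementation of [40]: the feature dimension (13), the number of clusters (4), the synthetic data and the helper names `kmeans_centroids` and `orthogonal_projector` are all assumptions made for illustration, and the centroids are assumed to be linearly independent.

```python
import numpy as np


def kmeans_centroids(features, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct feature vectors.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every feature vector to its nearest centroid.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


def orthogonal_projector(centroids):
    """Orthogonal projector onto the span of the centroids."""
    # Columns of q form an orthonormal basis of the centroid span
    # (assumes the centroids are linearly independent).
    q, _ = np.linalg.qr(centroids.T)
    return q @ q.T


# Synthetic stand-ins for real feature vectors (hypothetical data).
rng = np.random.default_rng(0)
neutral = rng.normal(size=(200, 13))
stressed = rng.normal(size=(5, 13)) + 0.5

centroids = kmeans_centroids(neutral, k=4)
P = orthogonal_projector(centroids)

normalized = stressed @ P          # component inside the speech subspace
residual = stressed - normalized   # component orthogonal to the subspace
```

Because `P` is symmetric and idempotent, every stressed vector splits into a part inside the speech subspace and a part orthogonal to it. Under the orthogonality assumption described above, the orthogonal part is the one attributed to stress.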

In the last few years, sparse representation and compressive sensing have been used extensively for the recognition and classification of speech produced under stress, which has opened a new direction for stressed speech processing research [42, 47, 51, 74, 75]. Sparse representations using over-complete matrices are reported to provide a viable alternative to projection using full-rank matrices [76–78]. In the work reported in [74], the dictionary is learned in a multiview supervised manner for multiview representation analysis in speech emotion recognition. In the work reported in [75], the incomplete sparse least square regression (ISLSR) model is used for speech emotion recognition. The ISLSR is based on the linear relationship between the speech features and the corresponding emotion labels. Jia-Ching et al. [42] proposed a mapping of variance on a non-uniform auditory scale-frequency in the sparse domain for emotion verification. In another work [51], the mean parameters of the HMM were exploited in the sparse domain to improve the accuracy of speech recognition for stressed speech. These studies show that the iterative process of sparse representation results in a more precise and robust representation of stressed speech features, which, in turn, leads to an effective technique for analyzing stressed speech.
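The iterative nature of sparse representation over an over-complete matrix can be sketched with orthogonal matching pursuit (OMP). OMP is only one common greedy solver and is not claimed to be the method of the cited works. The dictionary size (20 x 50), the sparsity level (3) and the random data are assumptions chosen for illustration.

```python
import numpy as np


def omp(dictionary, signal, n_nonzero):
    """Greedy sparse coding: signal ~ dictionary @ coeffs with few non-zeros."""
    residual = signal.copy()
    support = []
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        correlations = np.abs(dictionary.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same atom twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(dictionary[:, support], signal, rcond=None)
        residual = signal - dictionary[:, support] @ sol
    coeffs = np.zeros(dictionary.shape[1])
    coeffs[support] = sol
    return coeffs


# Hypothetical over-complete dictionary: 50 unit-norm atoms in 20 dimensions.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50))
D /= np.linalg.norm(D, axis=0)

# A synthetic feature vector built from three atoms.
true_coeffs = np.zeros(50)
true_coeffs[[3, 17, 41]] = [1.5, -2.0, 0.8]
x = D @ true_coeffs

estimate = omp(D, x, n_nonzero=3)
reconstruction_error = np.linalg.norm(x - D @ estimate)
```

Each iteration adds one atom and re-fits the coefficients, so the reconstruction error never increases from one iteration to the next. This stepwise refinement is the sense in which the representation is iterative.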

A speech signal mainly comprises phonetic and speaker information. As discussed earlier, a speaker modifies the speech production system in response to adverse environmental factors. Under stress, both the phonetic and the speaker-specific attributes of the speech are affected. Consequently, an ASR system that is trained on neutral speech and tested on stressed speech exhibits a severe degradation in recognition performance. Therefore, transforming the stressed speech through a subspace projection matrix that captures the principal dimensions of the acoustic variations represented by the neutral speech would help in improving the recognition performance. Furthermore, reducing the rank of the projection matrix will, in turn, reduce the mismatch in the variances resulting from the stress. From the reported works, it is also observed that, for stressed speech, the high-frequency region is affected more than the low-frequency region [19, 41]. Chen [79] has shown that the cepstral mean values have an exponential spectral tilt under stress. In another work [80], the low-rank features of the emotion-specific component, which are derived using the hidden factor analysis (FA) and the mixture of factor analysis (MFA), are explored for speech emotion recognition. In the recently reported works [81, 82] on children’s mismatched ASR systems, the low-rank subspace projection has been explored and was reported to be very effective for reducing the variance mismatch between the training and the test environments.
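A low-rank projection learned from neutral speech can be sketched as follows. The principal directions are obtained here by a singular value decomposition of the centred neutral features, which is one standard choice and not necessarily the estimator used in the cited works. The rank (3), the feature dimension (13), the synthetic data and the helper names are assumptions.

```python
import numpy as np


def low_rank_projector(neutral, rank):
    """Mean of the neutral features and a rank-limited orthogonal projector."""
    mean = neutral.mean(axis=0)
    # Rows of vt are the principal directions of the centred neutral data.
    _, _, vt = np.linalg.svd(neutral - mean, full_matrices=False)
    basis = vt[:rank].T                 # (dim, rank), orthonormal columns
    return mean, basis @ basis.T


def project(features, mean, projector):
    """Project features onto the neutral subspace around the neutral mean."""
    return (features - mean) @ projector + mean


def total_variance(features):
    """Sum of the per-dimension variances."""
    return np.var(features, axis=0).sum()


# Hypothetical data: neutral variation concentrated in three dimensions,
# "stressed" data modelled as neutral data plus isotropic perturbation.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 2.0] + [0.2] * 10)
neutral = rng.normal(size=(500, 13)) * scales
stressed = neutral[:100] + rng.normal(scale=1.0, size=(100, 13))

mean, P = low_rank_projector(neutral, rank=3)
projected = project(stressed, mean, P)

variance_before = total_variance(stressed)
variance_after = total_variance(projected)
```

An orthogonal projector cannot increase the total variance, so `variance_after` is never larger than `variance_before`. Lowering the rank discards more of the variation that lies outside the dominant neutral directions.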

In addition to the speech signal, subspace projection techniques have been successfully employed for investigating stress information using the electroencephalogram (EEG) signal and facial expressions [39, 83–85]. In the work reported in [83], the canonical correlation analysis (CCA) of the EEG signal and the privileged information of the stimulus video was exploited for emotion recognition. The CCA between the EEG signal and the video features was used to create a new subspace. This new subspace resulted in improved emotion recognition performance through the implicit integration of privileged information in the training phase. In the research area of image processing, each pixel vector was projected onto a subspace that is orthogonal to the undesired signatures [39]. The proposed orthogonal projection was noted to be very effective for the classification of hyper-spectral images. In another work [85], the
