A Feature Selection Method in Spectro-Temporal Domain Based on Gaussian Mixture Models

Nafiseh Esfandian1, Farbod Razzazi2, Alireza Behrad3, Sara Valipour4 1.Faculty of Engineering, Islamic Azad University, Qaemshahr Branch, QaemShahr, Iran 2.Faculty of Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran

3.Faculty of Engineering, Shahed University, Tehran, Iran 4.Faculty of Engineering, Islamic Azad University, Arak Branch, Arak, Iran

na_esfandian@yahoo.com, razzazi@srbiau.ac.ir, behrad@shahed.ac.ir, valipour_Sara@yahoo.com

Abstract—Spectro-temporal representation of speech has been regarded as one of the leading speech representation approaches in speech recognition systems in recent years. This representation suffers from the high dimensionality of the feature space, which makes the domain unusable in practical speech recognition systems. In this paper, a new feature selection method in the spectro-temporal domain is proposed, in which clustering techniques are applied to the spectro-temporal domain to reduce the dimensions of the feature space. In the proposed approach, the spectro-temporal space is clustered based on Gaussian Mixture Models (GMMs). The mean vectors and covariance matrix elements of the clusters are taken as part of the feature vector of the frame. The new feature vectors were tested on the classification of the voiced stops (/b/, /d/, /g/) of the TIMIT database. Using the new feature vectors, the classification rate improved to 70.45%, which is 7.95% higher than the previous best result.

Keywords—speech recognition; speech processing; auditory system; feature extraction; clustering methods.

I. INTRODUCTION

One of the most important issues in speech recognition is feature extraction from the speech signal. Successful examples of audio features are Mel-scale frequency cepstral coefficients and spectro-temporal features inspired by hearing models [1]. In particular, spectro-temporal features use a simplified model of the human cortical stage [2]. In recent years, a computational auditory model has been derived from neurological and biological investigations of the various stages of the auditory system of the brain, and it is being developed for various applications such as speech recognition [3], voice activity detection [4], pitch detection [5], and speech enhancement systems [6]. This model has two main stages. In the first, auditory stage, the auditory spectrogram of the signal is extracted. In the next stage, spectro-temporal features of speech are extracted by applying a set of two-dimensional spectro-temporal receptive field (STRF) filters to the spectrogram. STRF filters are scaled versions of a two-dimensional impulse response [1]. It has been observed that these features are more robust in noisy environments than cepstral coefficients [7]. The main drawback of spectro-temporal features is the large number of extracted features, which may affect the parameter estimation accuracy in the training phase of speech classifiers. Methods such as PCA, LDA and neural networks have been used to reduce the number of features in the spectro-temporal domain [8]. These are general feature selection methods and are therefore not precisely tailored to speech classification problems. The motivation of this study is that the classes' information is concentrated in specific parts of the feature space; in other words, it is desirable to represent the features as a number of clusters. Some studies show that the space of MFCC features is not properly clustered [9], i.e. this space does not have distinct clusters. Thus, applying clustering methods to MFCC may be regarded merely as a kind of space coverage.

In this paper, clustering methods are used to reduce the spectro-temporal features to a few effective features for each frame. In the new clustered feature space, phonemes are more separable. There are various methods for clustering; one of them is the Gaussian Mixture Model (GMM) [10]. Gaussian Mixture Models are able to model irregular data. Therefore, in this paper, GMM is used for space clustering as the feature reduction approach. The organization of the paper is as follows. The spectro-temporal feature extraction procedure is briefly discussed in Section II. The proposed feature extraction algorithm for phonemes, using the behavior of the GMM clusters in the spectro-temporal domain, is presented in Section III. Experimental results are discussed in Section IV and the paper is concluded in Section V.

II. SPECTRO-TEMPORAL FEATURE EXTRACTION

The auditory model described in this section models the inner ear and the first layer of the auditory cortex, and has been used in speech processing applications in recent years [2]. This model has two basic stages. The block diagram of the auditory model is shown in Fig. 1 [4].

A. The First Stage of Auditory Model

When an audio signal enters the ear, the neural activities of the basilar membrane of the cochlea convert the one-dimensional audio signal into a two-dimensional auditory spectrogram image, whose frequency axis is a tonotopic axis. The tonotopic axis is a logarithmic frequency axis along which frequency is not divided uniformly; instead, it is divided according to the centers of the bandpass filters located in the inner ear. The basilar membrane is considered as a bandpass filter bank of 128 asymmetric band-pass filters, uniformly distributed along the tonotopic axis.

978-1-4244-5900-1/10/$26.00 ©2010 IEEE

The cochlear filter outputs are converted into auditory nerve patterns by an inner hair cell (IHC) stage. The IHC stage consists of a high-pass filter, an instantaneous nonlinear compression, and a low-pass filter. The last stage is a model of lateral inhibitory network (LIN) activity, which increases the frequency selectivity of the cochlear filters. The LIN is approximated by a first-order derivative along the tonotopic axis. The final output of this stage is obtained using a half-wave rectifier and an integrator over a short time window.
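The sequence of operations in this first stage can be sketched schematically. The following is a toy numpy illustration, not the actual cochlear model: the filterbank here is a crude stand-in for the 128 asymmetric cochlear filters, and the compression nonlinearity is an assumed tanh.

```python
import numpy as np

def auditory_spectrogram(signal, n_channels=128, frame_len=160):
    """Schematic sketch of the first auditory stage:
    filterbank -> compression -> LIN derivative -> rectify -> integrate."""
    # 1. Cochlear filtering: faked here with modulated copies of the signal;
    #    the real model uses 128 asymmetric band-pass filters.
    t = np.arange(len(signal))
    bands = np.stack([signal * np.cos(2 * np.pi * (k + 1) * t / (2.0 * n_channels))
                      for k in range(n_channels)])
    # 2. Inner hair cell: instantaneous nonlinear compression (assumed tanh).
    bands = np.tanh(bands)
    # 3. Lateral inhibitory network: first-order derivative along the tonotopic axis.
    bands = np.diff(bands, axis=0, prepend=bands[:1])
    # 4. Half-wave rectification.
    bands = np.maximum(bands, 0.0)
    # 5. Short-time integration: average over non-overlapping frames.
    n_frames = bands.shape[1] // frame_len
    bands = bands[:, :n_frames * frame_len].reshape(n_channels, n_frames, frame_len)
    return bands.mean(axis=2)   # auditory spectrogram, shape (n_channels, n_frames)
```

The output is a nonnegative time-frequency image, which the cortical stage then analyzes.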

B. The Cortical Stage of Auditory Model

The auditory stage of the brain (in particular, the primary auditory cortex) analyzes the auditory spectrogram as an image in more detail. At this stage, a two-dimensional wavelet transform of the auditory spectrogram is calculated; the transform is performed using a spectro-temporal mother wavelet similar to a two-dimensional spectro-temporal Gabor function. In this stage, the spectral and temporal modulation contents of the auditory spectrogram are estimated via a bank of modulation-selective 2-D filters, centered at each frequency along the tonotopic axis. Each filter is tuned to a range of temporal modulations. The spectro-temporal impulse response of such a filter is called the spectro-temporal response field (STRF). There are two primitive two-dimensional STRF function types, named upward (+) and downward (-) respectively. The cortical representation of speech has four dimensions: scale (s, in cycles/octave), rate or velocity (r, in Hz), frequency (f, the index of the band-pass filter) and time (t, the frame number). A bank of directionally selective STRFs is created by combining two complex functions of time and frequency, such that each of the STRFs is a real function.
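The upward/downward distinction can be illustrated with a toy Gabor-like kernel. This is only an illustrative sketch, not the model's actual STRF seed response: the envelope width and the mapping of rate/scale onto modulation frequencies are assumptions made for the example.

```python
import numpy as np

def strf_gabor(n_t=32, n_f=32, rate=4.0, scale=0.5, direction=+1):
    """Toy 2-D Gabor-like STRF kernel: a localized spectro-temporal grating
    whose orientation sign selects upward (+) or downward (-) sweeps.
    rate and scale set the temporal/spectral modulation frequencies."""
    t = np.linspace(0, 1, n_t)[None, :]   # time axis (toy units)
    f = np.linspace(0, 1, n_f)[:, None]   # tonotopic axis (toy octaves)
    # Gaussian envelope localizes the response in time and frequency.
    envelope = np.exp(-((t - 0.5) ** 2 + (f - 0.5) ** 2) / 0.05)
    # The sign of `direction` flips the orientation of the grating.
    carrier = np.cos(2 * np.pi * (scale * n_f * f - direction * rate * t))
    return envelope * carrier             # real-valued kernel, shape (n_f, n_t)
```

Convolving such kernels with the auditory spectrogram yields the modulation-selective cortical outputs described above.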

It has been shown that the STRFs in the primary auditory cortex have good discriminating characteristics for different phonemes, which is consistent with physiological findings [11].

Considering the direction of the STRF filters, two complex responses are obtained: one for the upward filters and the other for the downward filters. The dimensions of the feature space and the degrees of freedom in this space are very large. Therefore, reducing the dimensions of the feature space is a crucial task for training the parameters of the classifiers efficiently [4].

Figure 1. Block diagram of the auditory model

III. PHONEME FEATURE EXTRACTION USING PHONEME CLUSTER BEHAVIOR IN THE SPECTRO-TEMPORAL SPACE

The block diagram of the proposed method is shown in Fig. 2. In the first stage, the auditory spectrogram of the speech frames is calculated. Then, spectro-temporal features are extracted using the auditory spectrogram [12]. The information at the output of the cortical stage is distributed over the four dimensions of time, frequency, rate and scale. The auditory cortex is modeled by a spectro-temporal filter bank in which each filter is tuned to a different range of rates and scales. Because of the large dimensions of the spectro-temporal feature space, a clustering block is added to the spectro-temporal model in the proposed method, to reduce the feature space attributes and select the features with valuable information. In this section, the feature space is segmented using GMM clustering. As a result, by defining the main clusters in each speech frame, new feature vectors with smaller dimensions are extracted.

A. Gaussian Mixture Model

The Gaussian Mixture Model is one of the common methods for unsupervised classification. In this method, the data are clustered using a mixture of Gaussian distributions, and each cluster is modeled as a single Gaussian probability distribution function in the input space. The main problem of this method is the estimation of the model parameters [13]. The GMM parameters are the mean vectors, covariance matrices and mixing weights, which are estimated in an iterative procedure to fit the model to the dataset. An efficient way to estimate the parameters is maximum likelihood estimation through the expectation-maximization (EM) algorithm [14].

The probability distribution of the data-set feature vectors $X = \{x_i,\; i = 1, \dots, N\}$, expressed as a linear combination of $K$ Gaussian mixtures, is given by:

$$P(x_i \mid \lambda) = \sum_{k=1}^{K} C_k \, N(x_i; \mu_k, \Sigma_k), \qquad \lambda = \{C_k, \mu_k, \Sigma_k\}, \quad 1 \le k \le K \tag{1}$$

where $\lambda$ denotes the set of GMM parameters and $N(x_i; \mu_k, \Sigma_k)$ is the Gaussian probability density function with mean vector $\mu_k$ and covariance matrix $\Sigma_k$:

$$N(x_i; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_k|^{1/2}} \exp\!\left[-\tfrac{1}{2}\,(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right] \tag{2}$$

Figure 2. Block diagram of the proposed method


where $D$ is the dimensionality of the vector $x_i$. The choice of the number of mixtures depends on the amount of training data. The mixing weights must satisfy $\sum_{k=1}^{K} C_k = 1$ and $C_k \ge 0$.
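For the diagonal-covariance case used later in this paper, the EM updates implied by Eqs. (1)-(2) can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation; the initialization scheme and the small variance floor are assumptions made for the example.

```python
import numpy as np

def fit_diag_gmm(X, K=3, n_iter=50, seed=0):
    """Minimal EM for a diagonal-covariance GMM (Eqs. (1)-(2)).
    Returns mixing weights C, means mu, and diagonal variances var."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]        # initialize means from data
    var = np.full((K, D), X.var(axis=0) + 1e-6)    # initialize variances
    C = np.full(K, 1.0 / K)                        # uniform mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik ∝ C_k N(x_i; mu_k, Sigma_k).
        diff = X[:, None, :] - mu[None, :, :]      # shape (N, K, D)
        log_pdf = -0.5 * (np.sum(diff**2 / var, axis=2)
                          + np.sum(np.log(2 * np.pi * var), axis=1))
        log_w = np.log(C) + log_pdf
        log_w -= log_w.max(axis=1, keepdims=True)  # numerical stability
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate C_k, mu_k, Sigma_k from the responsibilities.
        Nk = gamma.sum(axis=0)
        C = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        diff = X[:, None, :] - mu[None, :, :]
        var = np.einsum('nk,nkd->kd', gamma, diff**2) / Nk[:, None] + 1e-6
    return C, mu, var
```

Each iteration increases the data likelihood, and the returned mixing weights always sum to one by construction.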

B. Extracted Spectro-Temporal Features

For clustering in the spectro-temporal domain using GMM, the outputs of the upward and downward cortical STRFs are calculated at each point of the four-dimensional space of scale (in cycles/octave), rate (in Hz), frequency and time, which yields two amplitudes and two phases of the complex outputs at the coordinates of the auditory output. Therefore, each point in the clustered space is represented as a five-dimensional vector $v = (r, s, f, A_1, \phi_1)$, where $r$ denotes the rate, $s$ the scale, $f$ the frequency, $A_1$ the magnitude, and $\phi_1$ the phase of the output of the upward or downward STRFs at that point of the space. To combine the upward and downward STRF responses, the outputs of both sets of filters are concatenated along the rate axis. Therefore, in each frame, two 3-D cubes with rate, scale and frequency axes are concatenated into a new 3-D cube of twice the length along the rate axis. Scale, rate and frequency are the coordinates of a location in the new space in each frame, and are included in the feature vectors. In regions of the spectro-temporal space where speech signal is present, the magnitude components of the points are large; as a result, clusters with smaller magnitudes carry less valuable information. Assuming that the points in this space are Gaussian distributed, the five-dimensional vectors are clustered with a GMM with diagonal covariance matrices. The model parameters can be determined using the EM algorithm, and the center of each cluster (i.e., the mean vector of the cluster) is considered as the feature vector of the cluster.

The number of clusters is predefined, and a cluster center with a larger magnitude is considered a more valuable feature. Therefore, the feature vector is determined after sorting the cluster centers by their magnitude components, as $V = (\mu_1, \mu_2, \mu_3, \dots)$, where each mean vector consists of five components. In this study, three clusters are extracted for each frame; therefore, the feature vector consists of 15 elements in the mean-only case. It was empirically shown that the phase components are not important in the clustering procedure; they were therefore removed from the feature vector in the mean-plus-variance case, to avoid the curse of dimensionality. In another approach, the clustering is performed on the whole phoneme to obtain a single feature vector for each phoneme.
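The per-frame feature construction, given the frame's 5-D points, can be sketched as follows. For self-containment this sketch uses a plain k-means stand-in rather than the paper's GMM fit, and the function name is ours; the sort-by-magnitude and concatenation steps follow the description above.

```python
import numpy as np

def frame_feature_vector(points, K=3, n_iter=20, seed=0):
    """Cluster a frame's 5-D points (rate, scale, freq, magnitude, phase)
    and concatenate the cluster centers sorted by magnitude (index 3).
    K-means stands in here for the paper's diagonal-covariance GMM."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), K, replace=False)]
    for _ in range(n_iter):                       # Lloyd iterations
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    order = np.argsort(-centers[:, 3])            # largest magnitude first
    return centers[order].ravel()                 # 3 clusters x 5 dims = 15
```

Sorting by magnitude makes the feature ordering consistent across frames, so the most energetic cluster always occupies the first five slots.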

IV. EXPERIMENTAL RESULTS

The extracted features are tested on (/b/, /d/, /g/) phoneme classification using clean speech from the TIMIT database. TIMIT contains a total of 6300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The proposed method was tested on the different dialects [15]. To test the proposed method, the data were first selected from the TIMIT database, and features were then extracted from the clustered spectro-temporal space. The extracted feature vectors were used for training. Evaluations were then performed on the test-set feature vectors, using the frame-wise classification rate and two mechanisms for fusing the classification results of the frames. After the feature selection procedure, the new feature vectors are fed to a nonlinear SVM classifier with a Gaussian kernel, and to a Hidden Markov Model (HMM), for phoneme classification [16, 17]. The SVM training parameters are the Radial Basis Function (RBF) kernel parameter γ and the misclassification cost C. The optimum values of these parameters were determined empirically to maximize the classification rate.

A. Mechanisms of Features Combination

The system was primarily evaluated in a frame-based framework, in which each frame was clustered and the resulting feature vector was extracted for each frame. In the next evaluation strategy, the phoneme classification rate was evaluated. For this purpose, two mechanisms were used to combine the features. In the first mechanism, the extracted feature vector of each frame of a phoneme is classified using the SVM, and the final class of the phoneme is determined by voting among the predicted classes of all frames of the phoneme utterance; this mechanism is called voting after classification. The second mechanism combines the feature vectors by taking the mean feature vector over a phoneme's sequence of frames before classification.
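The two combination mechanisms can be sketched as follows. This is a minimal illustration; the function names are ours, not from the paper.

```python
import numpy as np

def vote_after_classification(frame_labels):
    """Mechanism 1: classify every frame first, then take the majority
    vote over the frame-level labels as the phoneme-level decision."""
    labels, counts = np.unique(frame_labels, return_counts=True)
    return labels[counts.argmax()]

def mean_feature_vector(frame_features):
    """Mechanism 2: average the frame feature vectors of the phoneme
    first, then classify the single averaged vector."""
    return np.mean(frame_features, axis=0)
```

Mechanism 1 keeps every frame's evidence until the final decision, which is consistent with the paper's finding that it outperforms averaging before classification.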

B. Results

Table I shows the classification rates of the (/b/, /d/, /g/) phonemes for MFCC (Mel frequency cepstral coefficients), LPC (linear predictive coding) and NPC (neural predictive coding) features [18], compared with the feature vector proposed in this study. The new feature vector gives better results for phoneme classification in all dialects. The results of (/b/, /d/, /g/) phoneme classification with the HMM are shown in Table II. In addition, the results of (/b/, /d/, /g/) phoneme classification with the SVM are shown in Table III, using the frame-wise and phoneme-wise evaluation strategies for frame clustering. The results show that the voting-after-classification mechanism outperforms the second mechanism for phoneme classification. This may be because, in this strategy, the classification results of all frames of a phoneme are considered in the final decision. The results of (/b/, /d/, /g/) phoneme classification with feature vectors extracted per phoneme are shown in Table IV, for Dialect 1 with the SVM. One drawback of this approach is the processing time for feature extraction. In another experiment, a thresholding technique was used to reduce the dimension of the feature vector in each frame.

TABLE I. PHONEME CLASSIFICATION RATE OF THE PROPOSED METHOD COMPARED WITH EXISTING METHODS

Features            | LPC  | MFCC | NPC  | Proposed
Classification Rate | 55.5 | 57.8 | 62.5 | 70.4

For this purpose, the points of the spectro-temporal domain whose amplitudes are smaller than an empirically determined threshold can be removed from the feature vector in each frame before clustering. The results of (/b/, /d/, /g/) phoneme classification with the thresholding technique are shown in Table V, for all dialects. Although the process was accelerated, the phoneme classification rate decreased significantly.
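The thresholding step itself is straightforward. The sketch below is an illustrative helper assuming the 5-D point layout (r, s, f, A, φ) of Section III, with the magnitude at index 3; the threshold value is tuned empirically.

```python
import numpy as np

def threshold_points(points, threshold):
    """Drop spectro-temporal points whose magnitude component (index 3
    of the 5-D vectors) falls below the threshold, before clustering.
    Fewer points speed up the EM fit, at the cost of discarding weak
    evidence, which matches the accuracy drop reported in Table V."""
    return points[points[:, 3] >= threshold]
```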

TABLE II. PHONEME CLASSIFICATION RATE FOR VARIOUS DIALECTS ON THE TIMIT DATABASE USING HMM

Dialect | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | ALL
Train   | 71.8 | 60.9 | 68.3 | 67.3 | 67.7 | 69.4 | 65.9 | 82.8 | 59.1
Test    | 64.7 | 55.3 | 65.6 | 59.2 | 64.0 | 62.2 | 62.1 | 67.0 | 59.1

TABLE III. PHONEME CLASSIFICATION RATE FOR VARIOUS DIALECTS ON THE TIMIT DATABASE USING THE SVM CLASSIFIER

        | Frame classification | Voting after classification | Mean feature vector
Dialect | Train | Test         | Train | Test                | Train | Test
1       | 73.3  | 73.0         | 75.7  | 78.2                | 71.7  | 74.6
2       | 74.3  | 64.8         | 77.4  | 66.4                | 72.3  | 64.5
3       | 73.3  | 68.4         | 76.0  | 72.4                | 72.2  | 71.0
4       | 72.9  | 68.2         | 75.8  | 71.2                | 72.9  | 69.7
5       | 73.7  | 67.6         | 76.3  | 70.8                | 73.4  | 69.5
6       | 78.2  | 70.3         | 80.4  | 72.6                | 75.5  | 72.0
7       | 73.5  | 67.0         | 76.7  | 70.4                | 71.4  | 68.7
8       | 78.1  | 69.1         | 82.3  | 74.7                | 76.5  | 74.2
ALL     | 72.9  | 70.4         | 75.8  | 73.7                | 71.4  | 70.3

TABLE IV. PHONEME CLASSIFICATION RATE FOR FEATURE EXTRACTION PER PHONEME, ON DIALECT 1 OF THE TIMIT DATABASE WITH SVM

                            | Train | Test
Phoneme Classification Rate | 77.23 | 66.32

TABLE V. PHONEME CLASSIFICATION RATE WITH THE THRESHOLDING TECHNIQUE, ON THE TIMIT DATABASE USING THE SVM CLASSIFIER

     | Frame classification | Voting after classification | Mean feature vector
     | Train | Test         | Train | Test                | Train | Test
Rate | 69.8  | 59.1         | 76.3  | 65.2                | 47.3  | 45.8

V. CONCLUSION

In this paper, a new method was presented for spectro-temporal feature selection, in order to reduce the dimensions of the feature space. The method is based on clustering in the spectro-temporal domain, extracting the main clusters of each acoustic event. In the proposed method, the positions of the clusters are determined for each frame of speech; therefore, a training procedure is needed to carry out the feature extraction for each frame. As a result, the method may be regarded as an adaptive feature extraction method that takes the frame conditions into account. Two feature combination mechanisms were evaluated for phoneme classification: voting after classification and the mean feature vector. The feature vector was used for (/b/, /d/, /g/) phoneme classification on the TIMIT database. The results indicate that the new features perform better in phoneme classification than existing methods.

REFERENCES

[1] X. Yang, K. Wang, and S. A. Shamma, "Auditory representation of acoustic signals," IEEE Trans. Inf. Theory, vol. 38, no. 2, pp. 824-839, Mar. 1992.

[2] K. Wang and S. A. Shamma, "Spectral shape analysis in the central auditory system," IEEE Trans. Speech Audio Process., vol. 3, no. 5, pp. 382-395, Sep. 1995.

[3] B. Meyer and M. Kleinschmidt, "Robust speech recognition based on localized spectro-temporal features," in Proc. Elektronische Sprach- und Signalverarbeitung (ESSV), Karlsruhe, 2003.

[4] N. Mesgarani, S. A. Shamma and M. Slaney, "Discrimination of speech from non-speech based on multiscale spectro-temporal modulations," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 3, May 2006.

[5] T. T. Wang and T. F. Quatieri, "High-pitch formant estimation by exploiting temporal change of pitch," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 1, pp. 171-186, 2010.

[6] N. Mesgarani and S. Shamma, "Speech enhancement based on filtering the spectrotemporal modulations," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[7] J. Bouvrie, T. Ezzat and T. Poggio, "Localized spectro-temporal cepstral analysis of speech," in Proc. ICASSP, Las Vegas, Nevada, 2008.

[8] J. Pylkkönen, "LDA based feature estimation methods for LVCSR," in Proc. InterSpeech 2006, Pittsburgh, PA, USA, 17-21 Sep. 2006.

[9] T. Kinnunen, I. Kärkkäinen and P. Fränti, "Is speech data clustered? - Statistical analysis of cepstral features," in Proc. EuroSpeech 2001, pp. 2627-2630, 2001.

[10] X. Y. Liu, Z. W. Liao, Z. S. Wang and W. F. Chen, "Gaussian mixture models clustering using Markov Random Field for multispectral remote sensing images," in Proc. Fifth Int. Conf. on Machine Learning and Cybernetics, Dalian, 13-16 Aug. 2006.

[11] N. Mesgarani, S. V. David, J. B. Fritz and S. A. Shamma, "Phoneme representation and classification in primary auditory cortex," J. Acoust. Soc. Am., vol. 123, no. 2, pp. 899-909, 2008.

[12] M. Kleinschmidt, "Methods for capturing spectro-temporal modulations in automatic speech recognition," Acta Acustica united with Acustica, vol. 88, pp. 416-422, 2002.

[13] X. Y. Liu, Z. W. Liao, Z. S. Wang and W. F. Chen, "Gaussian mixture models clustering using Markov Random Field for multispectral remote sensing images," in Proc. Fifth Int. Conf. on Machine Learning and Cybernetics, Dalian, 13-16 Aug. 2006.

[14] A. Dempster, N. Laird and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statistical Society, Ser. B, vol. 39, no. 1, pp. 1-38, 1977.

[15] University of Pennsylvania Linguistic Data Consortium, DARPA-TIMIT: a multi-speaker database.

[16] J. Salomon, "Support vector machines for phoneme classification," Master's thesis, University of Edinburgh, 2001.

[17] P. Somervuo, "Experiments with linear and nonlinear feature transformations in HMM-based phone recognition," in Proc. ICASSP, 2003.

[18] B. Gas, J. L. Zarader, C. Chavy and M. Chetouani, "Discriminant neural predictive coding applied to phoneme recognition," Neurocomputing, vol. 56, pp. 141-166, 2004.
