Facial Expression Recognition Based on Combination of Spatio-temporal and Spectral Features in Local Facial Regions


Nakisa Abounasr

Department of Electrical Engineering, Najafabad Branch, Islamic Azad University,

Isfahan, Iran.

[email protected]

Hossein Pourghassem

Department of Electrical Engineering, Najafabad Branch, Islamic Azad University,

Isfahan, Iran.

[email protected]

Abstract: This paper presents two new approaches to facial expression recognition, one for still images and one for image sequences, based on the digital curvelet transform and local binary patterns from three orthogonal planes (LBP-TOP). For still images, features are extracted by applying the digital curvelet transform to local facial regions; only the sub-bands whose angles correspond to the facial regions are used, because these sub-bands carry more frequency information. For image sequences, the digital curvelet coefficients are combined with LBP-TOP so that spatio-temporal and spectral features are represented together. On the Cohn-Kanade facial expression database, the proposed approaches achieve recognition rates of 91.90% and 88.38% for still images and image sequences, respectively.

Keywords: facial expression recognition; digital curvelet transform (DCUT); local binary patterns from three orthogonal planes (LBP-TOP).

I. INTRODUCTION

Automatic facial expression recognition has many applications in areas such as human-computer interaction (HCI), emotion analysis, indexing and retrieval of image and video databases, interactive video, image understanding, and synthetic face animation [1].

According to [2], existing approaches to facial expression analysis can be categorized as geometric-based or appearance-based. Geometric features include the shapes and positions of face components, such as the corners of the mouth, the nose, and the eyebrows. Appearance-based methods depend on skin motion and texture changes (deformations of the skin) such as wrinkles, bulges, and furrows. Dynamic texture recognition can be seen as a generalization of appearance-based approaches [2]. In other words, in addition to the two approaches above, a dynamic texture-based approach can be used for facial expression analysis. Facial expressions can be regarded as a dynamic texture because the activity of the facial muscles is dynamic. The appearance and motion of a dynamic texture can therefore be considered in two directions, spatial and temporal, so that information from both domains is combined. Saha et al. [3] recognized facial expressions by combining the curvelet transform and local binary patterns. They also used curvelet entropy for classifying facial expressions. However, they experimented only on still images, using the JAFFE database and the final image of each image sequence in the Cohn-Kanade database. Juxiang et al. [4] used curvelet sub-bands (i.e., low frequency, first detail layer with 4 and 8 directions, second detail layer with 4 and 8 directions) as features for recognition. They evaluated their approach on the Cohn-Kanade and JAFFE databases.

Aleksic and Katsaggelos [1] proposed facial animation parameters as features describing facial expressions and used multistream HMMs for recognition. The system is complex, which makes it difficult to run in real time. Bartlett et al. [5], [6] applied different methods, such as optical flow, explicit feature measurement (i.e., length of wrinkles and degree of eye opening), ICA, and Gabor wavelets [2]. Yeasin et al. [7] used the horizontal and vertical components of the flow as features. At the frame level, the k-nearest neighbor (k-NN) rule was used to derive a characteristic temporal signature for every video sequence. At the sequence level, discrete Hidden Markov Models (HMMs) were trained to recognize the temporal signatures associated with each of the basic expressions. This approach cannot handle illumination variations, however. Cohen et al. [8] introduced a Tree-Augmented-Naive Bayes classifier for recognition. However, they experimented on a set of only five people, and the accuracy was only around 75 percent [9].

Recently, local binary patterns (LBP) have been introduced as effective appearance features for facial image analysis. The LBP descriptor is used for feature extraction: it is simple to compute, and it produces features that classify texture with high accuracy. In other words, it is a powerful means of extracting features from texture. Its most important benefit is its robustness to gray-scale and illumination changes [10].

The curvelet transform is a multi-scale and multi-directional decomposition [11, 12]. It can be used for feature extraction because it gives a sparse representation of objects, and because its basis elements are highly anisotropic and have very high directional sensitivity [13].

2013 8th Iranian Conference on Machine Vision and Image Processing (MVIP)


Facial expression recognition methods based on curvelet coefficients extract features from the low-frequency, high-frequency, and detail parts of the decomposition. Most previous methods used the whole approximate (low-frequency) sub-band together with all the detail information as features.

In this paper, two algorithms are proposed, one for still images and one for image sequences, and both are evaluated on the Cohn-Kanade database. In the algorithm for still images, the digital curvelet coefficients of the first scale (the low-frequency part, or approximate sub-band) and of specific detail sub-bands of the second scale are computed in the eye and mouth regions and used as features. In the algorithm for image sequences, the digital curvelet transform at different scales and angles is applied to each image of the sequence.

The low-frequency coefficients are thus obtained for each frame, and new image sequences are formed from them. Local binary patterns from three orthogonal planes (LBP-TOP) are then used to extract features from these low-frequency curvelet coefficients. In this way, the spatio-temporal and spectral domains are combined. Finally, an SVM classifier is used for classification in both algorithms.

The rest of the paper is organized as follows. Section II presents the proposed method and describes the feature extraction and classification methods. Section III presents and discusses the facial expression recognition results obtained with our method. Finally, conclusions are given in Section IV.

II. PROPOSED ALGORITHM

Feature extraction is an important stage in facial expression recognition, so the proposed method relies on extracting suitable and accurate features. Fig. 1 illustrates the proposed representation for still images (left column), based on digital curvelet coefficients, and for image sequences (right column), based on DCUT and LBP-TOP.

A. Digital Curvelet Transform

The second generation of the digital curvelet transform (DCUT) was described by Candès and Donoho [11]. The transform can be implemented in two ways, i.e., with the unequally spaced fast Fourier transform (USFFT) or with the wrapping transform [13]. The two implementations differ in the choice of the spatial grid used to translate curvelets at each scale and angle. The wrapping algorithm is faster to compute. The use of DCUT for feature extraction is motivated by the statistical properties of curvelet coefficients, such as directional sensitivity. Proper features can therefore be extracted from the corresponding coefficients and specific sub-bands [14].

B. Local Binary Patterns from Three Orthogonal Planes

The LBP operator was introduced in [15]. The basic LBP considers a neighborhood around each pixel and thresholds the neighboring pixels by the value of the central pixel. The resulting binary values are then weighted as follows:

$$\mathrm{LBP}(I_c) = \sum_{i=0}^{p-1} s(I_i - I_c)\, 2^{i} \qquad (1)$$

Fig. 1. Diagram of proposed methods.

Fig. 2. (a) Three orthogonal planes XY, XT, and YT. (b) Histogram of each plane. (c) Histograms concatenated into a single histogram [10].

where p is the number of neighboring points, I_c and I_i are the gray-level values at the central pixel c and at the neighbor i, and s is defined as follows:

$$s(I_i - I_c) = \begin{cases} 1, & I_i - I_c \ge 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
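As a rough illustration of Eq. (1) and Eq. (2), the following Python sketch computes the basic LBP code of one interior pixel with eight neighbors at radius one. The neighbor ordering in `NEIGHBORS_8`, the function names, and the sample `patch` are assumptions made for this example; the paper does not specify them.

```python
def s(x):
    """Threshold function of Eq. (2): 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

# (row, col) offsets of the 8 neighbors at radius 1, in a fixed circular order.
# The starting neighbor and the direction are assumptions of this sketch.
NEIGHBORS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(image, r, c, neighbors=NEIGHBORS_8):
    """Basic LBP code of Eq. (1) for the interior pixel at (r, c)."""
    center = image[r][c]
    code = 0
    for i, (dr, dc) in enumerate(neighbors):
        # Each thresholded neighbor contributes the weight 2^i.
        code += s(image[r + dr][c + dc] - center) * (2 ** i)
    return code

# Hypothetical 3x3 gray-level patch; the center pixel has value 6.
patch = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 6, 3],
]
print(lbp_code(patch, 1, 1))  # 42
```

With this ordering, only the neighbors with values 9, 7, and 6 are greater than or equal to the center, so the code is 2 + 8 + 32 = 42. A different neighbor ordering would give a different code for the same patch.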

The term "uniform" refers to the uniform appearance of the local binary pattern, i.e., there is a limited number of transitions or discontinuities in the circular presentation of the pattern. Put simply, an LBP code is "uniform" when it contains at most two transitions from 0 to 1 or vice versa [15].
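The uniformity test can be sketched in a few lines of Python. This is a minimal example assuming 8-bit codes read circularly; the function names are illustrative and not taken from the paper.

```python
def transitions(code, p=8):
    """Number of 0/1 changes in the circular p-bit pattern of an LBP code."""
    bits = [(code >> i) & 1 for i in range(p)]
    return sum(bits[i] != bits[(i + 1) % p] for i in range(p))

def is_uniform(code, p=8):
    """A code is 'uniform' when it has at most two circular transitions."""
    return transitions(code, p) <= 2

uniform_codes = [code for code in range(256) if is_uniform(code)]
print(len(uniform_codes))      # 58
print(is_uniform(0b00011100))  # True
print(is_uniform(0b01010101))  # False
```

For eight neighbors, 58 of the 256 possible codes are uniform, which is why uniform patterns shorten the histogram considerably.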

[Fig. 1 block labels. Still image: normalized original image, segment facial region, DCUT of facial regions, select proper sub-band, classification. Image sequence: normalized original image, DCUT, select low-frequency sub-band, extract LBP-TOP features, classification.]

The LBP code is extracted from the XY, XT, and YT planes, and the resulting codes are denoted XY-LBP, XT-LBP, and YT-LBP [9]. Fig. 2 shows the three orthogonal planes XY, XT, and YT on an image sequence of a face. The statistics of the three planes are obtained and then concatenated into a single histogram. In such a representation, image sequences are encoded by XY-LBP, XT-LBP, and YT-LBP, so the appearance and motion of facial expressions are considered in three directions, which means that spatial and spatio-temporal information are both incorporated.
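The concatenation of the three plane histograms can be illustrated with the simplified Python sketch below. It assumes a volume stored as `volume[t][y][x]`, eight neighbors at radius one on every axis, no interpolation, and plain (non-uniform) 256-bin histograms. These are simplifications chosen for brevity, and the names and the synthetic volume are hypothetical.

```python
# In-plane offsets of the 8 neighbors at radius 1 (ordering is an assumption).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def plane_code(volume, t, y, x, plane):
    """LBP code of voxel (t, y, x) computed inside one of the planes XY, XT, YT."""
    center = volume[t][y][x]
    code = 0
    for i, (a, b) in enumerate(OFFSETS):
        if plane == "XY":
            value = volume[t][y + a][x + b]
        elif plane == "XT":
            value = volume[t + a][y][x + b]
        else:  # "YT"
            value = volume[t + a][y + b][x]
        code += (1 if value >= center else 0) << i
    return code

def lbp_top_histogram(volume):
    """Concatenate the XY, XT, and YT histograms into a single feature vector."""
    n_t, n_y, n_x = len(volume), len(volume[0]), len(volume[0][0])
    hist = {plane: [0] * 256 for plane in ("XY", "XT", "YT")}
    for t in range(1, n_t - 1):
        for y in range(1, n_y - 1):
            for x in range(1, n_x - 1):
                for plane in hist:
                    hist[plane][plane_code(volume, t, y, x, plane)] += 1
    return hist["XY"] + hist["XT"] + hist["YT"]

# Synthetic 3-frame, 4x5-pixel volume used only to exercise the code.
volume = [[[(t + 2 * y + 3 * x) % 7 for x in range(5)] for y in range(4)] for t in range(3)]
features = lbp_top_histogram(volume)
print(len(features))  # 768
```

Each interior voxel contributes one count to each of the three histograms, so the feature vector has 3 x 256 bins in this simplified setting.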

C. Support Vector Machine (SVM)

The support vector machine (SVM) is a binary classifier that separates two classes with a linear boundary. Multi-class classification is accomplished with the one-against-one or the one-against-rest (one-against-all) technique. The one-against-rest technique trains binary classifiers that each discriminate one expression from all the others, and outputs the class whose binary classifier gives the largest output. The one-against-one technique trains a binary classifier for each pair of expressions; in other words, it constructs N × (N - 1)/2 binary classifiers, using all pair-wise combinations of the N classes. Each classifier is trained using the samples of the first class as positive examples and the samples of the second class as negative examples. To combine these classifiers, the Max Wins algorithm is adopted: it selects the class voted for by the majority of the classifiers. For example, in our experiment the six classes are decomposed into 15 pair-wise classifiers [16, 10].
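The one-against-one decomposition and the Max Wins vote can be sketched as follows. This example only models the voting step; the binary decisions in `decisions` are made up for illustration and do not come from trained SVMs, and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

def class_pairs(classes):
    """All N(N-1)/2 pair-wise combinations used by the one-against-one scheme."""
    return list(combinations(classes, 2))

def max_wins(pairwise_winners):
    """Return the class that receives the most votes.

    pairwise_winners maps each (class_a, class_b) pair to the class chosen by
    the binary classifier trained on that pair. Ties go to the class that was
    voted for first, which is a choice made for this sketch.
    """
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]

pairs = class_pairs(CLASSES)
print(len(pairs))  # 15

# Hypothetical decisions: every classifier involving "joy" votes for it,
# and every other classifier votes for the first class of its pair.
decisions = {(a, b): ("joy" if "joy" in (a, b) else a) for a, b in pairs}
print(max_wins(decisions))  # joy
```

In this toy vote "joy" collects five votes, one from each pair it belongs to, while "anger" collects four.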

III. EXPERIMENTAL RESULTS

A. Data Set

The proposed algorithms were evaluated on the Cohn-Kanade Facial Expression Database [17]. The database contains 100 university students aged from 18 to 30 years, of whom 65% were female, 15% were African-American, and 3% were Asian or Latino. The database covers six basic emotions: anger, disgust, fear, happiness (joy), sadness, and surprise. For our study, 243 images were selected for both methods. The experiment on the combination of the second-generation curvelet transform and LBP-TOP is performed on the six emotions.

B. Preprocessing and Selected Facial Region

First, the image sequences are normalized to 144x120 pixels, as described in [18]. As Fig. 3 shows, some parts of the face, such as the eyes, the eyebrows, the space between the eyebrows, and the lips, change more than other parts, such as the cheeks. The features of interest are therefore extracted from the regions that change most.

These regions were cropped manually from the final frame of each sequence. They were located by determining the centers of the eyebrows and the lips. The eye regions and lip regions are 34x94 pixels and 60x40 pixels, respectively. A sample image from the database, the normalized image, and the selected facial regions are shown in Fig. 4a, b, and c, respectively.

C. Experiment Results Based on Proper Curvelet Sub-bands in Local Facial Region for Still Image

As mentioned above, the DCUT at different scales and angles is applied to the facial regions instead of the whole image. Given the size of the images, each image is decomposed into 3 scales. Sixteen directions are defined at the second scale, which gives 32 directions at the third scale. The first and third scales contain the low- and high-frequency information, respectively. The energy of the third scale (high frequency) is low, and it does not carry useful information for this task.

Fig. 3. Facial expression for the anger emotion.

Fig. 4. (a) Sample of the Cohn-Kanade dataset. (b) Normalized image. (c) Selected facial regions.

Fig. 5. The directional sub-bands at specific angles.

As shown in Fig. 5, in each scale the sub-bands that lie opposite each other contain the frequency information of the same range of angles. For example, among the 16 sub-bands of the second scale, the first and 9th sub-bands, the second and 10th sub-bands, and so on up to the 8th and 16th sub-bands form pairs that contain frequency information in the same range of angles. Following this reasoning, the features can be extracted from only eight sub-bands.

Consequently, the dimension of the feature vector and the amount of computation are reduced. Even so, the feature vector is too large if all of these sub-bands are used. Hence, we propose a method in which only some proper sub-bands are used.
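The pairing of opposite sub-bands can be written down explicitly. The short Python sketch below lists the pairs for a scale with an even number of sub-bands, using 1-based indices as in the text; the function name is illustrative.

```python
def opposite_pairs(n_subbands):
    """Pairs (i, i + n/2) of sub-bands that cover the same range of angles."""
    half = n_subbands // 2
    return [(i, i + half) for i in range(1, half + 1)]

pairs = opposite_pairs(16)
print(pairs[0], pairs[1], pairs[-1])  # (1, 9) (2, 10) (8, 16)
print(len(pairs))                     # 8
```

Keeping one sub-band from each pair leaves eight candidate sub-bands at the second scale.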

Table 1 presents the average energy of the curvelet coefficients at the first and second scales for each expression. The values in Table 1 are multiplied by 10⁵. As can be seen, the energy of the approximate sub-band (first scale) is the highest. In the second scale, the second and third sub-bands have the highest energy, followed by the 6th and 7th sub-bands. This means that these sub-bands contain more frequency information at the related angles. Fig. 5 shows the scale and angle segmentation of the curvelet transform, with the 2nd, 3rd, 6th, and 7th sub-bands of the second scale marked.
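The energy-based choice of sub-bands can be reproduced from the second-scale rows of Table 1. In the Python sketch below the numbers are copied from that table, while the selection rule (mean over the six expressions, keep the four largest) is an assumption about how the ranking might be carried out, not a procedure stated in the paper.

```python
# Average energies of the eight second-scale sub-bands, copied from Table 1.
# Column order: anger, joy, disgust, fear, sadness, surprise.
SECOND_SCALE_ENERGY = {
    1: [0.730, 0.886, 0.745, 0.886, 0.956, 0.991],
    2: [5.43, 5.21, 5.21, 5.41, 5.51, 3.955],
    3: [5.43, 4.39, 5.01, 5.54, 5.21, 4.40],
    4: [0.767, 0.694, 0.742, 0.854, 0.816, 1.03],
    5: [0.591, 0.549, 0.538, 0.635, 0.622, 1.018],
    6: [1.77, 1.25, 1.88, 1.48, 1.85, 2.04],
    7: [1.72, 1.33, 1.58, 1.68, 1.92, 2.21],
    8: [0.54, 0.610, 0.609, 0.631, 0.645, 1.19],
}

def select_subbands(energy_table, count):
    """Keep the `count` sub-bands with the largest mean energy over expressions."""
    mean_energy = {k: sum(v) / len(v) for k, v in energy_table.items()}
    ranked = sorted(mean_energy, key=mean_energy.get, reverse=True)
    return sorted(ranked[:count])

print(select_subbands(SECOND_SCALE_ENERGY, 4))  # [2, 3, 6, 7]
```

Under this rule the four selected sub-bands are the 2nd, 3rd, 6th, and 7th, which matches the sub-bands named in the text.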


The frequency information of the i-th sub-band, for 1 ≤ i ≤ 8, lies in the angular range

$$\left[\frac{3\pi}{4} - \frac{(i-1)\pi}{8},\; \frac{3\pi}{4} - \frac{i\pi}{8}\right].$$

Thus, the angles of the mentioned sub-bands correspond to the changes caused by facial expressions. Two feature sets are used: first, the first scale (low frequency) alone, and second, the first scale together with four sub-bands of the second scale (the second, third, 6th, and 7th sub-bands marked in Fig. 5). In the experiment, an SVM classifier with a second-degree polynomial kernel function, defined by

$$k(x_i, x_j) = (1 + x_i^{T} x_j)^{2},$$

is used.
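For reference, a second-degree polynomial kernel of the form (1 + x_i^T x_j)^2 can be evaluated as in the following Python sketch. The function name and the sample vectors are illustrative only.

```python
def poly2_kernel(x_i, x_j):
    """Second-degree polynomial kernel: (1 + <x_i, x_j>) squared."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    return (1 + dot) ** 2

# Two hypothetical 2-D feature vectors.
print(poly2_kernel([1, 2], [3, 4]))  # 144
```

Here the inner product is 1*3 + 2*4 = 11, so the kernel value is (1 + 11)^2 = 144. The kernel is symmetric in its two arguments.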

Table 2 presents the classification rates obtained when 65 percent of the samples are used for training. The first row gives the results for the approximate sub-band alone, and the second row gives the results for the approximate sub-band together with the second, third, 6th, and 7th sub-bands of the second scale. The best recognition rate, 91.90%, is obtained with the proposed approach (the approximate sub-band together with the mentioned sub-bands as features).

The recognition rates for joy, sadness, and surprise are the highest, because the motion of the mouth in these three expressions is distinctive and differs from that in the other expressions. Misclassification occurs when the mouth motion differs little between expressions. For example, the motion of the mouth in fear resembles that in other expressions, especially joy.

D. Experiment Results Based on Combination of DCUT and LBP-TOP for Image Sequences

Saha et al. [3] and other previous works recognized facial expressions on still images, whereas according to psychologists [19], analyzing a sequence of images produces more accurate recognition of facial expressions [9]. The proposed method is therefore applied to image sequences. The DCUT is computed for each image of the sequence. Given the size of the normalized image, the curvelet decomposition produces 5 scales. The approximate sub-band is selected for feature extraction because its energy is higher than that of the other sub-bands. In this way, an image sequence made of the low-frequency curvelet coefficients is obtained for each expression. Each such sequence is divided into a number of non-overlapping blocks. Uniform local binary patterns from three orthogonal planes, as shown in Fig. 2, are then used to extract features from each block. Finally, the SVM classifier described in Section II-C classifies the six classes. Four and eight neighboring points in each plane and radii of one and three along the three axes are examined. According to the experiments, performance is best when the number of neighboring points in each plane is eight, the radius along the X and Y axes is one, and the radius along the T axis is two. It should be noted that the approximate sub-band contains low-frequency information, so it is quite different from the original image, which contains intensities. The proposed method thus considers information in the spectral, spatial, and temporal domains. Table 3 summarizes the recognition rates obtained with the first scale of DCUT and LBP-TOP as features when 70 percent of the samples are used for training. The proposed algorithm achieves an overall recognition rate of about 88.38 percent.
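The division into non-overlapping blocks can be sketched as below. The function name, the reading of a block size such as 3x2 as rows by columns, and the 24x20 frame size are assumptions for illustration; the size of the approximate sub-band is not stated in the text.

```python
def block_bounds(height, width, rows, cols):
    """Bounds (top, bottom, left, right) of rows x cols non-overlapping blocks."""
    bounds = []
    for r in range(rows):
        for c in range(cols):
            top, bottom = r * height // rows, (r + 1) * height // rows
            left, right = c * width // cols, (c + 1) * width // cols
            bounds.append((top, bottom, left, right))
    return bounds

# Hypothetical 24x20 coefficient frame split into 3x2 blocks.
blocks = block_bounds(24, 20, 3, 2)
print(len(blocks))  # 6
print(blocks[0])    # (0, 8, 0, 10)
print(blocks[-1])   # (16, 24, 10, 20)
```

An LBP-TOP histogram would then be computed inside each block over all frames, and the block histograms concatenated into one feature vector.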

As expected, the surprise expression has the highest recognition rate for every block size, owing to its distinctive mouth motion. The classification rate is better when the block size is 3x2 or 3x3. Table 4 compares our two approaches with other methods. The comparison shows that the proposed methods recognize facial expressions with acceptable accuracy.

TABLE I. AVERAGE ENERGY OF FIRST AND SECOND SCALES

Average energy (×10⁵)

Scale          Sub-band      anger   joy     disgust  fear    sadness  surprise
First scale    Approximate   3919    2531    3125     3325    3731     3464
Second scale   1st           0.730   0.886   0.745    0.886   0.956    0.991
Second scale   2nd           5.43    5.21    5.21     5.41    5.51     3.955
Second scale   3rd           5.43    4.39    5.01     5.54    5.21     4.40
Second scale   4th           0.767   0.694   0.742    0.854   0.816    1.03
Second scale   5th           0.591   0.549   0.538    0.635   0.622    1.018
Second scale   6th           1.77    1.25    1.88     1.48    1.85     2.04
Second scale   7th           1.72    1.33    1.58     1.68    1.92     2.21
Second scale   8th           0.54    0.610   0.609    0.631   0.645    1.19

TABLE II. RESULTS (PERCENT) OF DIFFERENT FEATURES FOR STILL IMAGE RECOGNITION

Recognition Rate (%)

Features                            anger   joy   disgust  fear    sadness  surprise
approximate sub-bands               72.72   100   81.81    64.28   100      93.33
approximate & mentioned sub-bands   90      100   90       71.42   100      100

TABLE III. RESULTS (PERCENT) OF DCUT & LBP-TOP FOR IMAGE SEQUENCES

Recognition Rate (%)

block size   anger   joy     disgust  fear    sadness  surprise
2×2          72.72   66.66   83.33    78.57   57.14    100
3×2          88.88   84.61   81.81    91.66   83.33    100
3×3          72.72   73.33   81.81    100     83.33    100

TABLE IV. COMPARISON WITH DIFFERENT APPROACHES

                  Method            Classification rate %
Still image       Saha & Wu [3]     90.33
Still image       Our method        91.90
Image sequences   Cohen [9]         73.22
Image sequences   Our method        88.38


IV. CONCLUSIONS

This paper presents two algorithms for facial expression recognition, one for still images and one for image sequences, both of which rely on selecting proper sub-bands. Because using all sub-bands is not practical and makes the feature vector very large, only some sub-bands are selected as features. The selection criterion is the energy of each sub-band. The energy of the first scale is the highest; in the second scale, the second, third, 6th, and 7th sub-bands have more energy than the other sub-bands. Spatio-temporal and spectral features are obtained by combining DCUT and LBP-TOP. The results show that the proposed methods are accurate enough to recognize facial expressions.

REFERENCES

[1] S.P. Aleksic and K.A. Katsaggelos, “Automatic Facial Expression Recognition Using Facial Animation Parameters and Multi-Stream HMMs,” IEEE Trans. Signal Processing, Supplement on Secure Media, vol. 1, pp. 3-11, March 2006.

[2] S. Koelstra, M. Pantic, and L. Patras, “A Dynamic Texture-Based Approach to Recognition of Facial Actions and Their Temporal Models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, pp. 1940-1954, 2010.

[3] A. Saha and Q.M.J. Wu, “Facial Expression Recognition Using Curvelet Based Local Binary Patterns,” Proc. IEEE Int’l Conf. Acoustics, Speech and Signal Processing, pp. 2470-2473, March 2010.

[4] Z. Juxiang, W. Yunqiong, X. Tianwei, and L. Wanquan, “A Novel Facial Expression Recognition Based on the Curvelet Features,” IEEE Conf. Image and Video Technology, pp. 82-87, 2010.

[5] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Recognizing Facial Expression: Machine Learning and Application to Spontaneous Behavior,” IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 568-572, 2005.

[6] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Fully Automatic Facial Action Recognition in Spontaneous Behavior,” IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 223-230, 2006.

[7] M. Yeasin and B. Bullot, “From Facial Expression to Level of Interest: A Spatio-Temporal Approach,” IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 922-927, 2004.

[8] I. Cohen, N. Sebe, A. Garg, L.S. Chen, and T.S. Huang, “Facial Expression Recognition from Video Sequences: Temporal and Static Modeling,” Computer Vision and Image Understanding, vol. 37, pp. 160-187, 2003.

[9] G. Zhao and M. Pietikainen, “Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915-927, 2007.

[10] N. Abounasr and H. Pourghassem, “Facial Expression Recognition Using ALGBP-TOP,” Int’l Journal of Imaging and Robotics, vol. 11, no. 3, pp. 11-23, 2013.

[11] D.L. Donoho and M.R. Duncan, “Digital Curvelet Transform: Strategy, Implementation and Experiments,” Department of Statistics, Stanford University, Nov. 1999.

[12] E.J. Candes and D.L. Donoho, “Curvelets: A Surprisingly Effective Non-Adaptive Representation for Objects with Edges,” in Proc. Saint-Malo, pp. 1-10, Apr. 2000.

[13] D. Donoho and M.R. Duncan, “Digital Curvelet Transform: Strategy, Implementation, Experiments,” Technical Report, Stanford University, 1999.

[14] M. Esmaeili, H. Rabbani, and A. Mehri, “Automatic Optic Disk Boundary Extraction by the Use of Curvelet Transform and Deformable Variational Level Set Model,” Pattern Recognition, vol. 45, pp. 2833-2842, 2012.

[15] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution Gray Scale and Rotation Invariant Texture Analysis with Local Binary Patterns,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 971-987, 2002.

[16] F. Falah Chamasemani and Y. Prasad Singh, “Multi-class Support Vector Machine (SVM) Classifiers: An Application in Hypothyroid Detection and Classification,” Int’l Conf. Bio-Inspired Computing: Theories and Applications, pp. 351-356, 2011.

[17] P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The Extended Cohn-Kanade Dataset (CK+): A Complete Expression Dataset for Action Unit and Emotion-Specified Expression,” Proc. Third Int’l Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB 2010), San Francisco, USA, pp. 94-101.

[18] Y.L. Tian, “Evaluation of Face Resolution for Expression Analysis,” Proc. IEEE Workshop Face Processing in Video, 2004.

[19] J. Bassili, “Emotion Recognition: The Role of Facial Movement and the Relative Importance of Upper and Lower Areas of the Face,” J. Personality and Social Psychology, vol. 37, pp. 2049-2059, 1979.