
Human Hand Gesture Recognition Using Spatio-Temporal Volumes for Human-computer Interaction

Maryam Vafadar*, Alireza Behrad*

* Faculty of Engineering, Shahed University, Tehran, Iran.

Emails: [email protected], [email protected]

Abstract

With the emergence of new applications such as virtual reality in image processing and machine vision, interfaces richer than the mouse and keyboard are needed for human-computer interaction. A variety of tools has been presented to address this problem, and hand gesture recognition is one of the most suitable. This paper presents a new algorithm based on spatio-temporal volumes for hand gesture recognition. In this algorithm, after applying the necessary pre-processing to the video frames, a spatio-temporal volume is constructed for each gesture. These volumes are then analyzed for matching, and feature vectors are extracted in two consecutive stages. Finally, a classification algorithm is applied.

We examined three different classifiers for recognition: k-nearest neighbor, learning vector quantization and back propagation neural networks. We tested the proposed algorithm on the collected data set; the results showed a correct gesture recognition rate of 99.98 percent for noiseless data and 92.08 percent for noisy data.

Keywords: Back propagation neural network, Feature vector, Hand gesture recognition, k-Nearest Neighbor, Learning vector quantization neural network, Spatio-temporal volume, Virtual reality.

I. Introduction

With the emergence of new applications such as virtual reality in image processing and machine vision, interfaces richer than the mouse and keyboard are needed for human-computer interaction. New tools are also needed so that users can interact with the computer without the 2D limitations of the mouse and keyboard [1]. Sensors and video cameras are two tools that have been proposed for this purpose. Sensor outputs require several filtering stages to eliminate temperature, pressure and other noise effects. In addition to their low accuracy, the higher cost of these devices puts them out of reach for general use. Webcams, by contrast, are cheap and widespread [2]. With video image processing, they can be used for gesture recognition and for interaction between persons and computers.

Different vision-based hand gesture recognition methods have been proposed including applying Fourier transform [3], wavelet transform [4] or Principal Component Analysis (PCA) [5] on images.

Edge orientation histograms [6], temporal templates [7] and oriented rectangular patches [8] are also used for human hand gesture recognition.

These methods are mostly pair-wise based: gesture information is extracted from groups of two consecutive frames. For example, in temporal template methods the difference of two consecutive frames is used. The edge orientation histogram is another pair-wise approach that recognizes gestures by extracting edges as a special feature in each frame and matching them between frames.

Although two-frame based approaches have been applied successfully in some applications, they face considerable difficulties in certain conditions. For example, noise and occlusion may degrade the feature extraction and matching stages and reduce the performance of the algorithm. Spatio-temporal volumes have been utilized as a means to alleviate the shortcomings of the traditional pair-wise approaches [9]. A spatio-temporal volume unifies the analysis of spatial and temporal information by stacking consecutive images into a volume of data whose third dimension is time. Spatial and temporal continuity therefore significantly reduces the complexity of feature correspondence.

In addition, frame noise has little effect on the volume and can rarely degrade the performance of the algorithm. Detecting occlusion events is also much easier, because they are represented explicitly as truncated paths in the volume [10].

2008 International Symposium on Telecommunications


This paper presents a new spatio-temporal algorithm for hand gesture recognition. In this algorithm, after applying the necessary pre-processing to the video frames, a spatio-temporal volume is constructed for each gesture. These volumes are then analyzed for matching, and feature vectors are extracted in two consecutive stages. Finally, three different classifiers, including k-Nearest Neighbor, Learning Vector Quantization and Back Propagation neural networks, are presented for feature vector classification and hand gesture recognition. We tested the proposed algorithm on the collected data set; the results showed a correct gesture recognition rate of 99.98 percent.

The paper continues as follows: Section 2 describes the collected data set of hand gestures. Section 3 explains the preprocessing steps for constructing spatio-temporal volumes. Section 4 analyzes spatio-temporal volumes for feature vector extraction. Section 5 presents feature vector classification using three different classifiers for hand gesture recognition. Section 6 shows experimental results for the proposed method, and conclusions appear in section 7.

II. Collected data set

The data set is collected from different persons with different hand sizes and skin colors; no obvious feature or sign exists on their hands. The collected data set contains videos of 24 hand gestures. The gestures combine 4 hand motions (rightward, leftward, upward and downward) with 6 simultaneous finger-position changes. Each gesture is repeated 10 times under different illumination conditions and with different users' hand sizes. Hand motion velocities may differ, and the total number of frames per gesture varies from 18 to 44. The resolution of the video frames is 320*240. Sample frames of 6 different rightward hand gestures are illustrated in Fig. 1.

Fig. 1. Sample frames of 6 rightward hand gestures; in each row frames are arranged from right to left.

III. Preprocessing

a) Hand segmentation

Prior to hand segmentation, it is necessary to remove noise from the captured video frames. We apply a median filter with a window size of 3 to each frame to handle salt & pepper noise. A Gaussian filter with zero mean and standard deviation 0.8 is also used to remove Gaussian noise.

Skin-color based segmentation has proven to be an effective method for segmenting the hand in fairly unrestricted environments [11]. Video frames are generally in RGB format, which is known to be sensitive to lighting conditions because RGB mixes color and intensity information [12], [13]. Therefore, the YIQ and HSV color systems, which are less sensitive to lighting conditions, are utilized. We use both color systems to obtain more accurate results. In the YIQ system, Y encodes intensity while I and Q represent color information. Likewise, in the HSV system, V denotes intensity while the H and S components specify color information. To reduce the effect of lighting and intensity, only the I&Q and H&S components are used to build a skin-color model for hand segmentation.

We apply the k-means algorithm individually to the I&Q and H&S color information and obtain two candidate regions using the two color models. The overlap of these two regions is then taken as the hand region. Fig. 2 shows the result of the hand segmentation algorithm on sample frames.

Fig. 2. Result of hand segmentation algorithm on sample frames; top row includes original frames and bottom row shows segmented hands
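The two-model clustering step above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the `skin_ref` reference chroma, the deterministic k-means initialization and the two-cluster assumption are ours.

```python
import numpy as np

def kmeans_2(points, iters=20):
    """Plain 2-cluster k-means on an (N, d) float array; returns labels.
    Deterministic extreme-point initialization for reproducibility."""
    idx = [points.sum(axis=1).argmin(), points.sum(axis=1).argmax()]
    centers = points[idx].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def skin_mask(chroma, skin_ref):
    """Cluster the two chroma channels of one frame (H, W, 2) into two
    regions and keep the cluster whose mean is closest to skin_ref."""
    h, w, _ = chroma.shape
    pts = chroma.reshape(-1, 2).astype(float)
    labels = kmeans_2(pts)
    means = [pts[labels == j].mean(axis=0) for j in (0, 1)]
    skin = int(np.argmin([np.linalg.norm(m - skin_ref) for m in means]))
    return (labels == skin).reshape(h, w)

def hand_region(iq, hs, iq_ref, hs_ref):
    """Hand region = overlap of the I&Q-based and H&S-based skin masks."""
    return skin_mask(iq, iq_ref) & skin_mask(hs, hs_ref)
```

Intersecting the two masks keeps only pixels that both color models agree are skin, which is what makes the combined model more robust than either color space alone.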


b) Hand contour extraction

As shown in Fig. 2, the hand region may not be segmented perfectly, and noise may appear in the output binary images. To handle this noise and extract the hand contour precisely, we apply the following morphological operators to the binary image of the previous stage:

1. Remove connected components smaller than P pixels; regions wrongly labeled as hand are thereby removed (opening operator).

2. Close gaps and small dents to obtain a hand image with a smoothed edge. We use a disk-shaped structuring element with a radius of 10, an optimized value for different hand sizes (closing operator).

3. Fill the hand region so that edges can be found easily.

4. Apply a proper edge operator to the resulting binary image to find edge points and extract the hand contour.

Fig. 3 shows the result of hand contour extraction algorithm.
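The four steps above can be sketched with SciPy's morphology routines. This is an illustrative sketch under our own assumptions: the paper does not specify P or the edge operator, so `min_size` stands in for P and the contour is taken as the filled region minus its erosion.

```python
import numpy as np
from scipy import ndimage

def extract_hand_contour(mask, min_size=50, radius=10):
    """Clean a binary hand mask and return its contour (edge) pixels.
    min_size plays the role of the paper's P threshold; the disk
    radius of 10 follows the paper."""
    # 1. opening-like step: drop connected components smaller than min_size
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    keep = np.isin(labels, 1 + np.flatnonzero(sizes >= min_size))
    # 2. closing with a disk-shaped structuring element
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = xx**2 + yy**2 <= radius**2
    closed = ndimage.binary_closing(keep, structure=disk)
    # 3. fill interior holes
    filled = ndimage.binary_fill_holes(closed)
    # 4. contour = filled region minus its erosion
    return filled & ~ndimage.binary_erosion(filled)
```

On a clean mask, only the one-pixel-wide boundary of the hand region survives, which is exactly what is stacked into the spatio-temporal volume in the next step.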

Fig. 3. Result of hand contour extraction algorithm

Finally, the extracted contours of the consecutive frames of each gesture are stacked over each other to construct a spatio-temporal volume. Fig. 4 shows an example of a spatio-temporal volume constructed from the hand contours of a typical gesture.

Fig. 4. Example of a spatio-temporal volume constructed from hand contours of a typical gesture

IV. Spatio-temporal volume analysis

To recognize a gesture, its spatio-temporal volume is extracted and matched against the spatio-temporal volumes of the database. We use a two-stage algorithm whose stages run consecutively. In the first stage, we apply a spatio-temporal curve matching algorithm to the database of hand gestures and select 6 candidate gestures.

The second stage of the algorithm is then applied to the outputs of the first stage to select the final gesture. The first stage uses only the motion information of the hand center and has low computational overhead; the second stage utilizes the hand contour shapes in the spatio-temporal volumes and is more accurate. The proposed two-stage algorithm therefore combines low computational overhead with high accuracy.

a) Motion curve matching

The motion of the hand creates a curve in the 3D spatio-temporal coordinates, which we call the motion curve. To extract the motion curve, we calculate the center of mass of the hand contour in each frame; saving these center coordinates frame by frame forms the motion curve. For each gesture, the centers of consecutive frames create an individual curve. Fig. 5 shows the hand motion curves of two sample gestures.

Fig. 5. Hand motion curves of sample gestures

The proposed algorithm should be insensitive to hand size, since several users with different hand sizes may use the method. Also, because of different camera and hand positions and different hand motion velocities, the same gesture may produce different center coordinates. To handle this problem, the center coordinates must be made insensitive to these factors. Fortunately, the center of the hand is insensitive to hand size, as shown in Fig. 6.

Fig. 6. Robustness of contour centers to different hand sizes

To deal with different gesture velocities, we normalize the motion curve and the spatio-temporal volume to a constant length along the time dimension. For this purpose, we use a linear interpolation method: in each gesture, we divide the gesture frames into a specified number of sectors as follows:

f = N / M (1)


Where N is the total number of frames, M is the number of sectors and f is the number of frames in each sector. The motion curve is then calculated using the average of the contour centers in each sector. Therefore, all gestures with different frame counts are converted to M-dimensional feature vectors, which handles different gesture velocities efficiently. We select M = 6 experimentally in our proposed method; therefore, as N varies from 18 to 44, f varies from 3 to 7.
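The sector normalization of equation (1) can be sketched as follows. This is our own minimal sketch; using np.array_split to balance the remainder when N is not divisible by M is an assumption, since the paper does not say how leftover frames are handled.

```python
import numpy as np

def sector_centers(centers, m=6):
    """Normalize a gesture of N frames to m sectors (equation 1):
    split the per-frame contour centers into m groups of roughly
    f = N/M frames each and average the centers within every group.
    centers is an (N, 2) array of per-frame contour centers."""
    pts = np.asarray(centers, dtype=float)
    return np.array([s.mean(axis=0) for s in np.array_split(pts, m)])
```

Whatever the frame count N, the output is always an (M, 2) array, so gestures of different speeds become directly comparable.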

To compensate for different hand positions in each frame, we use the differences of the sector centers instead of the sector center coordinates themselves to extract the motion curve, as follows:

d1 = b - a

d2 = (b + T) - (a + T) = b - a (2)

Where a and b are the hand contour centers in two consecutive sectors, T is an initial hand contour translation, and d1 and d2 are the difference vectors of the contour centers in the two consecutive sectors. As shown, even for different initial hand contour centers in consecutive sectors, the difference vectors of the contour centers are the same and both denote the same gesture.
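The translation invariance of equation (2) amounts to taking first differences along the curve, which can be sketched as:

```python
import numpy as np

def motion_features(sectors):
    """Difference vectors between consecutive sector centers (eq. 2);
    adding a constant translation T to every center leaves them
    unchanged, so the features do not depend on where the hand starts."""
    return np.diff(np.asarray(sectors, dtype=float), axis=0)
```

An (M, 2) curve yields an (M-1, 2) feature array, identical for the original curve and any translated copy of it.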

Now we are ready to match motion curves. We compare a given curve with all templates in the curve database, using the minimum mean square error to measure the similarity between curves:

MSE = sum_{t=1}^{M} [Y(t) - S(t)]^2 (3)

Where Y(t) is an unknown curve, S(t) is a template curve and M is the number of sectors. A smaller MSE indicates a higher degree of similarity. Finally, we select the 6 gestures that will be processed by the next stage to choose the final gesture.
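The first-stage candidate selection can then be sketched as a ranking of templates by the MSE of equation (3). The `best_candidates` name and list-based database layout are our illustration, not from the paper.

```python
import numpy as np

def best_candidates(query, templates, k=6):
    """Rank database curves by the MSE of equation (3) and return the
    indices of the k best candidates, smallest error first.
    query is an (M, 2) curve; templates is a sequence of (M, 2) curves."""
    errs = np.array([np.sum((np.asarray(query) - np.asarray(t)) ** 2)
                     for t in templates])
    return np.argsort(errs)[:k]
```

The 6 returned indices are exactly the candidate set that the slower, contour-shape-based second stage goes on to disambiguate.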

b) Hand contour shape matching

The motion of the fingers does not change the motion curve significantly; therefore, the first-stage algorithm cannot differentiate gestures with the same center motion but different finger positions. For example, all rightward hand motions with opening or closing fingers in consecutive frames have the same direction and similar motion curves, but they produce different hand contour shapes in the spatio-temporal volumes. We therefore use a second stage to match hand contour shapes in the 3D spatio-temporal coordinates. As mentioned before, we divide all frames into M distinct sectors by equation (1) to compensate for different hand velocities. In addition to sectors, we divide the contour of each frame into sections. Fig. 7 shows the division of a typical frame into S = 5 and S = 8 sections.

Fig. 7. Dividing the hand contour of a sample frame into S = 5 and S = 8 sections

By dividing the contours into M sectors and S sections, we have M*S different areas in each spatio-temporal volume for feature vector extraction and matching. We extract features in each area using two different sets of points in the area, as follows:

1. Points in the area with maximum curvature.

2. All points of the area.

In the first method, the points with maximum curvature are used for feature extraction because, in addition to indicating the contour shape, they are less sensitive to different hand sizes and to noise.

Once the points for feature extraction are selected, they are used for feature vector extraction by fitting a plane to them. We fit a plane for each area and use the normal vector of the plane as features. To fit a plane to the points, we use the following plane equation:

ax + by + cz + d = 0
=> (-a/d)x + (-b/d)y + (-c/d)z = 1
=> Ax + By + Cz = 1 (4)

In equation (4), x, y and z denote point coordinates in the spatio-temporal space and A, B and C are the components of the plane's normal vector; z refers to the temporal frame number.

Extracted points from each area form a set of equations as follow:

m m

n n n

B X A C

B A

z y x

z y x

z y x

⇒ =

⎥⎥

⎥⎥

⎥⎥

⎢⎢

⎢⎢

⎢⎢

=

⎥⎥

⎢⎢

⎥⎥

⎥⎥

⎥⎥

⎢⎢

⎢⎢

⎢⎢

1 ...

...

1 1

...

...

...

...

...

...

2 2 2

1 1 1

(5)

Where x1, y1, z1, ..., xn, yn, zn are the extracted points and A, B and C are the components of the normal vector of the best-fit plane, represented by the matrix X. We can solve equation (5) for X using the Least Mean Squared (LMS) criterion as follows:

X = (A_m^T A_m)^(-1) A_m^T B_m (6)

Where X is the normal vector of the plane. Since the normal vector is normalized to unit magnitude, only two of its components are sufficient to identify the plane. So with S sections and M sectors per spatio-temporal volume, we obtain a feature vector of size 2*S*M.
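The per-area plane fit of equations (4)-(6) can be sketched with a standard least-squares solve. The function name is our own; np.linalg.lstsq computes the same pseudo-inverse solution as equation (6) in a numerically stabler way.

```python
import numpy as np

def plane_normal_features(points):
    """Fit the plane A*x + B*y + C*z = 1 of equations (4)-(5) to an
    (n, 3) array of contour points and return two components of the
    unit normal as the area's features."""
    pts = np.asarray(points, dtype=float)
    # least-squares solution of  pts @ [A, B, C]^T = 1  (equation 6)
    x, *_ = np.linalg.lstsq(pts, np.ones(len(pts)), rcond=None)
    n = x / np.linalg.norm(x)   # normalize to unit magnitude
    return n[:2]                # two components identify the plane
```

Concatenating the two returned components over all M*S areas yields the 2*S*M-dimensional feature vector described above.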

It is important to note that dividing the spatio-temporal volumes into a constant number of sectors and sections, and using plane normal vectors, which are not affected by hand size, yields feature vectors that are insensitive to hand size, position and velocity.

c) Feature vector compression

In some applications the feature vector size is large, and compression algorithms may be applied to reduce it. The feature vector size for the proposed algorithm is 2*S*M, so it depends on the M and S parameters, which can be adjusted for data compression. The effect of M and S on the recognition rate is discussed in subsequent sections. Larger S and M provide a higher-dimensional feature vector with higher accuracy; smaller S and M provide a compressed, lower-dimensional feature vector with lower accuracy.

V. Classification

We used three different classifiers to classify the 6 candidate hand gestures:

1- k-Nearest Neighbor (kNN)

2- Learning Vector Quantization Neural Network (LVQ N.N.)

3- Back Propagation Neural Network (BP N.N.)

K-nearest neighbor is a supervised learning algorithm based on the minimum distance of the input hand gesture from the training candidate gestures; the k nearest neighbors determine the class. Each of these k neighbors belongs to a gesture class, and we select the class with the maximum number of occurrences.

We use Euclidean distance to determine nearest neighbors.
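The kNN decision rule just described can be sketched as follows; the function name and array layout are our illustration.

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=2):
    """k-nearest-neighbor vote with Euclidean distance, as used for
    the final gesture decision (k = 2 was optimal in the experiments)."""
    d = np.linalg.norm(np.asarray(train_X, dtype=float)
                       - np.asarray(x, dtype=float), axis=1)
    nearest = np.asarray(train_y)[np.argsort(d)[:k]]
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[counts.argmax()]   # majority class among the k neighbors
```

Here train_X holds the 2*S*M-dimensional feature vectors of the candidate gestures and train_y their gesture labels.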

Learning vector quantization (LVQ) is a method for training competitive layers in a supervised manner. An LVQ network has two layers: the first layer is competitive and the second is linear. Back propagation is the generalization of the Widrow-Hoff learning rule to multiple-layer networks with nonlinear differentiable transfer functions.

VI. Experimental results

We tested the proposed algorithm with our collected data set. As described before, we used two different methods for point selection:

1. Points in the area with maximum curvature.

2. All points of the area.

We tested these two methods with different classifiers and different M and S (section number) values. The experimental results showed that the optimum value of M is 6. Figs. 8 and 9 show the recognition rates of the first and second methods for different classification algorithms and different S values.

In the LVQ neural network, if two input vectors are very similar, the competitive layer will probably put them in the same class, but there is no mechanism in a strictly competitive layer design to decide whether any two input vectors belong to the same class or to different classes. We selected an optimized value of 20 hidden neurons in the competitive layer and 300 epochs for training the network.

The approximated function of the BP neural network may also not be accurate enough. To cope with this problem, we selected the optimum parameters experimentally, using different numbers of training epochs and neurons for different parameters of the algorithm. For S = 9 we use an optimized training epoch number of 600 and a hidden neuron number of 30, whereas for S = 18 an optimized training epoch number of 450 and a hidden neuron number of 15 are selected.

K-nearest neighbor achieves a better recognition rate, depending on the value of k. We tested the algorithm with different k values; the results are illustrated in Figs. 8 and 9. As shown, the optimum value of k is 2. In the first method with the kNN (k=2) classifier, recognition rates are between 96.66 and 99.58 percent; for the second method they are between 98.75 and 99.58 percent. These rates are excellent compared with the rates reported in [6] and [9] for simple gestures.

We also tested the proposed algorithm with noisy data by adding salt & pepper noise and Gaussian noise (N(0, σ)) to each frame of the captured hand gestures, separately. The algorithm was then applied with the optimum classifier, kNN (k=2), and S = 9. Figs. 10 and 11 show the recognition rates; the results demonstrate the promise of the algorithm even in the presence of noise.

[Chart: recognition rate (%) versus S (section number) in {5, 7, 9, 12, 15, 18} for kNN (k = 2 to 6), LVQ N.N. and BP N.N.]

Fig. 8. Hand gesture recognition rates of first method


[Chart: recognition rate (%) versus S (section number) in {5, 7, 9, 12, 15, 18} for kNN (k = 2 to 6), LVQ N.N. and BP N.N.]

Fig. 9. Hand gesture recognition rates of second method

[Chart: recognition rate (%) versus salt & pepper noise density (0.04 to 0.14) for the first and second methods]

Fig. 10. Hand gesture recognition rates in presence of salt & pepper noise with various noise density

[Chart: recognition rate (%) versus Gaussian noise sigma (2 to 5, x10^-3) for the first and second methods]

Fig. 11. Hand gesture recognition rates in presence of Gaussian noise with various σ values

VII. Conclusion

A new algorithm based on spatio-temporal volumes for hand gesture recognition is presented. In this algorithm, after applying preprocessing steps to the video frames, spatio-temporal volumes are constructed for each gesture. These volumes are then analyzed and feature vectors are extracted. Finally, three different classifiers, including k-Nearest Neighbor, Learning Vector Quantization and Back Propagation neural networks, are presented for feature vector classification and hand gesture recognition. We tested the proposed algorithm with the collected data set, and the experimental results showed the reliability of our algorithm even in the presence of noise. In addition, the proposed algorithm is robust to traditional problems of gesture recognition such as illumination variations, the initial hand position in each frame, and different hand gesture velocities and hand sizes, and can be used efficiently for human-computer interaction.

References

[1] G. Derpanis, "A Review of Vision-Based Hand Gestures", York University, 2004.

[2] W. Hai, A. Shamaie and A. Sutherland, "Real Time Gesture Recognition of Human Hand", Online. Available: http://www.computing.dcu.ie/~alistair/CA107.ppt

[3] P.R.G. Harding and T. Ellis, "Recognizing Hand Gesture Using Fourier Descriptors", Proc. of IEEE Conf. on Pattern Recognition (ICPR2004), vol. 3, pp. 286-289, 2004.

[4] C.R.P. Dionisio and R.M. Cesar Jr., "A Project for Hand Gesture Recognition", Proc. of IEEE Symposium on Computer Graphics and Image Processing, p. 345, 2000.

[5] M. Vafadar and A. Behrad, "Human Hand Gesture Recognition Using Image Processing for Human-Computer Interaction", Proc. of Int. Conf. on Information and Knowledge Technology, IKT2007, 2007.

[6] W.T. Freeman and M. Roth, "Orientation Histogram for Hand Gesture Recognition", IEEE Int. Workshop on Automatic Face and Gesture Recognition, 1995.

[7] M. Vafadar and A. Behrad, "Human Hand Gesture Recognition Using Motion Orientation Histogram for Interaction of Handicapped Persons with Computer", Proc. of Int. Conf. on Image and Signal Processing, ICISP2008, LNCS 5099, pp. 378-385, 2008.

[8] N. Ikizler and P. Duygulu, "Human Action Recognition Using Distribution of Oriented Rectangular Patches", Proc. of Workshop on Human Motion, LNCS 4814, pp. 271-284, 2007.

[9] C. Achard, X. Qu, A. Mokhber and M. Milgram, "A Novel Approach for Recognition of Human Actions with Semi-global Features", Journal of Machine Vision and Applications, vol. 19, no. 1, January 2008.

[10] T. Collins, "Analyzing Video Sequences Using the Spatio-temporal Volume", MSc Informatics Research Review, 2004.

[11] N. Liu, B. C. Lovell and P. J. Kootsookos, "Evaluation of HMM Training Algorithms for Letter Hand Gesture Recognition", IEEE Int. Sym. on Signal Processing and Information Technology, ISSPIT2003, 2003.

[12] M.C. Chang, L. Matshoba and S. Preston, "A Gesture Driven 3D Interface", Technical Report CS05-15-00, University of Cape Town, 2005.

[13] J. M. Salles, P. Nande, N. Barata and A. Correia, "O.G.R.E - Open Gesture Recognition Engine", IEEE Proc. on Computer Graphics and Image Processing, 2004.
