
Exploiting Deep Learning-Based LSTM Classification for Improving Hand Gesture Recognition to Enhance Visitors’ Museum Experiences




Exploiting Deep Learning-Based LSTM Classification for Improving Hand Gesture Recognition to Enhance Visitors’ Museum Experiences

Item Type: Conference Paper

Authors: Zerrouki, Nabil; Houacine, Amrane; Harrou, Fouzi; Bouarroudj, Riadh; Cherifi, Mohammed Yazid; Sun, Ying

Citation: Zerrouki, N., Houacine, A., Harrou, F., Bouarroudj, R., Cherifi, M. Y., & Sun, Y. (2022). Exploiting Deep Learning-Based LSTM Classification for Improving Hand Gesture Recognition to Enhance Visitors’ Museum Experiences. 2022 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT). https://doi.org/10.1109/3ict56508.2022.9990722

Eprint version: Post-print

DOI: 10.1109/3ict56508.2022.9990722

Publisher: IEEE

Rights: This is an accepted manuscript version of a paper before final publisher editing and formatting. Archived with thanks to IEEE.

Download date: 2024-01-02 18:01:37

Link to Item: http://hdl.handle.net/10754/686732


Exploiting Deep Learning-Based LSTM Classification for Improving Hand Gesture Recognition to Enhance Visitors’ Museum Experiences

1st Nabil Zerrouki, Center for Development of Advanced Technology, DI2M Laboratory, Algiers, Algeria, [email protected]

2nd Amrane Houacine, University of Sciences and Technology Houari Boumédienne (USTHB), LCPTS Laboratory, Algiers, Algeria, [email protected]

3rd Fouzi Harrou, King Abdullah University of Science and Technology, CEMSE Division, Thuwal, Saudi Arabia, [email protected]

4th Riadh Bouarroudj, USTHB, Computer Science Department, Algiers, Algeria, [email protected]

5th Mohammed Yazid Cherifi, USTHB, Computer Science Department, Algiers, Algeria, [email protected]

6th Ying Sun, King Abdullah University of Science and Technology, CEMSE Division, Thuwal, Saudi Arabia

Abstract—Hand gesture recognition (HGR) is one of the main axes of the Human-Computer Interaction (HCI) research field, and computer vision is a very active research area dedicated to it. However, traditional vision-based methods, such as using a fixed camera to record video sequences of sign language, have serious drawbacks inherent to the fixed camera location, complex lighting conditions, and cluttered backgrounds. Motivated by these limitations, the present paper addresses the detection and classification of hand gestures based instead on wearable video monitoring data. A new feature extraction strategy based on five partial occupancy areas of the hand in images is provided, and a deep learning formalism using the Long Short-Term Memory (LSTM) algorithm has been implemented to ensure an effective separation between classes. To analyze the classification performance, available data have been used for experiments, and both virtual and real museum scenarios are considered. The obtained results demonstrate that the combined five area ratios and LSTM classification not only recognize different hand gestures but also distinguish between actions with a high degree of similarity (such as the slide-left and slide-right classes). The use of the deep learning-based LSTM algorithm in the classification phase significantly reduced the number of misclassifications and achieved outstanding recognition performance when challenged with real-world data.

Index Terms—Hand gesture recognition, wearable vision, ego-centric vision devices, feature extraction, LSTM classification.

I. INTRODUCTION

In view of the important development of media management techniques and intelligent vision devices, the “people’s activity understanding” research field has received significant interest. Numerous works have been developed during the last few years, addressing topics such as smart rooms, environment modeling [1], people detection [2], [3], face detection and identification [4], and motion and action estimation [5]. In this paper, we have selected one of the most interesting topics in the scene analysis field, known as “hand gesture recognition”. This field has evolved greatly due to its potential applications in several areas such as human-computer interaction, robotics, and user interface communication [6]. Hand signs may also have special consideration in medical applications for seniors’ health care, where natural interaction for diverse needs can be provided by generating particular motions [7].

In the present work, we have focused on the interpretation of some hand signs commonly encountered in museums and in historical or cultural sites.

According to several studies conducted by the American Office of Travel and Tourism Industries and the European Travel and Tourism Council, virtual tours, interactive maps, and audio guides allow visitors and interested people to understand the cultural and historical heritage easily and much faster [8].

In this way, many researchers and engineers have developed hand gesture recognition systems, and an important effort is ongoing to propose new smart solutions adapted to cultural and historical patrimony. Sparacino proposed a wearable device using an augmented reality tool offering audiovisual narration [9]. This device is based on statistical mathematical modeling but does not use any computer vision or machine learning algorithm for understanding the visitor’s interaction with the environment. Kuusik et al. proposed a “SmartMuseum” application for mobile users using PDAs and RFIDs [10]. Visitors can obtain information about the content of the cultural site and, especially, personalize the visit according to their interests. Based on learning probabilistic models, Nowozin and Shotton proposed the concept of “action points” to analyze human actions using the Hidden Markov Model algorithm [11]. The choice of HMM was motivated by its properties in resolving classification problems with sequential data. One more temporal model was proposed by Wang et al. to classify human gestures, which added a discriminative hidden state to the hidden conditional random field algorithm [12]. However, the use of only one layer of hidden states might not be powerful enough to train a higher-level representation and may have limitations with problems requiring a large amount of data.

In [6], Baraldi et al. proposed a hand gesture recognition system based on ego-vision embedded devices. A trajectory descriptor, histograms of oriented gradients (HOG), histograms of optical flow (HOF), and motion boundary histograms (MBH) are extracted and utilized as attributes forming the input of an SVM classifier. In [13], Paci proposed the use of interest points and local descriptors as attributes; the recognition is based on matching template images using the random sample consensus (RANSAC) algorithm.

The present work was initially motivated by the growing attention to ego-centric vision, which inspires the development of novel interfaces to interact with the cultural and historical patrimony. The choice of the hand gesture recognition problem was also encouraged by the context of a movable camera (since it is embedded). The context of a mobile camera is not frequently considered, and the majority of current studies are based on fixed cameras. Gesture interpretation with a non-fixed camera (directly attached to the user) remains very challenging for a vision-based approach since a wearable camera generates, in some cases, poor-quality and noisy images. One more point that inspired our work relates to the feature extraction phase. In our former work, we proposed the use of partial occupancy areas to describe human postures for scene analysis applications [2]. Based on the encouraging results obtained on human activity recognition, in this work we apply a similar approach to characterize hand movements. The method described in this paper aims to reduce errors in separating different hand gestures using a deep learning framework, where the Long Short-Term Memory (LSTM) algorithm is applied to classify the data.

The remaining parts of this work are structured as follows: Section II presents a global system overview. In Section III, the hand detection and segmentation steps are reported. Section IV defines the use of area ratios as human hand attributes. Section V details the LSTM formalism and outlines its usage in hand gesture recognition. The results obtained by the proposed scheme on the publicly available datasets, together with comparative tests against state-of-the-art methods, are presented in Section VI. Finally, Section VII concludes this work with some remarks.

II. SYSTEM OVERVIEW

In order to better recapitulate the main phases of this work, a flowchart summarizing the conducted methodology is illustrated in Figure 1. The system executes three main phases: human hand segmentation, feature generation, and hand gesture recognition.

In this work, the person’s hand is first detected and then segmented by eliminating stationary pixels within frames. Once the hand is segmented in each image, five hand occupancy area ratios are calculated and used as features. The use of area ratios yields a reasonable dimensionality of attributes, so no prior selection phase or dimensionality reduction technique is required. The last module performs hand gesture classification, where a deep learning model based on the LSTM algorithm is adopted to model hand gestures. The idea of using LSTM is mainly related to its suitability for sequential data and data presenting variability in time.

Fig. 1. The proposed system flow.

In addition, the results of classifiers used in various proposed solutions (as in the case of SVMs) are not really significant when the sequential aspect is not taken into account.

III. HAND DETECTION

The segmentation phase involves extracting the human hand from each RGB input image. In this work, human hands are distinguished using a combination of (i) a background subtraction procedure and (ii) a skin detection technique. First, a background template is defined from stationary pixels over successive images to eliminate unchanged regions in the sequence. This technique is relatively efficient because the decision is based on several difference operations between frames.

Fig. 2. Results of the background subtraction algorithm.

These difference operations are based on threshold values established between each pair of successive images. This method is best suited to a fixed background but remains acceptable for the application at hand since the background template is defined from a limited number of successive images.

The second step of human hand detection is accomplished by applying a skin detection technique. Due to the use of a moving camera under varying illumination, some falsely segmented pixels are present in the resulting image (after the image subtraction). These pixels are mainly related to the presence of shadows or to light reflection effects. To refine the hand detection result, skin-color-based detection is applied. Figure 2 illustrates some hand segmentation cases, presenting the background templates (column 1), input frames (column 2), and detected hands (column 3), while the fourth column represents the segmentation results after applying erosion and dilation functions.
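To make this two-stage segmentation concrete, the following minimal sketch combines frame differencing against the background template with a skin-color mask and a morphological cleanup. It is an illustrative reconstruction assuming OpenCV and NumPy; the difference threshold and the YCrCb skin bounds are hypothetical values, since the paper does not report its exact thresholds or color space.

```python
import cv2
import numpy as np

def segment_hand(frame_bgr, background_bgr, diff_thresh=30):
    """Rough hand segmentation: background subtraction refined by skin color.

    `diff_thresh` and the YCrCb skin bounds below are illustrative values,
    not the ones used in the paper.
    """
    # (i) Background subtraction: keep pixels that changed w.r.t. the template.
    diff = cv2.absdiff(frame_bgr, background_bgr)
    moving = (diff.max(axis=2) > diff_thresh).astype(np.uint8) * 255

    # (ii) Skin detection: keep moving pixels whose color falls in a skin range.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    skin = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    mask = cv2.bitwise_and(moving, skin)

    # Morphological cleanup (erosion then dilation), as in column 4 of Fig. 2.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```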

IV. FEATURE EXTRACTION

Feature generation is considered a critical phase in video-based recognition or classification, since the generated characteristics directly impact the recognition rate. In the present application, the extracted characteristics have to address (i) the translation problem (related to different positions of the hand in the image) and (ii) the scaling problem (related to the hand size in the image), which is proportional to the distance separating the visualized hand from the camera (placed on the user’s head).

Numerous works have preferred to use shape information to describe the human hand. However, most of the proposed shape-based attributes cannot differentiate among some hand classes, particularly when hand gestures present a high degree of resemblance.

For this purpose, we avoid modeling the human hand as a geometric shape. In this application, the characteristics are instead based on the pixels constituting the hand. More specifically, we apply five fractional occupancy region ratios of the hand to differentiate the class to which it belongs (Figure 3). The partitioning is defined by five sectors from the center of mass (or gravity) $(x_G, y_G)$ of the hand, which is simply the barycenter of the $N$ pixels constituting the hand:

$$x_G = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad y_G = \frac{1}{N}\sum_{i=1}^{N} y_i,$$

where $x_i$, $y_i$, and $N$ denote respectively the horizontal coordinate and vertical coordinate of each pixel, and the number of pixels belonging to the hand.

The first two sectors are delimited at 45° from the vertical (pixels in green and cyan), while the third and fourth sectors are delimited at 100° from the first two sectors, respectively (pixels in yellow and magenta). The remaining area represents the last sector (pixels in blue). In order to preserve scale invariance, a normalization step is applied to the five computed areas by dividing each area value by the whole hand area.
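As an illustration of this feature computation, the sketch below derives the barycenter of a binary hand mask, assigns each hand pixel to one of five angular sectors around it, and normalizes the sector areas by the total hand area. The exact angular boundaries used here are an assumption on our part (the paper’s 45° and 100° sector limits are only loosely specified), so this is a sketch of the idea rather than the authors’ implementation.

```python
import numpy as np

def area_ratio_features(mask, boundaries=(45.0, 145.0, 215.0, 315.0)):
    """Five fractional occupancy area ratios of a binary hand mask.

    `boundaries` are illustrative sector limits in degrees, measured
    clockwise from the upward vertical at the centroid; they only
    approximate the partitioning described in the paper.
    """
    ys, xs = np.nonzero(mask)            # coordinates of the N hand pixels
    n = xs.size
    if n == 0:
        return np.zeros(5)
    xg, yg = xs.mean(), ys.mean()        # barycenter (x_G, y_G)

    # Angle of each pixel around the centroid, mapped to [0, 360).
    # Image rows grow downward, hence the sign flip on the y offset.
    theta = np.degrees(np.arctan2(xs - xg, -(ys - yg))) % 360.0

    # Assign pixels to the five sectors and normalize by the whole hand
    # area, which keeps the features scale invariant.
    sector = np.digitize(theta, boundaries)          # values in 0..4
    return np.bincount(sector, minlength=5) / n
```

The resulting five-dimensional vector plays the role of $AR_t$ for frame $t$; concatenating it over a sequence of frames yields the $AR_1, \ldots, AR_T$ input used by the LSTM in the next section.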

V. LSTM-BASED HAND GESTURE RECOGNITION

The recognition framework consists of an LSTM encoder that generates the temporal attributes of the hand motion profile sequences (constructed by concatenating the area ratios) and a softmax layer that calculates the conditional probabilities of belonging to each of the seven hand gesture classes.

The main advantage of using an LSTM encoder is that it can handle temporal attributes of different lengths, depending on the classes (static or dynamic) and the different ways in which people perform these gestures. LSTM can also model long-term dependencies by moderating the vanishing gradient problem, which is often encountered in other recurrent formalisms such as plain RNNs [14]–[16].

Fig. 3. Samples of the hand partitioning phase.


The LSTM algorithm receives as input the concatenated area-ratio sequence $AR_1, \ldots, AR_T$, corresponding to the sequence of frames. The LSTM components, namely the input gate ($i_t$), the forget gate ($f_t$), the memory cell ($c_t$), the output gate ($o_t$), and the hidden state ($h_t$), at each time instant $t$ can be defined as follows:

$$i_t = \sigma(w_{mi}\,AR_t + w_{hi}\,h_{t-1} + b_i)$$
$$f_t = \sigma(w_{mf}\,AR_t + w_{hf}\,h_{t-1} + b_f)$$
$$o_t = \sigma(w_{mo}\,AR_t + w_{ho}\,h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(w_{mc}\,AR_t + w_{hc}\,h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ represents the sigmoid function and $\odot$ represents the element-wise product. The instant $t$ varies from 1 to $T$. The three quadruplets $[w_{mi}, w_{mf}, w_{mo}, w_{mc}]$, $[w_{hi}, w_{hf}, w_{ho}, w_{hc}]$, and $[b_i, b_f, b_o, b_c]$ represent the weights, diagonal weights, and bias vectors of the input, forget, output, and cell vectors, respectively. The encoder vector $V$ is represented by the hidden state at the whole length $T$ of the hand gesture data instance:

$$V = h_T.$$

Lastly, the encoded vector $V$ is fed to the softmax layer $S$ in order to convert it into the probabilities of belonging to the hand gesture classes. For the estimation of the conditional probabilities $s$, the sigmoid (logistic) function is generalized to the softmax form, where the conditional probability of class $j$ can be expressed as:

$$s_j = p(y = j \mid V; \theta) = \frac{\exp(w_j^{\top} V)}{\sum_{k=1}^{7} \exp(w_k^{\top} V)},$$

where $\theta$ represents the model parameters. In the learning process, the classification layer computes the cross-entropy loss, which is used as the loss function for separating the different classes:

$$L = -\sum_{i}\sum_{j} y_{ij} \log s_{ij},$$

where $y_{ij}$ represents the one-hot encoding of the label of sample $i$. Finally, the model parameters $\theta$ are re-estimated after each back-propagation stage, with gradient $G_t$, via the RMSProp method [14], [17]:

$$E[G^2]_t = \gamma\, E[G^2]_{t-1} + (1-\gamma)\, G_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[G^2]_t + \epsilon}}\, G_t,$$

where the parameters $\gamma$, $\eta$, and $L$ denote respectively the momentum, the learning rate, and the loss function averaged over a mini-batch.
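For concreteness, a compact sketch of this recognition pipeline is given below: an LSTM encoder whose final hidden state $h_T$ serves as the vector $V$, followed by a linear layer whose softmax output gives the class probabilities, trained with the cross-entropy loss and RMSProp. This is a hedged illustration in PyTorch; the hidden size, learning rate, and momentum are placeholder values, not the settings from Table I.

```python
import torch
import torch.nn as nn

class LSTMGestureClassifier(nn.Module):
    """LSTM encoder + softmax layer over the 7 gesture classes (sketch)."""

    def __init__(self, n_features=5, hidden_size=64, n_classes=7):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, ar_seq):
        # ar_seq: (batch, T, 5) sequences of area-ratio vectors AR_1..AR_T.
        _, (h_n, _) = self.encoder(ar_seq)
        v = h_n[-1]                    # encoder vector V = h_T
        return self.classifier(v)      # logits; softmax is applied in the loss

model = LSTMGestureClassifier()
criterion = nn.CrossEntropyLoss()      # cross-entropy over softmax outputs
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(ar_seq, labels):
    """One back-propagation stage followed by an RMSProp parameter update."""
    optimizer.zero_grad()
    loss = criterion(model(ar_seq), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```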

VI. EXPERIMENTAL RESULTS

A. Dataset description

To investigate the proposed strategy’s performance and evaluate its abilities in hand gesture recognition, we propose the use of two new hand sign classification databases acquired via ego-centric vision in virtual and real museum situations.

For both databases, the experimental video sequences are publicly available from the AImageLab research laboratory, University of Modena and Reggio Emilia, Italy [6], [13]. The first dataset, namely the Interactive Museum dataset, contains 700 video samples, acquired at 25 RGB frames per second with a resolution of 800×450. Several operators perform seven hand signs, namely: like, dislike, point, ok, slide left to right, slide right to left, and take a picture. Some hand signs (the point, ok, like, and dislike gestures) are static, while the rest (the two slide gestures) are dynamic.

The second dataset, namely the Maramotti database, also consists of 700 video sequences. The difference from the first database is that it is recorded under real conditions, in the Maramotti modern art museum, in which artworks, sculptures, and artistic paintings are exposed. In the Maramotti dataset, the recognition task is more challenging since the operators perform the different hand gestures in real conditions involving challenging environments, such as different illumination conditions, the presence of other people (visitors) in the scene, etc. [13].

B. Hand gesture classification results and discussion

To assess the performance of the investigated methods, several statistical metrics have been employed [18]. The LSTM classification is conducted and compared with several algorithms, such as the neural network, k-nearest neighbor, support vector machine, and Naïve Bayes methods. In the training stage, each algorithm’s optimal parameters are selected (Table I).

For testing, the obtained evaluation performance statistics are illustrated in Table II. The LSTM results present the highest recognition rate of 0.97 and an AUC of 0.965, which demonstrates its capacity to identify different hand gestures with a relatively small number of false negatives and a high detection rate, even in challenging situations corresponding to degraded illumination conditions, the presence of shadows and light reflections, and the presence of other visitors in the scene, who are considered as moving parts.
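For reference, the two headline metrics can be computed from the per-class softmax outputs as in the generic scikit-learn sketch below; the one-vs-rest averaging chosen for the multi-class AUC is our assumption, since the paper defers its metric definitions to [18].

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(y_true, y_prob):
    """Accuracy and multi-class AUC from per-class softmax probabilities.

    y_true: (n,) true class indices; y_prob: (n, 7) softmax outputs
    produced by the trained classifier on the test split.
    """
    acc = accuracy_score(y_true, y_prob.argmax(axis=1))
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr")  # one-vs-rest
    return acc, auc
```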

One can also note that some classical algorithms, such as the KNN, Naïve Bayes, and neural network classifiers, achieve low accuracies (0.87, 0.528, and 0.90, respectively) compared to the rest of the methods.

Results in Table III clearly illustrate the superiority of the LSTM-based classification compared to the other explored methods in distinguishing hand gestures.

In the case of the matching-based technique proposed in [13], one can perceive several misclassifications, which led to a relatively low recognition ratio (around 71%).


TABLE I

THE PARAMETERS SELECTED FOR EACH CLASSIFICATION METHOD.

TABLE II

CLASSIFICATION PERFORMANCE COMPARISON.

In the work presented in [19], where a sequential formalism based on HMM was conducted, the accuracy is relatively acceptable, with an overall rate of 87.4%. However, this recognition is based on the X, Y, and Z directions and the inclination of each hand and each finger (generated as hand gesture features).

Indeed, there are several situations in which the hand posture or the fingers’ positions can hardly be distinguished due to illumination conditions or partial occlusions of parts of the hand in the frame.

In [6], hand gesture classification using SVMs is applied. Several features are utilized, namely: the trajectory descriptor, HOG, HOF, and MBH. The performances achieved in that work revealed a certain efficiency in separating the different hand gestures, where overall accuracies of 91% and 93% were respectively obtained. This efficiency is because the geometric aspect (in separating the hand features) is considered, and a sparse solution via structural risk minimization is applied.

TABLE III

COMPARISON OF HAND GESTURE RECOGNITION METHODS.

Finally, from Table III, one can conclude that the LSTM approach outperformed the other considered algorithms by achieving a satisfying accuracy of 97.07%, and it offers a promising approach to the hand gesture recognition problem.

This performance of the LSTM in discriminating different hand postures and gestures is due to the consideration of the sequential aspect present in the data, which is necessary to distinguish between the different classes.

VII. CONCLUSION

Accurate hand gesture recognition is undoubtedly an essential tool for supporting effective interaction between humans and machines. This paper presents an automatic deep learning approach based on the LSTM algorithm. Instead of a fixed camera, a wearable camera offers the opportunity to visualize what the user perceives in the surrounding environment. After comparison with other advanced approaches, including HMM, SVM, RANSAC, KNN, the Bayes classifier, and the neural network, one can conclude that the proposed methodology, using area ratios as features and an LSTM-based algorithm, can effectively separate hand gesture classes even in some constrained scenarios, such as degraded illumination conditions or the presence of other visitors in the scene, who are considered as moving parts. Finally, the presented application (hand gesture recognition) allows visitors and interested people to understand cultural heritage easily and much faster, and especially permits personalizing the visit according to their interests.

REFERENCES

[1] R. Scona, M. Jaimez, Y. R. Petillot, M. Fallon, and D. Cremers, “StaticFusion: Background reconstruction for dense RGB-D SLAM in dynamic environments,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 3849–3856.

[2] N. Zerrouki and A. Houacine, “Combined curvelets and hidden Markov models for human fall detection,” Multimedia Tools and Applications, vol. 77, no. 5, pp. 6405–6424, 2018.

[3] N. Zerrouki, F. Harrou, Y. Sun, and A. Houacine, “Accelerometer and camera-based strategy for improved human fall detection,” Journal of Medical Systems, vol. 40, no. 12, pp. 1–16, 2016.

[4] R. Ranjan, A. Bansal, J. Zheng, H. Xu, J. Gleason, B. Lu, A. Nanduri, J.-C. Chen, C. D. Castillo, and R. Chellappa, “A fast and accurate system for face detection, identification, and verification,” IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 1, no. 2, pp. 82–96, 2019.

[5] N. Zerrouki, F. Harrou, Y. Sun, and A. Houacine, “Vision-based human action classification using adaptive boosting algorithm,” IEEE Sensors Journal, vol. 18, no. 12, pp. 5115–5121, 2018.

[6] L. Baraldi, F. Paci, G. Serra, L. Benini, and R. Cucchiara, “Gesture recognition using wearable vision sensors to enhance visitors’ museum experiences,” IEEE Sensors Journal, vol. 15, no. 5, pp. 2705–2714, 2015.

[7] J. Wachs, H. Stern, Y. Edan, M. Gillam, C. Feied, M. Smith, and J. Handler, “A real-time hand gesture interface for medical visualization applications,” in Applications of Soft Computing. Springer, 2006, pp. 153–162.

[8] T. Pencarelli, “The digital revolution in the travel and tourism industry,” Information Technology & Tourism, vol. 22, no. 3, pp. 455–476, 2020.

[9] F. Sparacino, “The museum wearable: Real-time sensor-driven understanding of visitors’ interests for personalized visually-augmented museum experiences,” 2002.

[10] A. Kuusik, S. Roche, and F. Weis, “SmartMuseum: Cultural content recommendation system for mobile users,” in 2009 Fourth International Conference on Computer Sciences and Convergence Information Technology. IEEE, 2009, pp. 477–482.

[11] T. Kuflik, O. Stock, M. Zancanaro, A. Gorfinkel, S. Jbara, S. Kats, J. Sheidin, and N. Kashtan, “A visitor’s guide in an active museum: Presentations, communications, and reflection,” Journal on Computing and Cultural Heritage (JOCCH), vol. 3, no. 3, pp. 1–25, 2011.

[12] S. Nowozin and J. Shotton, “Action points: A representation for low-latency online human action recognition,” Microsoft Research Cambridge, Tech. Rep. MSR-TR-2012-68, 2012.

[13] F. Paci, “Electronic systems with high energy efficiency for embedded computer vision,” 2017.

[14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[15] F. Harrou, F. Kadri, and Y. Sun, “Forecasting of photovoltaic solar power production using LSTM approach,” Advanced Statistical Modeling, Forecasting, and Fault Detection in Renewable Energy Systems, vol. 3, 2020.

[16] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.

[17] R. Khan, A. Hanbury, and J. Stoettinger, “Skin detection: A random forest approach,” in 2010 IEEE International Conference on Image Processing. IEEE, 2010, pp. 4613–4616.

[18] N. Zerrouki, F. Harrou, Y. Sun, and A. Houacine, “A data-driven monitoring technique for enhanced fall events detection,” IFAC-PapersOnLine, vol. 49, no. 5, pp. 333–338, 2016.

[19] Z. Parcheta and C.-D. Martínez-Hinarejos, “Sign language gesture recognition using HMM,” in Iberian Conference on Pattern Recognition and Image Analysis. Springer, 2017, pp. 419–426.
