Human action recognition using differential motion

Recently, Convolutional Neural Networks (CNNs) have delivered state-of-the-art results in classifying images of objects, complex events, and scenes. The LSTM is also trained to classify abnormal frames while extracting the temporal features of the frames.

INTRODUCTION

  • Introduction
    • Video Retrieval
    • Video Surveillance
    • Health Care
    • Human-Computer Interaction
  • Motivation
  • Research challenges
  • Main Contributions
  • Thesis Structure
  • List of Publications
    • Refereed Conferences

Most state-of-the-art methods recognize human actions based on motion information. Most of the proposed video interest point detectors are three-dimensional (3D) extensions of their 2D (image) counterparts.

LITERATURE SURVEY

Datasets

  • KTH
  • Weizmann
  • UCF11
  • UCF101

KTH is a simple action database consisting of 6 actions: boxing, clapping, hand waving, jogging, running and walking [89]. UCF101 is an action recognition dataset of realistic action videos collected from YouTube [96] with 101 action categories.

Figure 2.1: Example frames of KTH dataset actions.

Representation

  • Holistic Approach
    • Shape-based approaches
    • Motion-based approaches
    • Hybrid approach
  • Local Approach
    • Local detector
  • Learning based methods
    • Dictionary based learning
    • Deep learning based

In [8], a view-invariant human action recognition method was proposed based on contour points with a sequence of multi-view key poses to represent the action. In [78] and [79], a region-based descriptor was developed to represent human actions by extracting features from the surrounding regions (negative space) of the human silhouette.

Discussion

Deep learning-based models for human action recognition require a large amount of video data for training. In this direction, factorized spatio-temporal convolutional networks (FSTCN) were proposed in [98] for human action recognition.

ANALYSIS OF TEMPORAL DIMENSION

Special characteristics of the temporal dimension

  • Qualitative Analysis
  • Quantitative Analysis

These properties of the temporal dimension mean that temporal edges, and consequently spatiotemporal corners, are rarely encountered [56]. Thus, to detect interesting motion, focusing on locally unique features of the temporal stream itself is likely to yield better results.

Our early attempt at treating the temporal dimension differently

  • Initial set of points
  • Trajectory extraction
    • Tracking with Variational flow
    • Filtering the trajectories
  • Interest Point Detection
    • Ramer-Douglas-Peucker Algorithm
  • Experiments and Results
    • Evaluation dataset
    • Quantitative Analysis
    • Effect of scale
    • Effect of compression
    • Comparative study
  • Discussion

To calculate the flow, we use the state-of-the-art algorithm of Brox et al. A candidate keypoint trajectory consists of the sequential order of predicted keypoint positions at each instant within the video. As shown in Fig. 3.5, the algorithm performs remarkably well in preserving the shape of the trajectory with high accuracy.
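
A minimal sketch of the trajectory-simplification step, assuming each tracked keypoint trajectory is available as one (x, y) position per frame; the recursive Ramer-Douglas-Peucker routine below and its tolerance value are illustrative, not the exact implementation or parameters used in the thesis. The retained vertices mark frames where the trajectory changes direction and thus serve as candidate temporal interest points.

    import numpy as np

    def rdp(points, epsilon):
        """Ramer-Douglas-Peucker simplification of a 2D trajectory.

        points  : (N, 2) array of keypoint positions, one row per frame.
        epsilon : maximum allowed perpendicular deviation; larger values
                  keep fewer points.
        Returns the indices of the retained points, which serve as
        candidate temporal interest points.
        """
        points = np.asarray(points, dtype=float)
        start, end = points[0], points[-1]
        chord = end - start
        chord_len = np.linalg.norm(chord)
        if chord_len == 0:                          # degenerate: start == end
            dists = np.linalg.norm(points - start, axis=1)
        else:
            # Perpendicular distance of every point to the chord start -> end.
            dists = np.abs(chord[0] * (points[:, 1] - start[1])
                           - chord[1] * (points[:, 0] - start[0])) / chord_len
        idx = int(np.argmax(dists))
        if dists[idx] > epsilon:
            # Split at the farthest point and simplify both halves recursively.
            left = rdp(points[:idx + 1], epsilon)
            right = rdp(points[idx:], epsilon)
            return left[:-1] + [i + idx for i in right]
        return [0, len(points) - 1]

    # Toy trajectory: nearly straight, with a sharp direction change at frame 3.
    traj = np.array([[0, 0], [1, 0.1], [2, 0.0], [3, 2.0], [4, 2.1], [5, 2.0]])
    print(rdp(traj, epsilon=0.5))   # indices of frames where the trajectory bends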

To test the invariance of the points of interest, scaled and compressed versions of the videos were added to the dataset. For each video taken from the KTH database, the compressed and scaled versions of the video were also included in the evaluation set to test the performance of our method. The comparisons of the different methods for different values of ε are listed in Table 3.5 to Table 3.12.
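
A minimal sketch of how such an evaluation set can be built, assuming OpenCV is used for frame-level scaling and a JPEG encode/decode round trip stands in for video compression; the scale factor and quality level are illustrative, not the values used in the experiments.

    import cv2

    def frame_variants(video_path, scale=0.5, jpeg_quality=30):
        """Yield (original, scaled, compressed) versions of each frame of a
        video, used to check whether detected interest points remain stable
        under scaling and compression."""
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Spatially scaled version of the frame.
            scaled = cv2.resize(frame, None, fx=scale, fy=scale,
                                interpolation=cv2.INTER_AREA)
            # Compression simulated by a JPEG encode/decode round trip.
            ok, buf = cv2.imencode('.jpg', frame,
                                   [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
            compressed = cv2.imdecode(buf, cv2.IMREAD_COLOR)
            yield frame, scaled, compressed
        cap.release()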

Figure 3.3: System diagram for interest point detection

ACTION RECOGNITION USING LOCAL FEATURES

A flow-based interest point detector for action recognition in videos

  • Proposed method
  • Experimental Results
    • Stability and Robustness
    • Action Recognition
    • Discussion

The curl is non-zero because the line integral of the optical flow around a closed loop at a boundary point with tangential motion (e.g., the top of the head) gets a unidirectional contribution from the moving side of the boundary (e.g., the head with tangential flow) and no contribution from the stationary side (e.g., the background above the head, which has no flow). The locations of interest points on moving objects detected by the proposed method are thus predictable. We evaluated the stability and robustness of the interest points detected by the proposed method.
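
A minimal sketch of the underlying quantity, assuming a dense optical flow field is available (OpenCV's Farneback flow is used here purely as a stand-in for the flow algorithm actually employed); curl and divergence are approximated with finite differences, and the threshold shown is illustrative.

    import cv2
    import numpy as np

    def flow_curl_div(prev_gray, next_gray):
        """Curl and divergence of dense optical flow between two frames.

        By Green's theorem, the line integral of the flow around a small
        closed loop equals the curl (tangential component) or the divergence
        (normal component), so large magnitudes mark motion boundaries.
        """
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        u, v = flow[..., 0], flow[..., 1]
        du_dy, du_dx = np.gradient(u)           # finite-difference derivatives
        dv_dy, dv_dx = np.gradient(v)
        curl = dv_dx - du_dy
        div = du_dx + dv_dy
        return curl, div

    # Candidate interest points: locations where |curl| exceeds a threshold.
    # curl, div = flow_curl_div(frame_t, frame_t_plus_1)
    # ys, xs = np.where(np.abs(curl) > 0.5)     # fixed threshold, illustrative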

It can be observed that the average displacement of interest points is smaller for the proposed method compared to other methods. Fig. 4.4 shows that the points of interest detected by the proposed method appear only on relative motion boundaries. The results of combining the proposed detector with local descriptors are shown in Table 4.1.

Figure 4.1: Illustration of the optical flow of an object boundary with line integrals for divergence and curl.

Action recognition using temporally localized interest points

  • Proposed Algorithm
  • Experiment and results
  • Discussion

These innovations have delivered substantial improvements over the state of the art on benchmark datasets. We compared the performance of our proposed technique with several state-of-the-art and some pioneering legacy techniques on simple and complex video action datasets. Action recognition results: Table 4.3 and Table 4.4 show a comparison of the action recognition performance of the proposed method and state-of-the-art methods.

The KTH dataset contains simple actions performed in a constrained environment. Importantly, for the UCF11 dataset, the proposed method outperformed all state-of-the-art methods, including non-IP-based ones, as shown in Table 4.4. Class separation: We have also shown the discriminative ability of the proposed video representation by comparing the inter- and intra-class mean square distances in Fig.
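
A minimal sketch of the class-separation measure, assuming each video is already encoded as a fixed-length representation vector with an integer action label; a discriminative representation should show a much larger inter-class than intra-class mean squared distance.

    import numpy as np

    def class_separation(features, labels):
        """Intra- vs. inter-class mean squared distance of video representations.

        features : (N, D) array, one representation vector per video.
        labels   : (N,) array of action class ids.
        """
        features = np.asarray(features, dtype=float)
        labels = np.asarray(labels)
        # Pairwise squared Euclidean distances between all representations.
        sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
        same = labels[:, None] == labels[None, :]
        off_diag = ~np.eye(len(labels), dtype=bool)
        intra = sq[same & off_diag].mean()      # same class, different videos
        inter = sq[~same].mean()                # different classes
        return intra, inter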

Figure 4.6: Illustration shows the trajectory of an interest point. Red dots are the temporally localized points.

Interest Point Detector for Videos based on Differential Motion

  • Adaptive threshold for curl of optical flow
  • Scale-adaptive interest points
  • Appearance feature extraction
    • Location feature extraction
  • Experiments and Results
    • Datasets used
    • Robustness and stability of interest point detector
    • Repeatability
    • Displacement
    • Comparison of fixed threshold and adaptive threshold
  • Qualitative comparison of interest point detectors
    • Comparison of interest point detectors and descriptors for action recognition
    • Application of the proposed video representation to action recognition

Most recent interest point feature methods do not consider the location of the interest point when computing the feature. To do this, we form a histogram of the distances of an interest point with respect to its neighboring interest points, as shown in Fig. In this section, we describe the experimental settings, the datasets used, and the performance evaluation of the detector and the proposed interest point descriptor.
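
A minimal sketch of such a location feature, assuming interest point coordinates are available for a frame or video volume; the neighbourhood size, bin count, and distance range are illustrative, and the resulting histogram would be concatenated with the appearance descriptor (e.g., HOG-HOF).

    import numpy as np

    def location_histogram(points, idx, k=10, n_bins=8, max_dist=100.0):
        """Histogram of distances from interest point `idx` to its k nearest
        neighbouring interest points; appended to the appearance descriptor
        so that the relative layout of detections is also encoded.

        points : (N, 2) array of interest point locations.
        """
        points = np.asarray(points, dtype=float)
        d = np.linalg.norm(points - points[idx], axis=1)
        d = np.sort(d)[1:k + 1]                     # drop the point itself
        hist, _ = np.histogram(d, bins=n_bins, range=(0.0, max_dist))
        return hist / max(hist.sum(), 1)            # normalised histogram

    # descriptor = np.concatenate([hog_hof, location_histogram(pts, i)])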

We have compared the visual quality of the proposed detector with state-of-the-art methods. We have compared the proposed point of interest detector with state-of-the-art detectors along with different descriptors as shown in Table 4.6 and Table 4.7. The proposed method outperformed all the state-of-the-art methods on KTH and UCF11 datasets.

Figure 4.9: DOG scale-space pyramid [67].

Discussion

HUMAN ACTION RECOGNITION USING GLOBAL FEATURES

Introduction

The projection of a depth motion map is easy to understand, but the same projections do not make sense in the case of differential motion. Therefore, we proposed another method, in which the projections onto the Cartesian planes are simple and interpretable, without significantly affecting accuracy.

Proposed Method 1

  • Feature extraction
    • Absolute motion
    • Spatio-temporal differential motion
    • Temporal differencing to capture acceleration
    • Projection to Cartesian planes
    • Resizing DMMs to a constant dimension
  • Compact feature representation
    • Principal component analysis (PCA)
    • Semi-supervised non-negative matrix factorization (SSNMF)
    • Kernel principal component analysis (KPCA)
  • Classification
    • L2-regularized collaborative classifier (LRCC)
    • Support vector machine (SVM)
    • Random forest (RF)
  • Experiments and Results
    • Datasets used
    • Experimental setup
    • Design choices
    • Comparison of dimension reduction techniques

This is because the movement of the background relative to the camera also produces a non-zero optical flow. The x dimension is eliminated using Equation 5.6 to obtain the differential motion map for the side-view (yt-plane) projection. We tried to match the dimension reduction technique to the nature of the extracted features (vectorized DMMs).
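
A minimal sketch of the projection idea, assuming the divergence magnitude of dense optical flow is the differential-motion quantity being accumulated (OpenCV's Farneback flow is a stand-in for the actual flow algorithm) and that each plane projection is obtained by summing the motion volume along one axis; the output size is illustrative, and Equation 5.6 itself is not reproduced here.

    import cv2
    import numpy as np

    def differential_motion_maps(frames, out_size=(64, 64)):
        """Accumulate the divergence magnitude of optical flow and project it
        onto the xy, yt and xt planes, then resize each map to a fixed size.

        frames : list of grayscale frames (H, W), all of the same size.
        Returns three 2D maps analogous to depth-motion-map projections.
        """
        H, W = frames[0].shape
        T = len(frames) - 1
        vol = np.zeros((T, H, W), dtype=np.float32)     # motion volume
        for t in range(T):
            flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t + 1], None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            u, v = flow[..., 0], flow[..., 1]
            du_dy, du_dx = np.gradient(u)
            dv_dy, dv_dx = np.gradient(v)
            vol[t] = np.abs(du_dx + dv_dy)              # divergence magnitude
        front = vol.sum(axis=0)     # xy-plane: collapse the temporal dimension
        side = vol.sum(axis=2)      # yt-plane: collapse the x dimension
        top = vol.sum(axis=1)       # xt-plane: collapse the y dimension
        return [cv2.resize(m, out_size) for m in (front, side, top)]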

SVM is a state-of-the-art linear classifier that maximizes the margin between two classes. For the random forest, the class assigned to each instance is the most frequent vote over the k trees generated. We then describe the results of comparing different dimensionality reduction techniques, hyper-parameter variations, and classifiers.
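
A minimal sketch of the dimension-reduction and classifier comparison using scikit-learn, restricted to PCA, kernel PCA, a linear SVM, and a random forest; SSNMF and the LRCC classifier are not standard library components and are omitted here, and all hyper-parameters shown are illustrative.

    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    def compare_pipelines(X, y, n_components=100):
        """Cross-validated accuracy for combinations of dimension reduction
        (PCA / kernel PCA) and classifiers (linear SVM / random forest).

        X : (N, D) vectorized differential motion maps, y : (N,) action labels.
        """
        reducers = {'PCA': PCA(n_components=n_components),
                    'KPCA': KernelPCA(n_components=n_components, kernel='rbf')}
        classifiers = {'SVM': SVC(kernel='linear'),
                       'RF': RandomForestClassifier(n_estimators=100)}
        for r_name, reducer in reducers.items():
            for c_name, clf in classifiers.items():
                pipe = make_pipeline(reducer, clf)
                acc = cross_val_score(pipe, X, y, cv=5).mean()
                print(f'{r_name} + {c_name}: {acc:.3f}')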

Figure 5.2: Example frames for a hand-clapping video sequence: (a) original frame, (b) optical flow, (c) divergence magnitudes of the optical flow, (d) front-view (xy-plane) differential motion map, (e) side-view (yt-plane) differential motion map, (f)

Table: comparison of PCA and KPCA for different feature dimensions.

  • Hyper-parameter selection
  • Classifier comparison
  • Absolute vs. differential motion
  • Comparison with state-of-the-art
  • Class separation
  • Difference between recognizing simple vs. complex activities
  • Discussion
  • Proposed Method 2
    • Feature extraction
    • Classification
    • Experiments and Results
    • Discussion

We have compared the recognition accuracies of the proposed method with different classifiers and different dimensionality reduction methods, as shown in Fig. As shown in Table 5.3 and Table 5.4, we can clearly see that the proposed method outperformed the state-of-the-art methods. Additionally, Table 5.3 and Table 5.4 show that the best results obtained using SSNMF and LRCC are competitive with the state-of-the-art on the two datasets.

Differential motion captures the motion information very efficiently and shows better performance compared to state-of-the-art methods. Comparison with state-of-the-art: We compare our results with state-of-the-art methods, as shown in Table 5.7. The proposed method performed better than the state-of-the-art methods on both the KTH and UCF11 datasets.

Figure 5.4: Comparison of (a) LRCC (b) SVM and (c) RF with different dimension reduction techniques for UCF11 dataset.

HUMAN ACTION RECOGNITION USING DEEP LEARNING

Introduction

For the same action class, the color and intensity of these pixels can change due to variations in environments, viewpoints, noise, and player movement. Videos of the same action class taken from different angles have a large variation within the class. Differences in lighting and sensors also contribute to differences in pixels of videos of the same action.

Common video transformations, such as compression and scaling when saving or loading a video, also contribute to variations among videos of the same action. Moreover, actions of the same class are rarely performed identically in space and time, even by the same actor. In this work, we focus on one of the atomic activities reported in the ARENA dataset.

Abnormal Event Detection on the BMTT-PETS 2017 Surveillance Challenge

  • Background subtraction
  • Spatial feature extraction
  • Temporal feature extraction
  • Classification
  • Experiment and Results
  • Datasets
    • Results and Discussion
  • CNN+SVM
  • CNN+LSTM
  • CNN+LSTM+SVM
  • CNN+LSTM+SVM+TA
  • Results

In [92], the authors showed that the depth of the network plays an important role in its performance. The task was to predict the start and end frames of the abnormal activity sequence taking place. In this section, we explore the benefits of adding a support vector machine for classification of the features extracted by the CNN.

Discussion: In the second layer of the LSTM network, we used bidirectional LSTM hidden units instead of standard hidden units. Discussion: We now compare the top-performing networks against each other for the final frame predictions. To further improve the classification accuracy and predict the initial and final frames as close as possible to the ground truth, we performed temporal averaging of the predictions of the above-mentioned network (CNN+LSTM+SVM+TA).
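
A minimal sketch of a CNN+LSTM+TA pipeline in PyTorch, assuming a pretrained ResNet-18 as the per-frame feature extractor (the actual CNN used in the challenge entry may differ), a plain LSTM layer followed by a bidirectional one, and a moving-average smoothing of per-frame scores as the temporal-averaging step; all dimensions and the window size are illustrative.

    import torch
    import torch.nn as nn
    from torchvision import models

    class FrameCNN(nn.Module):
        """Per-frame spatial features from a pretrained CNN (final fc removed)."""
        def __init__(self):
            super().__init__()
            backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            self.features = nn.Sequential(*list(backbone.children())[:-1])

        def forward(self, x):                       # x: (B*T, 3, 224, 224)
            return self.features(x).flatten(1)      # (B*T, 512)

    class TemporalLSTM(nn.Module):
        """Plain LSTM layer followed by a bidirectional one; per-frame logits."""
        def __init__(self, feat_dim=512, hidden=128, n_classes=2):
            super().__init__()
            self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True,
                                 bidirectional=True)
            self.fc = nn.Linear(2 * hidden, n_classes)

        def forward(self, feats):                   # feats: (B, T, 512)
            h, _ = self.lstm1(feats)
            h, _ = self.lstm2(h)
            return self.fc(h)                       # (B, T, n_classes)

    def temporal_average(scores, window=9):
        """Moving average of per-frame abnormality scores (the 'TA' step);
        the smoothed curve gives cleaner start/end frame estimates."""
        kernel = torch.ones(1, 1, window) / window
        smoothed = nn.functional.conv1d(scores.view(1, 1, -1), kernel,
                                        padding=window // 2)
        return smoothed.view(-1)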

Figure 6.1: Block diagram of the proposed method.

Discussion

From Table 6.2 we can see that adding LSTM to the CNN improves the results, and further adding SVM improves them again. But in the case of sequence 11-04, it fails to identify the correct frames because this sequence differs from the other two. The model has not seen this type of activity performed so far from the camera and for such a short duration compared to the other two files.

Therefore, the network cannot correctly classify the frames of sequence 11-04 and classifies them as abnormal when a person is walking near a vehicle. We note that the accuracies for 11-03 and 08-02 jump to 96 and 95 percent, respectively, and the predicted initial and final frames are very close to the ground truth. We evaluated these models for one type of activity, but they can be extended to other activities as well.

SUMMARY AND DISCUSSION

Summary

  • Key Contributions

We track each point to form trajectories; temporal localization is then performed based on the direction changes of the trajectories. We form a histogram based on the orientation of neighboring interest points with respect to a given point, which is then concatenated with the HOG-HOF descriptor. We have compared these dimensionality reduction techniques and found that SSNMF gives better results because it takes the labels into account.

As deep learning is becoming popular, we have also tried different combinations of CNN and LSTM for abnormal event detection. Based on the experiments, we can suggest that for small sample sizes, the CNN can be used for feature extraction and an SVM for classification. We also suggested a modification of the interest point descriptor based on the interest point location.

Limitations

Using SVM produces good results, but the classifier is not well trained due to the small sample size.

Future Work


