Human action recognition using differential motion

Recently, Convolutional Neural Networks (CNNs) have delivered state-of-the-art results in classifying images of objects, complex events, and scenes. The LSTM is also trained to classify abnormal frames while extracting the temporal features of the frames.

INTRODUCTION

  • Introduction
    • Video Retrieval
    • Video Surveillance
    • Health Care
    • Human-Computer Interaction
  • Motivation
  • Research challenges
  • Main Contributions
  • Thesis Structure
  • List of Publications
    • Refereed Conferences

Most state-of-the-art methods recognize human actions based on motion information. Most of the proposed video interest point detectors are three-dimensional (3D) extensions of their 2D (image) counterparts.

LITERATURE SURVEY

Datasets

  • KTH
  • Weizmann
  • UCF11
  • UCF101

KTH is a simple action database consisting of 6 actions: boxing, clapping, hand waving, jogging, running and walking [89]. UCF101 is an action recognition dataset of realistic action videos collected from YouTube [96] with 101 action categories.

Figure 2.1: Example frames of KTH dataset actions.

Representation

  • Holistic Approach
    • Shape-based approaches
    • Motion-based approaches
    • Hybrid approach
  • Local Approach
    • Local detector
  • Learning based methods
    • Dictionary based learning
    • Deep learning based

In [8], a view-invariant human action recognition method was proposed based on contour points with a sequence of multi-view key poses to represent the action. In [78] and [79], a region-based descriptor was developed to represent human actions by extracting features from the surrounding regions (negative space) of the human silhouette.

Discussion

Deep learning-based models for human action recognition require a large amount of video data for training. In this direction, factorized spatio-temporal convolutional networks (FSTCN) were proposed in [98] for human action recognition.

ANALYSIS OF TEMPORAL DIMENSION

Special characteristics of the temporal dimension

  • Qualitative Analysis
  • Quantitative Analysis

These properties of the temporal dimension mean that temporal edges, and consequently spatiotemporal corners, are rarely encountered [56]. Thus, to detect interesting motion, focusing on locally unique features of the temporal stream itself is likely to yield better results.

Our early attempt at treating the temporal dimension differently

  • Initial set of points
  • Trajectory extraction
    • Tracking with Variational flow
    • Filtering the trajectories
  • Interest Point Detection
    • Ramer-Douglas-Peucker Algorithm
  • Experiments and Results
    • Evaluation dataset
    • Quantitative Analysis
    • Effect of scale
    • Effect of compression
    • Comparative study
  • Discussion

To calculate the flow, we use the state-of-the-art algorithm of Brox et al. A candidate keypoint trajectory consists of the sequential order of predicted keypoint positions at each instant within the video. As shown in Fig. 3.5, the algorithm performs remarkably well in preserving the shape of the trajectory with high accuracy.
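
A minimal sketch of the trajectory-simplification step, assuming each tracked keypoint trajectory is available as one (x, y) position per frame; the recursive Ramer-Douglas-Peucker routine below and its tolerance value are illustrative, not the exact implementation or parameters used in the thesis. The retained vertices mark frames where the trajectory changes direction and thus serve as candidate temporal interest points.

    import numpy as np

    def rdp(points, epsilon):
        """Ramer-Douglas-Peucker simplification of a 2D trajectory.

        points  : (N, 2) array of keypoint positions, one row per frame.
        epsilon : maximum allowed perpendicular deviation; larger values
                  keep fewer points.
        Returns the indices of the retained points, which serve as
        candidate temporal interest points.
        """
        points = np.asarray(points, dtype=float)
        start, end = points[0], points[-1]
        chord = end - start
        chord_len = np.linalg.norm(chord)
        if chord_len == 0:                          # degenerate: start == end
            dists = np.linalg.norm(points - start, axis=1)
        else:
            # Perpendicular distance of every point to the chord start -> end.
            dists = np.abs(chord[0] * (points[:, 1] - start[1])
                           - chord[1] * (points[:, 0] - start[0])) / chord_len
        idx = int(np.argmax(dists))
        if dists[idx] > epsilon:
            # Split at the farthest point and simplify both halves recursively.
            left = rdp(points[:idx + 1], epsilon)
            right = rdp(points[idx:], epsilon)
            return left[:-1] + [i + idx for i in right]
        return [0, len(points) - 1]

    # Toy trajectory: nearly straight, with a sharp direction change at frame 3.
    traj = np.array([[0, 0], [1, 0.1], [2, 0.0], [3, 2.0], [4, 2.1], [5, 2.0]])
    print(rdp(traj, epsilon=0.5))   # indices of frames where the trajectory bends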

To test the invariance of the points of interest, scaled and compressed versions of the videos were added to the dataset. For each video taken from the KTH database, the compressed and scaled versions of the video were also included in the evaluation set to test the performance of our method. The comparisons of the different methods for different values of ε are listed in Table 3.5 to Table 3.12.
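
A minimal sketch of how such an evaluation set can be built, assuming OpenCV is used for frame-level scaling and a JPEG encode/decode round trip stands in for video compression; the scale factor and quality level are illustrative, not the values used in the experiments.

    import cv2

    def frame_variants(video_path, scale=0.5, jpeg_quality=30):
        """Yield (original, scaled, compressed) versions of each frame of a
        video, used to check whether detected interest points remain stable
        under scaling and compression."""
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Spatially scaled version of the frame.
            scaled = cv2.resize(frame, None, fx=scale, fy=scale,
                                interpolation=cv2.INTER_AREA)
            # Compression simulated by a JPEG encode/decode round trip.
            ok, buf = cv2.imencode('.jpg', frame,
                                   [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
            compressed = cv2.imdecode(buf, cv2.IMREAD_COLOR)
            yield frame, scaled, compressed
        cap.release()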

Figure 3.3: System diagram for interest point detection

ACTION RECOGNITION USING LOCAL FEATURES

A flow-based interest point detector for action recognition in videos

  • Proposed method
  • Experimental Results
    • Stability and Robustness
    • Action Recognition
    • Discussion

The curl is non-zero because the line integral of the optical flow around a closed loop at a boundary point with tangential motion (e.g., the top of the head) gets a unidirectional contribution from the moving side of the boundary (e.g., the head with tangential flow) and no contribution from the stationary side (e.g., the background above the head, which has no flow). The locations of interest points on moving objects detected by the proposed method are thus predictable. We evaluated the stability and robustness of the interest points detected by the proposed method.
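
A minimal sketch of the underlying quantity, assuming a dense optical flow field is available (OpenCV's Farneback flow is used here purely as a stand-in for the flow algorithm actually employed); curl and divergence are approximated with finite differences, and the threshold shown is illustrative.

    import cv2
    import numpy as np

    def flow_curl_div(prev_gray, next_gray):
        """Curl and divergence of dense optical flow between two frames.

        By Green's theorem, the line integral of the flow around a small
        closed loop equals the curl (tangential component) or the divergence
        (normal component), so large magnitudes mark motion boundaries.
        """
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        u, v = flow[..., 0], flow[..., 1]
        du_dy, du_dx = np.gradient(u)           # finite-difference derivatives
        dv_dy, dv_dx = np.gradient(v)
        curl = dv_dx - du_dy
        div = du_dx + dv_dy
        return curl, div

    # Candidate interest points: locations where |curl| exceeds a threshold.
    # curl, div = flow_curl_div(frame_t, frame_t_plus_1)
    # ys, xs = np.where(np.abs(curl) > 0.5)     # fixed threshold, illustrative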

It can be observed that the average displacement of interest points is smaller for the proposed method compared to other methods. Fig. 4.4 shows that the points of interest detected by the proposed method appear only on relative motion boundaries. The results of combining the proposed detector with local descriptors are shown in Table 4.1.

Figure 4.1: Illustration of the optical flow of an object boundary with line integrals for divergence and curl.

Action recognition using temporally localized interest points

  • Proposed Algorithm
  • Experiment and results
  • Discussion

These innovations have delivered substantial improvements over the state of the art on benchmark datasets. We compared the performance of our proposed technique with several state-of-the-art and some pioneering legacy techniques on simple and complex video action datasets. Action recognition results: Table 4.3 and Table 4.4 show a comparison of the action recognition performance of the proposed method and state-of-the-art methods.

The KTH dataset contains simple actions performed in a constrained environment. Importantly, for the UCF11 dataset, the proposed method outperformed all state-of-the-art methods, including non-IP-based ones, as shown in Table 4.4. Class separation: We have also shown the discriminative ability of the proposed video representation by comparing the inter- and intra-class mean square distances in Fig.
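
A minimal sketch of the class-separation measure, assuming each video is already encoded as a fixed-length representation vector with an integer action label; a discriminative representation should show a much larger inter-class than intra-class mean squared distance.

    import numpy as np

    def class_separation(features, labels):
        """Intra- vs. inter-class mean squared distance of video representations.

        features : (N, D) array, one representation vector per video.
        labels   : (N,) array of action class ids.
        """
        features = np.asarray(features, dtype=float)
        labels = np.asarray(labels)
        # Pairwise squared Euclidean distances between all representations.
        sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
        same = labels[:, None] == labels[None, :]
        off_diag = ~np.eye(len(labels), dtype=bool)
        intra = sq[same & off_diag].mean()      # same class, different videos
        inter = sq[~same].mean()                # different classes
        return intra, inter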

Figure 4.6: Illustration shows the trajectory of an interest point. Red dots are the temporally localized points.

Interest Point Detector for Videos based on Differential Motion

  • Adaptive threshold for curl of optical flow
  • Scale-adaptive interest points
  • Appearance feature extraction
    • Location feature extraction
  • Experiments and Results
    • Datasets used
    • Robustness and stability of interest point detector
    • Repeatability
    • Displacement
    • Comparison of fixed threshold and adaptive threshold
  • Qualitative comparison of interest point detectors
    • Comparison of interest point detectors and descriptors for action recognition
    • Application of the proposed video representation to action recognition

Most recent interest point feature methods do not consider the location of the interest point when computing the feature. To do this, we form a histogram of the distances of an interest point with respect to its neighboring interest points, as shown in Fig. In this section, we describe the experimental settings, the datasets used, and the performance evaluation of the detector and the proposed interest point descriptor.
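
A minimal sketch of such a location feature, assuming interest point coordinates are available for a frame or video volume; the neighbourhood size, bin count, and distance range are illustrative, and the resulting histogram would be concatenated with the appearance descriptor (e.g., HOG-HOF).

    import numpy as np

    def location_histogram(points, idx, k=10, n_bins=8, max_dist=100.0):
        """Histogram of distances from interest point `idx` to its k nearest
        neighbouring interest points; appended to the appearance descriptor
        so that the relative layout of detections is also encoded.

        points : (N, 2) array of interest point locations.
        """
        points = np.asarray(points, dtype=float)
        d = np.linalg.norm(points - points[idx], axis=1)
        d = np.sort(d)[1:k + 1]                     # drop the point itself
        hist, _ = np.histogram(d, bins=n_bins, range=(0.0, max_dist))
        return hist / max(hist.sum(), 1)            # normalised histogram

    # descriptor = np.concatenate([hog_hof, location_histogram(pts, i)])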

We have compared the visual quality of the proposed detector with state-of-the-art methods. We have compared the proposed point of interest detector with state-of-the-art detectors along with different descriptors as shown in Table 4.6 and Table 4.7. The proposed method outperformed all the state-of-the-art methods on KTH and UCF11 datasets.

Figure 4.9: DOG scale-space pyramid [67].

Discussion

HUMAN ACTION RECOGNITION USING GLOBAL FEATURES

Introduction

The projection of a depth motion map is easy to understand, but the same projections do not make sense in the case of differential motion. Therefore, we proposed another method, in which the projections onto the Cartesian planes are simple and interpretable, without significantly affecting accuracy.

Proposed Method 1

  • Feature extraction
    • Absolute motion
    • Spatio-temporal differential motion
    • Temporal differencing to capture acceleration
    • Projection to Cartesian planes
    • Resizing DMMs to a constant dimension
  • Compact feature representation
    • Principal component analysis (PCA)
    • Semi-supervised non-negative matrix factorization (SSNMF)
    • Kernel principal component analysis (KPCA)
  • Classification
    • L2-regularized collaborative classifier (LRCC)
    • Support vector machine (SVM)
    • Random forest (RF)
  • Experiments and Results
    • Datasets used
    • Experimental setup
    • Design choices
    • Comparison of dimension reduction techniques

This is because the movement of the background relative to the camera also produces a non-zero optical flow. The x dimension is eliminated using Equation 5.6 to obtain the differential motion map for the side-view (yt-plane) projection. We tried to match the dimension reduction technique to the nature of the extracted features (vectorized DMMs).
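
A minimal sketch of the projection idea, assuming the divergence magnitude of dense optical flow is the differential-motion quantity being accumulated (OpenCV's Farneback flow is a stand-in for the actual flow algorithm) and that each plane projection is obtained by summing the motion volume along one axis; the output size is illustrative, and Equation 5.6 itself is not reproduced here.

    import cv2
    import numpy as np

    def differential_motion_maps(frames, out_size=(64, 64)):
        """Accumulate the divergence magnitude of optical flow and project it
        onto the xy, yt and xt planes, then resize each map to a fixed size.

        frames : list of grayscale frames (H, W), all of the same size.
        Returns three 2D maps analogous to depth-motion-map projections.
        """
        H, W = frames[0].shape
        T = len(frames) - 1
        vol = np.zeros((T, H, W), dtype=np.float32)     # motion volume
        for t in range(T):
            flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t + 1], None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            u, v = flow[..., 0], flow[..., 1]
            du_dy, du_dx = np.gradient(u)
            dv_dy, dv_dx = np.gradient(v)
            vol[t] = np.abs(du_dx + dv_dy)              # divergence magnitude
        front = vol.sum(axis=0)     # xy-plane: collapse the temporal dimension
        side = vol.sum(axis=2)      # yt-plane: collapse the x dimension
        top = vol.sum(axis=1)       # xt-plane: collapse the y dimension
        return [cv2.resize(m, out_size) for m in (front, side, top)]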

SVM is a state-of-the-art linear classifier that maximizes the margin between two classes. For the random forest, the class assigned to each instance is the most frequent vote over the k trees generated. We then describe the results of comparing different dimensionality reduction techniques, hyper-parameter variations, and classifiers.
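
A minimal sketch of the dimension-reduction and classifier comparison using scikit-learn, restricted to PCA, kernel PCA, a linear SVM, and a random forest; SSNMF and the LRCC classifier are not standard library components and are omitted here, and all hyper-parameters shown are illustrative.

    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    def compare_pipelines(X, y, n_components=100):
        """Cross-validated accuracy for combinations of dimension reduction
        (PCA / kernel PCA) and classifiers (linear SVM / random forest).

        X : (N, D) vectorized differential motion maps, y : (N,) action labels.
        """
        reducers = {'PCA': PCA(n_components=n_components),
                    'KPCA': KernelPCA(n_components=n_components, kernel='rbf')}
        classifiers = {'SVM': SVC(kernel='linear'),
                       'RF': RandomForestClassifier(n_estimators=100)}
        for r_name, reducer in reducers.items():
            for c_name, clf in classifiers.items():
                pipe = make_pipeline(reducer, clf)
                acc = cross_val_score(pipe, X, y, cv=5).mean()
                print(f'{r_name} + {c_name}: {acc:.3f}')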

Figure 5.2: Example frames for a hand-clapping video sequence: (a) original frame, (b) optical flow, (c) divergence magnitudes of the optical flow, (d) front-view (xy-plane) differential motion map, (e) side-view (yt-plane) differential motion map, (f)

Table: comparison of PCA and KPCA for different feature dimensions.

  • Hyper-parameter selection
  • Classifier comparison
  • Absolute vs. differential motion
  • Comparison with state-of-the-art
  • Class separation
  • Difference between recognizing simple vs. complex activities
  • Discussion
  • Proposed Method 2
    • Feature extraction
    • Classification
    • Experiments and Results
    • Discussion

We have compared the recognition accuracies of the proposed method with different classifiers and different dimensionality reduction methods, as shown in Fig. As shown in Table 5.3 and Table 5.4, we can clearly see that the proposed method outperformed the state-of-the-art methods. Additionally, Table 5.3 and Table 5.4 show that the best results obtained using SSNMF and LRCC are competitive with the state-of-the-art on the two datasets.

Differential motion captures the motion information very efficiently and shows better performance compared to state-of-the-art methods. Comparison with state-of-the-art: We compare our results with state-of-the-art methods, as shown in Table 5.7. The proposed method performed better than the state-of-the-art methods on both the KTH and UCF11 datasets.

Figure 5.4: Comparison of (a) LRCC (b) SVM and (c) RF with different dimension reduction techniques for UCF11 dataset.

HUMAN ACTION RECOGNITION USING DEEP LEARNING

Introduction

For the same action class, the color and intensity of these pixels can change due to variations in environments, viewpoints, noise, and player movement. Videos of the same action class taken from different angles have a large variation within the class. Differences in lighting and sensors also contribute to differences in pixels of videos of the same action.

Common video transformations, such as compression and scaling when saving or loading a video, also contribute to variations among videos of the same action. Moreover, actions of the same class are rarely performed identically in space and time, even by the same actor. In this work, we focus on one of the atomic activities reported in the ARENA dataset.

Abnormal Event Detection on the BMTT-PETS 2017 Surveillance Challenge

  • Background subtraction
  • Spatial feature extraction
  • Temporal feature extraction
  • Classification
  • Experiment and Results
  • Datasets
    • Results and Discussion
  • CNN+SVM
  • CNN+LSTM
  • CNN+LSTM+SVM
  • CNN+LSTM+SVM+TA
  • Results

In [92], the authors showed that the depth of the network plays an important role in its performance. The task was to predict the start and end frames of the abnormal activity sequence taking place. In this section, we explore the benefits of adding a support vector machine for classification of the features extracted by the CNN.

Discussion: In the second layer of the LSTM network, we used bidirectional LSTM hidden units instead of standard hidden units. Discussion: We now compare the top-performing networks against each other for the final frame predictions. To further improve the classification accuracy and predict the initial and final frames as close as possible to the ground truth, we performed temporal averaging of the predictions of the above-mentioned network (CNN+LSTM+SVM+TA).
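
A minimal sketch of a CNN+LSTM+TA pipeline in PyTorch, assuming a pretrained ResNet-18 as the per-frame feature extractor (the actual CNN used in the challenge entry may differ), a plain LSTM layer followed by a bidirectional one, and a moving-average smoothing of per-frame scores as the temporal-averaging step; all dimensions and the window size are illustrative.

    import torch
    import torch.nn as nn
    from torchvision import models

    class FrameCNN(nn.Module):
        """Per-frame spatial features from a pretrained CNN (final fc removed)."""
        def __init__(self):
            super().__init__()
            backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            self.features = nn.Sequential(*list(backbone.children())[:-1])

        def forward(self, x):                       # x: (B*T, 3, 224, 224)
            return self.features(x).flatten(1)      # (B*T, 512)

    class TemporalLSTM(nn.Module):
        """Plain LSTM layer followed by a bidirectional one; per-frame logits."""
        def __init__(self, feat_dim=512, hidden=128, n_classes=2):
            super().__init__()
            self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True,
                                 bidirectional=True)
            self.fc = nn.Linear(2 * hidden, n_classes)

        def forward(self, feats):                   # feats: (B, T, 512)
            h, _ = self.lstm1(feats)
            h, _ = self.lstm2(h)
            return self.fc(h)                       # (B, T, n_classes)

    def temporal_average(scores, window=9):
        """Moving average of per-frame abnormality scores (the 'TA' step);
        the smoothed curve gives cleaner start/end frame estimates."""
        kernel = torch.ones(1, 1, window) / window
        smoothed = nn.functional.conv1d(scores.view(1, 1, -1), kernel,
                                        padding=window // 2)
        return smoothed.view(-1)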

Figure 6.1: Block diagram of the proposed method.

Discussion

From Table 6.2 we can see that adding LSTM to the CNN improves the results, and further adding SVM improves them again. But in the case of sequence 11-04, it fails to identify the correct frames because this sequence differs from the other two. The model has not seen this type of activity performed so far from the camera and for such a short duration compared to the other two files.

Therefore, the network cannot correctly classify the frames of sequence 11-04 and classifies them as abnormal when a person is walking near a vehicle. We note that the accuracies for 11-03 and 08-02 jump to 96 and 95 percent, respectively, and the predicted initial and final frames are very close to the ground truth. We evaluated these models for one type of activity, but they can be extended to other activities as well.

SUMMARY AND DISCUSSION

Summary

  • Key Contributions

We track each point to form trajectories; temporal localization is then performed based on the direction changes of the trajectories. We form a histogram based on the orientation of neighboring interest points with respect to a given point, which is then concatenated with the HOG-HOF descriptor. We have compared these dimensionality reduction techniques and found that SSNMF gives better results because it takes the labels into account.

As deep learning is becoming popular, we have also tried different combinations of CNN and LSTM for abnormal event detection. Based on the experiments, we can suggest that for small sample sizes, the CNN can be used for feature extraction and an SVM for classification. We also suggested a modification of the interest point descriptor based on the interest point location.

Limitations

Using SVM produces good results, but the classifier is not well trained due to the small sample size.

Future Work


