REPRESENTATION LEARNING FOR ACTION RECOGNITION

He has been a guide in the true sense of the word and has brought out the best in me. Blue represents the ground truth, pink represents the output of the proposed framework and red represents the MDNet output.

Single layered representation

Since this process is highly dependent on the accuracy of the tracking, videos without a clear view of the action cannot be recognized. This has led to the development of representations which can describe an action in more detail using the spatio-temporal semantics of the various events involved in the action.

Hierarchical representation

Challenges in representing human actions

In surveillance videos, it is often difficult to get a clear recording of the action of interest due to the number of people in the field of view. It can be observed that for many of the actions such as punching and shaving the beard, there are large variations in the duration of the recorded actions.

Issues addressed in this thesis

In such cases, the foreground must be reliably segmented and the object in the foreground must be tracked. Additionally, any motion compensation method may inadvertently remove motion features belonging to the foreground object.

Organization of the thesis

Local feature extraction

Histogram of optical flow (HOF) features [10] form the following 96 dimensions, which describe the motion of a particle across successive frames within a small neighborhood (3×3 or 5×5). For example, Figure 2.3 shows the optical flow (HOF) features for walking and hand-waving actions.
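As a rough illustration of how a histogram of optical flow can be built, the sketch below quantizes flow vectors by orientation and weights each vote by the flow magnitude. The function name `hof_histogram`, the choice of 8 orientation bins, and the L1 normalization are assumptions made for this example; the text does not specify the binning used in [10].

```python
import math

def hof_histogram(flow, n_bins=8):
    """Orientation histogram of flow vectors, weighted by magnitude.

    flow: iterable of (dx, dy) displacements inside one small neighborhood.
    n_bins: number of orientation bins (hypothetical choice).
    """
    hist = [0.0] * n_bins
    for dx, dy in flow:
        mag = math.hypot(dx, dy)
        if mag == 0.0:
            continue  # static pixels cast no vote in this sketch
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        b = int(angle / (2 * math.pi) * n_bins) % n_bins
        hist[b] += mag
    total = sum(hist)
    # L1-normalize so that neighborhoods of different size are comparable
    return [h / total for h in hist] if total > 0 else hist

# Two vectors pointing roughly right, one pointing roughly up
example = hof_histogram([(1.0, 0.1), (2.0, 0.2), (-0.1, 1.0)])
```

Concatenating such histograms over the cells of a spatio-temporal grid yields a fixed-length motion descriptor.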

Aggregated descriptors of local features

Each action template in the action bank has an average spatial resolution of approximately 50×120 pixels and a temporal length of 40−50 frames. The number of spatio-temporal descriptors is determined by volumetric max-pooling over three levels of an octree.
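The pooling step can be sketched as follows, assuming that level l of the octree splits every axis of the response volume into 2^l equal parts, so that three levels give 1 + 8 + 64 = 73 pooled values per template. The function name `octree_max_pool` and the nested-list layout `volume[t][y][x]` are hypothetical; each axis is assumed to have at least 4 samples so that no cell is empty.

```python
def octree_max_pool(volume, levels=3):
    """Volumetric max-pooling over an octree.

    volume: nested list indexed as volume[t][y][x] (assumed layout).
    levels: number of octree levels; level l has 8**l cells.
    Returns one maximum per cell, coarsest level first.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    pooled = []
    for level in range(levels):
        parts = 2 ** level  # splits per axis at this level
        for a in range(parts):
            for b in range(parts):
                for c in range(parts):
                    t0, t1 = a * T // parts, (a + 1) * T // parts
                    y0, y1 = b * H // parts, (b + 1) * H // parts
                    x0, x1 = c * W // parts, (c + 1) * W // parts
                    pooled.append(max(
                        volume[t][y][x]
                        for t in range(t0, t1)
                        for y in range(y0, y1)
                        for x in range(x0, x1)
                    ))
    return pooled

# Toy 4x4x4 response volume with increasing values
toy = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)] for t in range(4)]
descriptor = octree_max_pool(toy)
```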

Deep features for action recognition

Spatio-temporal networks

These responses are then fused in the fully connected layers to generate a final video descriptor, as shown in Figure 2.7. The long short-term memory (LSTM) cell (Figure 2.8) was proposed in [38] to solve this issue by regulating the states and outputs of an RNN cell through control gates.

Multi-stream networks

The iDT features tracked over the convolutional feature maps of a two-stream network are aggregated using Fisher vectors. Such a combination gives state-of-the-art performance, although it is not far ahead of hand-crafted iDT features.

Generative networks

The encoder LSTM accepts an input sequence, and its states learn both the appearance and the dynamics of the sequence. These states are taken as the compact representation of the input sequence, and the decoder LSTM tries to reconstruct the input sequence from this compact representation.

Observations from the review

Many of the recognition approaches presented in the review such as [46] rely on human body representation for the extraction of local features. In Chapter 4, a sparse representation-based approach is presented to describe the different postures of the human body.

Summary

Exploring suitable features for sparsification

Action bank features, proposed by Sadanand and Corso [96], are useful for semantic representation of videos. From Figure 3.2, it can be seen that the action bank features follow a Laplacian distribution, with most of the feature values being zero.

Dictionary based representation

This shows that the action bank features are indeed suitable for sparsification using dictionaries, and in this work, we explore dictionaries that promote sparsity to achieve a discriminative representation of human actions. Online dictionary learning (ODL), proposed by Mairal et al., is an online counterpart of the K-SVD algorithm.

Sparsity based classification

It processes one example at a time, which gives it an online nature similar to that of stochastic approximation algorithms. Thus, if a test feature belongs to a certain class, it should ideally admit the sparsest representation with respect to that class dictionary and no other.
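The classification rule can be illustrated with a small sketch: code the test feature over each class dictionary and pick the class that reconstructs it with the fewest atoms. Greedy matching pursuit is used here only as a stand-in sparse coder, and the names `matching_pursuit` and `classify_by_sparsity`, the tolerance, and the toy dictionaries are assumptions for illustration, not the solver or scoring used in this work. Atoms are assumed to have unit norm.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(x, atoms, tol=1e-6, max_atoms=10):
    """Greedy sparse coding of x over unit-norm atoms.

    Returns (coeffs, residual_norm); coeffs maps atom index to weight.
    """
    residual = list(x)
    coeffs = {}
    for _ in range(max_atoms):
        if math.sqrt(dot(residual, residual)) <= tol:
            break
        # Select the atom most correlated with the current residual
        k = max(range(len(atoms)), key=lambda j: abs(dot(residual, atoms[j])))
        w = dot(residual, atoms[k])
        if abs(w) <= tol:
            break  # no atom explains the remaining residual
        coeffs[k] = coeffs.get(k, 0.0) + w
        residual = [r - w * a for r, a in zip(residual, atoms[k])]
    return coeffs, math.sqrt(dot(residual, residual))

def classify_by_sparsity(x, class_dicts, tol=1e-6):
    """Return the label whose dictionary gives the sparsest exact code."""
    def cost(label):
        coeffs, res = matching_pursuit(x, class_dicts[label], tol)
        # Dictionaries that cannot reconstruct x are ranked last, by residual
        n_atoms = len(coeffs) if res <= tol else math.inf
        return (n_atoms, res)
    return min(class_dicts, key=cost)

s = 1.0 / math.sqrt(2.0)
toy_dicts = {
    "wave": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "walk": [[0.0, 0.0, 1.0], [s, s, 0.0]],
}
label = classify_by_sparsity([2.0, 0.0, 0.0], toy_dicts)
```

In practice an l1-regularized solver would replace the greedy coder, but the decision rule stays the same.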

Experimental results

Datasets
Evaluation on different feature descriptors
Classification Performance vs. Dictionary Size
Visualization of dictionaries
Comparison with state-of-the-art

As reported in Table 3.1, the best classification performance of 22.08% was achieved for 3D-SIFT features with a dictionary size of 80. On the other hand, the dictionaries constructed for some of the UCF50 classes have strong similarities. Further, it is also clear from Table 3.3 that the proposed method demonstrates significantly higher classification accuracy than the CNN-based approaches.

Summary

Previous attempts at human tracking include algorithms that have been built considering a certain estimation of human body shape [85]. This motivates us to create a sparse representation of all human body poses in the form of a dictionary, which can later be used for on-line tracking. In [106], the structural sparsity of articulated human body motion is explored for pose localization using a probabilistic particle filter-like observation model.

Dictionary-CNN hybrid approach for human body representation

Dictionary construction

During training, the mean and standard deviation of the feature sparsity scores obtained from human patches are noted. The sparsity scores of human patches (in red) and non-human patches (in blue), obtained from dictionaries learned on HOG features and on CNN features (fc5 layer) of human patches, are shown in Figure 4.5. In the case of HOG features, whose scores are obtained from a dictionary trained only on human patches, the scores for human and non-human patches are almost identical.
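One way the recorded mean and standard deviation could be used at test time is sketched below: a candidate patch is accepted as human when its sparsity score falls within a band around the training mean. The band width `k`, the function names, and the toy scores are assumptions for illustration; the text does not state the exact decision rule.

```python
import statistics

def fit_sparsity_band(human_scores, k=2.0):
    """Band [mean - k*std, mean + k*std] from training human patches.

    k is a hypothetical tolerance, not a value taken from the thesis.
    """
    mean = statistics.mean(human_scores)
    std = statistics.pstdev(human_scores)
    return mean - k * std, mean + k * std

def is_human(score, band):
    """Accept a patch whose sparsity score lies inside the band."""
    low, high = band
    return low <= score <= high

# Toy sparsity scores of human patches seen during training
band = fit_sparsity_band([0.80, 0.82, 0.78, 0.81, 0.79])
accepted = is_human(0.80, band)
rejected = is_human(0.50, band)
```

Such a rule is only useful when human and non-human scores are well separated, which is why the choice of feature matters here.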

Dictionary based representation

Experimental results

Dataset

Since the degree of deformation varies under different circumstances, we chose platform and cliff diving from UCF50 [92] and diving from HMDB51. Since gymnastics events are performed indoors, with the camera placed closer to the athlete than in diving, we found that it is easier to track the athlete. All results presented later in this section are obtained on test videos based on the official test splits of the HMDB51 and UCF50 datasets.

Metrics for evaluation

Comparative analysis

It can be observed that the hybrid tracker consistently outperforms MDNet for all threshold values. The difference in accuracy arises because MDNet performs worse than the hybrid tracker in cases of complex deformation. The accuracy shown here includes all frames, regardless of the deformation of the target.
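A common way to compare trackers across threshold values is the success rate over bounding-box overlap thresholds. The sketch below assumes that the thresholds refer to intersection-over-union (IoU) between predicted and ground-truth boxes given as (x, y, width, height); the text does not confirm which metric is used, so this is an illustration only.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(predicted, ground_truth, threshold):
    """Fraction of frames whose IoU reaches the threshold."""
    hits = sum(1 for p, g in zip(predicted, ground_truth) if iou(p, g) >= threshold)
    return hits / len(ground_truth)

gt = [(0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 10, 10)]
curve = {t: success_rate(pred, gt, t) for t in (0.3, 0.5, 0.7)}
```

Plotting the success rate against the threshold gives one curve per tracker, which makes a consistent gap between two trackers easy to see.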

Summary

Universal attribute model (UAM)
Representation of actions as super action-vector (SAV)
Extraction of compact action-vectors using factor analysis
Unsupervised action-vector scoring

Since the goal is to find the action pdf that generates a clip, we need to fit the UAM parameters using the data in the clip. This adaptation requires updating the parameters of each of the mixture components in the UAM. In the first E-step, m and Σ are initialized with the UAM mean and covariance, respectively.
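The adaptation can be sketched for the simplest case of one-dimensional features and mean-only updates. This follows the standard relevance-MAP recipe for Gaussian mixture models; the relevance factor, the function names, and the restriction to means are assumptions made for this example and may differ from the procedure used in this thesis.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def map_adapt_means(clip, weights, means, variances, relevance=16.0):
    """Adapt the means of a 1-D mixture model to the features of one clip.

    relevance is a hypothetical prior weight: larger values keep the
    adapted means closer to the universal model.
    """
    n_comp = len(means)
    n = [0.0] * n_comp  # zeroth-order statistics (soft counts)
    f = [0.0] * n_comp  # first-order statistics
    for x in clip:
        lik = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        if total == 0.0:
            continue  # feature too far from every component
        for c in range(n_comp):
            post = lik[c] / total
            n[c] += post
            f[c] += post * x
    adapted = []
    for c in range(n_comp):
        alpha = n[c] / (n[c] + relevance)
        clip_mean = f[c] / n[c] if n[c] > 0 else means[c]
        adapted.append(alpha * clip_mean + (1 - alpha) * means[c])
    return adapted

# Concatenating the adapted means gives a clip-level supervector
supervector = map_adapt_means([1.0] * 20, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

Components that receive little data from the clip stay close to the universal model, while well-observed components move toward the clip statistics.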

Experimental Results

Datasets
Experimental Settings
Super action-vector vs. Action-vector
Action-vectors vs. other aggregation methods
Analysis on untrimmed videos

Finally, Table 5.3 compares the performance of action vectors with other aggregation methods such as Fisher vector and VLAD. The final action vector is 600-dimensional and combines information across descriptors to improve on the classification accuracy of individual action vectors. Furthermore, it can be observed that unsupervised cosine scoring of action vectors performs on par with supervised classification methods that use descriptors based on convolutional neural networks combined with Fisher vectors.
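Cosine scoring needs no trained classifier: a test action vector is compared with one reference vector per action, and the most similar action is selected. In the sketch below the reference is assumed to be the mean action vector of each class; the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den > 0 else 0.0

def cosine_score_classify(test_vec, class_refs):
    """Return the label whose reference vector is most similar to test_vec."""
    return max(class_refs, key=lambda label: cosine(test_vec, class_refs[label]))

refs = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}
predicted = cosine_score_classify([0.9, 0.1], refs)
```

Because the score depends only on the angle between vectors, it is insensitive to the overall scale of the action vectors.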

Summary

Linear Discriminant Analysis (LDA)

In the case of more than two classes, the goal is to find a subspace that contains all the class variability. If each of the m classes has a mean µ_i and the same covariance Σ, then the between-class variability is defined as the sample covariance of the class means, Σ_b. This means that when y is an eigenvector of the projection matrix A = Σ^(-1) Σ_b, the separation is equal to the corresponding eigenvalue.
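The eigenvalue statement can be checked in a few lines. The derivation below uses the standard LDA definitions; the symbols m (number of classes), µ (mean of the class means), S(y) for the separation along a direction y, and λ for the eigenvalue are assumed notation for this sketch.

```latex
\begin{align}
\Sigma_b &= \frac{1}{m}\sum_{i=1}^{m} (\mu_i - \mu)(\mu_i - \mu)^{\top},
  \qquad \mu = \frac{1}{m}\sum_{i=1}^{m} \mu_i, \\
S(y) &= \frac{y^{\top}\Sigma_b\, y}{y^{\top}\Sigma\, y}, \\
\Sigma^{-1}\Sigma_b\, y = \lambda y
  \;&\Longrightarrow\; \Sigma_b\, y = \lambda\,\Sigma\, y
  \;\Longrightarrow\; S(y) = \frac{\lambda\, y^{\top}\Sigma\, y}{y^{\top}\Sigma\, y} = \lambda .
\end{align}
```

Projecting onto the eigenvectors with the largest eigenvalues therefore keeps the directions with the largest class separation.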

Probabilistic Linear Discriminant Analysis (PLDA)

For a given action represented by a latent vector ŷ, the distribution of the action vector is assumed to be p(w|ŷ). Hypothesis H_s: a pair of action vectors w1 and w2 is generated by the distribution p(w|ŷ), which involves only a single latent vector ŷ representing a given action. For the hypothesis that the action vectors belong to separate actions, the product of the marginal probabilities p(w1) and p(w2) is used.
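The two hypotheses are usually compared through a log-likelihood ratio. The expression below is the standard PLDA verification score, written under the assumption of a prior p(ŷ) over the latent vector; the label H_d for the different-action hypothesis is notation introduced here for illustration.

```latex
\begin{equation}
\mathrm{score}(w_1, w_2)
  = \log \frac{p(w_1, w_2 \mid H_s)}{p(w_1, w_2 \mid H_d)}
  = \log \frac{\displaystyle\int p(w_1 \mid \hat{y})\, p(w_2 \mid \hat{y})\, p(\hat{y})\, d\hat{y}}
              {p(w_1)\, p(w_2)},
\qquad
p(w) = \int p(w \mid \hat{y})\, p(\hat{y})\, d\hat{y} .
\end{equation}
```

A positive score favors the hypothesis that both action vectors come from the same action.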

Non-linear decomposition using Siamese networks

Siamese network configuration

The NN3 configuration achieves the best results across all three feature descriptors, as shown in Table 6.1, and it is used to report the results in subsequent experiments. It has been found empirically that increasing the number of layers or neurons causes significant degradation of the classification performance. As the number of training examples per action in the HMDB51 dataset is only about 70, it is observed that deeper networks with more hidden layers cannot be trained effectively due to overfitting.

Experimental results

Datasets
Experimental settings
Performance of linear embedding techniques
Intermediate fusion techniques using PLDA
Linear embedding of action-vectors vs. state-of-the-art
Comparison of linear and non-linear embedding techniques
Non-linear embedding of action-vectors vs. state-of-the-art

In Table 6.3, the analysis of LDA and PLDA over action vectors is presented on the HMDB51 dataset. It can be observed that Siamese network based embedding of action vectors outperforms other linear and non-linear embedding techniques. Even on this challenging dataset, both linear and non-linear embeddings of action vectors work well, as can be seen in Table 6.7.

Summary

We show that action vectors outperform existing state-of-the-art feature descriptors while exploiting a wealth of existing video data containing human actions to effectively represent snatch thefts. Such an approach was proposed for crowded scenarios in [30] by considering three key characteristics: 1) distance between objects, 2) velocity between objects, and 3) area of the objects. A snatch was defined as a sequence of the thief approaching the victim (following or confronting) and the thief grabbing the object (attack).

Baseline: Rule-based snatch theft monitoring

Stage I: Detection and tracking of humans in crowded scenarios

Similarly, the similarity to normal activities makes the use of completely unsupervised methods inappropriate, since these methods learn normal behavior and label deviations that are "far" from it as abnormal [82]. This motivates the use of representations such as action vectors, which can increase the difference between snatch thefts and normal actions.

Stage II: Victim and thief identification

The second condition checks whether the victim is dragged during the encounter (Scenario 3 of Table 7.1). Here we assume that a thief on a motorcycle has a higher speed than the victim before the encounter (Scenario 1 of Table 7.1). We also use the YOLO detector [93] to locate motorcycles in the vicinity of the encounter to establish the thief's identity with more confidence.

Stage III: Snatch Theft Verification

Since frame-based detection during an encounter may not always be robust due to low resolution or occlusion, each encounter is examined for a few frames before and after the encounter for 1) one person moving faster than the other and/or 2) the directions of i and j being the same. If any person is detected to be slowing down before the encounter and suddenly moving during the encounter, it is highly likely that a theft has occurred, and therefore the possibleSnatching flag is set. Finally, if possibleSnatching is set as a result of any of the above detected behaviors, a person speed comparison is performed prior to the encounter to establish the identity of the potential victim and the thief.
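The verification logic can be sketched as a small rule set. The `Track` fields, the thresholds `speed_ratio`, `angle_tol` and `surge_ratio`, and the reduction of "slowing down, then suddenly moving" to a single speed comparison are all simplifying assumptions for illustration; only the possibleSnatching flag and the overall flow come from the description above.

```python
from dataclasses import dataclass

@dataclass
class Track:
    speed_before: float     # mean speed over a few frames before the encounter
    speed_during: float     # mean speed during the encounter
    speed_after: float      # mean speed over a few frames after the encounter
    direction_after: float  # heading in degrees after the encounter

def angle_diff(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def verify_snatch(i, j, speed_ratio=1.5, angle_tol=30.0, surge_ratio=2.0):
    """Return (possible_snatching, thief, victim) for persons i and j."""
    fast, slow = max(i.speed_after, j.speed_after), min(i.speed_after, j.speed_after)
    one_faster = fast >= speed_ratio * slow and fast > 0
    same_direction = angle_diff(i.direction_after, j.direction_after) <= angle_tol
    sudden_move = any(
        p.speed_during >= surge_ratio * p.speed_before and p.speed_during > 0
        for p in (i, j)
    )
    possible_snatching = one_faster or same_direction or sudden_move
    if not possible_snatching:
        return False, None, None
    # Assumed rule: the faster person before the encounter is the thief
    if i.speed_before >= j.speed_before:
        return True, "i", "j"
    return True, "j", "i"

result = verify_snatch(Track(5.0, 6.0, 8.0, 90.0), Track(1.0, 1.2, 1.0, 270.0))
```

Keeping the thresholds as parameters makes it easy to tune the rules for a given camera setup.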

Unsupervised modeling of snatch theft actions

Experimental results

Snatch 1.0 : Dataset Description

Effect of different feature descriptors and UAM mixtures

From Table 7.3, it can be observed that the proposed framework misses only 1 snatch theft, compared to 5, 20 and 24 for the other features. We present some of the detected theft scenarios in Figure 7.5 according to the scenarios explained in Table 7.1. It can be observed that action vectors recognize a diverse set of snatch thefts that have little similarity to each other.

Summary

Finally, we demonstrated the effectiveness of action vectors over existing state-of-the-art feature descriptors. Siamese networks have also been investigated for non-linear embedding of action vectors to mitigate stance variations. A method was also presented to address inter-action similarity using probabilistic linear discriminant embedding of action vectors, in order to classify visually similar actions.

Directions for Further Research

In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 749–758, June 2015. In Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–6, August 2015. In Proceedings of International Conference on Articulated Motion and Deformable Objects (AMDO), pages 385–394, Berlin, Heidelberg, 2006.