It is hereby certified that TOH YUE XIANG (ID No: 18ACB01082) has completed this final year project entitled "COLLABORATIVE BATCH LEARNING FOR CRIME SCENE DETECTION" under the supervision of Ts Dr Tan Hung Khoon (Supervisor) of the Department of Science. I understand that the University will upload a soft copy of my final year project in PDF format to the UTAR Institutional Repository, which may be made accessible to the UTAR community and the public. I declare that this report is my own work, except as cited in the references.
In our project, we trained our model on normal and crime videos from the UCF-Crime dataset. We added a 1-D dependency capture module on top of the feature extractor to make the extracted features more descriptive and better suited to the dataset we were using.
Project Scope and Objectives
The main objective of our project was to develop an improved anomaly detection system based on the MIL model that achieves higher accuracy. We aimed to increase this accuracy with a deeper feature extraction network and an attention module. The attention module was built on top of the I3D network to assign a weight distribution to the features extracted by the I3D network.
The features output from the attention module were refined at the region level, focusing on specific parts of the overall feature rather than the feature as a whole. The input video was fed into the crime scene detection system, which consists of a feature extraction network, an attention module and an MIL model.
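To make this pipeline concrete, a minimal sketch is given below; the 32-segment split, the 1024-dimensional features and the placeholder module bodies are illustrative assumptions rather than the actual implementation.

```python
# Illustrative sketch of the detection pipeline; shapes and placeholder
# bodies are assumptions, not the actual implementation.
import numpy as np

def extract_i3d_features(video_segments):
    """Placeholder for the I3D feature extractor: one 1024-D vector per segment."""
    return np.random.rand(len(video_segments), 1024).astype(np.float32)

def attention_module(features):
    """Placeholder for the attention module: re-weights the extracted features."""
    weights = np.ones_like(features)          # the real module learns these weights
    return features * weights

def mil_model(features):
    """Placeholder for the MIL network: one anomaly score per segment."""
    return 1.0 / (1.0 + np.exp(-features.mean(axis=1)))

segments = [f"segment_{i}" for i in range(32)]   # a video split into 32 segments (assumed)
scores = mil_model(attention_module(extract_i3d_features(segments)))
print(scores.shape)                              # (32,) -- one anomaly score per segment
```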
Report Organization
Action recognition
This proved that the temporal dimension should not be neglected in video analysis tasks. However, the C3D network proposed in that article was not deep, as it consisted of only 8 layers. The I3D network was similar to the C3D network, but it went further and extracted features from both a spatial RGB stream and a temporal stream, which captures the motion of objects between successive frames along the vertical and horizontal axes, as shown in Figure 2.1.3.
Although a single RGB stream network is sufficient for action recognition, as shown in the work of [7], including the temporal stream improved accuracy even further. As shown in Figure 2.1.4, the I3D model proposed by [10] used a pretrained 2D ConvNet as the feature extractor, then inflated the features to 3D with the global fusion layer before feeding them into the pretrained C3D model.
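The inflation step can be illustrated with a short sketch of bootstrapping a 3D filter from a pretrained 2D filter, as done in the original I3D work; the kernel shape and temporal size below are illustrative.

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, temporal_size):
    """Inflate a pretrained 2D conv kernel (kH, kW, C_in, C_out) to 3D by
    repeating it along the temporal axis and rescaling, so that the 3D filter
    initially gives the same response on a static (repeated-frame) video."""
    kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], temporal_size, axis=0)
    return kernel_3d / temporal_size             # shape: (T, kH, kW, C_in, C_out)

k2d = np.random.rand(7, 7, 3, 64)                # e.g. a pretrained first-layer kernel
k3d = inflate_2d_kernel(k2d, temporal_size=7)
print(k3d.shape)                                 # (7, 7, 7, 3, 64)
```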
Anomaly detection
Therefore, their proposed network solved both the problem of not having fully annotated training data and the problems caused by training only on normal videos. Another interesting aspect of their model was the use of a ranking loss instead of the usual classification loss to train the network. Their network outperformed baseline models such as a binary classifier, a dictionary-based approach [5] and a deep learning autoencoder [4], with significantly higher accuracy and a lower false alarm rate.
One reason may be the features they used to train their MIL model. In their proposed approach, they separated the feature extraction module into two branches: an interactive dynamic branch and a spatiotemporal branch.
Attention module
Although the implementation of the 1-D dependency capture attention module improves anomaly detection performance, the attention module has an underlying problem. Their proposed positional embedding method, tubelet embedding (Figure 2.3.3), showed an improvement in accuracy over the standard positional embedding method, uniform frame embedding (Figure 2.3.2). The tubelet embedding method extracts non-overlapping spatio-temporal fragments from the input and then linearly embeds each of them into a feature vector.
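A minimal sketch of tubelet embedding is shown below, assuming a Conv3D whose stride equals its kernel size so that the extracted patches do not overlap; the tubelet size, embedding dimension and clip shape are assumptions.

```python
import tensorflow as tf

# Non-overlapping tubelet embedding: a Conv3D whose stride equals its kernel
# size cuts the clip into t x h x w tubelets and linearly projects each one.
tubelet = (2, 16, 16)                            # (t, h, w) tubelet size (assumed)
embed = tf.keras.layers.Conv3D(filters=768, kernel_size=tubelet, strides=tubelet)

clip = tf.random.normal((1, 16, 224, 224, 3))    # (batch, frames, H, W, RGB)
tokens = embed(clip)                             # (1, 8, 14, 14, 768)
tokens = tf.reshape(tokens, (1, -1, 768))        # flatten into a token sequence
print(tokens.shape)                              # (1, 1568, 768)
```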
The first was spatiotemporal attention, where all tokens were fed into the transformer encoder to model pairwise interactions, also known as Multi-Headed Self-Attention (MSA). However, the complexity of this attention module was quadratic in the number of tokens. The second model, the factorised encoder, required fewer floating point operations (FLOPs) than space-time attention, but used more transformer layers.
The third model was factorised self-attention, shown in Figure 2.3.5, where the number of transformer layers was the same as in spatiotemporal attention. By factorising the self-attention computation, this model was more efficient than spatiotemporal attention with the same number of transformer layers, and it achieved the same computational complexity as the factorised encoder but with fewer transformer layers. The last model was factorised dot-product attention, where the attention for each token was computed separately over the spatial and temporal dimensions using different multi-head dot-product attention heads, as shown in Figure 2.3.6.
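A minimal sketch of the factorised idea is given below, where attention is applied over the spatial axis first and then the temporal axis instead of over all tokens jointly; the head count, key dimension and token shapes are assumptions.

```python
import tensorflow as tf

def factorised_self_attention(tokens, num_heads=8, key_dim=64):
    """tokens: (batch, T, S, d) -- attend over space first, then over time."""
    b, t, s, d = tokens.shape
    spatial_mha = tf.keras.layers.MultiHeadAttention(num_heads, key_dim)
    temporal_mha = tf.keras.layers.MultiHeadAttention(num_heads, key_dim)

    # Spatial attention: the S tokens of each frame attend to each other.
    x = tf.reshape(tokens, (b * t, s, d))
    x = spatial_mha(x, x)
    x = tf.reshape(x, (b, t, s, d))

    # Temporal attention: each spatial location attends across its T frames.
    x = tf.transpose(x, (0, 2, 1, 3))            # (b, S, T, d)
    x = tf.reshape(x, (b * s, t, d))
    x = temporal_mha(x, x)
    x = tf.reshape(x, (b, s, t, d))
    return tf.transpose(x, (0, 2, 1, 3))         # back to (b, T, S, d)

out = factorised_self_attention(tf.random.normal((1, 8, 196, 768)))
print(out.shape)                                  # (1, 8, 196, 768)
```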
Among all the attention models, spatiotemporal attention had the best accuracy but the longest run time. The factorised encoder had only slightly lower accuracy but the shortest run time among the attention models. In our work, we replaced the C3D feature extractor with a deeper network, the I3D network, and added an attention module on top of the feature extractor to assign different weights to different extracted features, achieving region-level and temporal-level feature extraction.
System Overview
- Anomaly Detection
- Ranking loss function
For example, in a robbery scene we wanted our network to focus on the robbers and the victims rather than the surroundings, as shown in Figure 3.1.2. In this case, more weight should be given to the features representing the red box region, as that is the important part for describing the event. We simply passed the 1024-dimensional 1-D feature vector extracted from the I3D model through 3 different 1-D convolution layers to obtain 3 new 1-D feature vectors of the same dimension, f1, f2 and f3.
The higher the correlation between the i-th and j-th positions, the larger the value of A(i, j) in the affinity matrix. The attention weights were assigned by concatenating the original feature vector with the values produced by β(S · f3). This 1-D dependency self-attention module assigned a different weight distribution, so that the values at all positions inside the feature vector became correlated with one another.
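A minimal sketch of this 1-D dependency capture attention module is given below, assuming the 1024-D feature is treated as a length-1024 sequence with one channel, that S is the softmax-normalised affinity matrix, and that β is kept as a fixed scalar in place of a learnable one.

```python
import tensorflow as tf

def dependency_capture_1d(features, beta=1.0):
    """features: (batch, 1024, 1) -- the 1024-D feature treated as a 1-D
    sequence with one channel. beta is kept as a fixed scalar here."""
    conv = lambda: tf.keras.layers.Conv1D(filters=1, kernel_size=1)
    f1, f2, f3 = conv()(features), conv()(features), conv()(features)

    # Affinity matrix: A[i, j] grows with the correlation between positions i and j.
    affinity = tf.matmul(f1, f2, transpose_b=True)            # (batch, 1024, 1024)
    s = tf.nn.softmax(affinity, axis=-1)                      # assumed normalisation

    # Re-weighted values, scaled by beta and concatenated with the original feature.
    context = beta * tf.matmul(s, f3)                         # (batch, 1024, 1)
    return tf.concat([features, context], axis=-1)            # (batch, 1024, 2)

out = dependency_capture_1d(tf.random.normal((4, 1024, 1)))
print(out.shape)                                              # (4, 1024, 2)
```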
We achieved region-level feature extraction by capturing global dependencies that connect distant positions in the feature vector. Providing the network with both classes to learn from allowed our model to learn more variations of human patterns. As a result, it was easier for the model to rank how likely each segment was to be an anomalous segment rather than a normal one, instead of classifying it strictly as 1 or 0.
However, it was difficult to classify videos in a strict 1/0 manner, as a classification model would, because there is no clear boundary between normal and abnormal videos. As long as the maximum score of the positive bag was enforced to be higher than the maximum score of the negative bag, our loss function was not violated. To avoid large jumps in score between contiguous segments, we also needed temporal smoothness and sparsity terms in our loss function.
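A minimal sketch of this ranking loss with its smoothness and sparsity terms is given below, assuming the hinge formulation of [6] with a margin of 1 and per-segment scores for one anomalous and one normal video.

```python
import tensorflow as tf

def mil_ranking_loss(anomaly_scores, normal_scores, lambda1=8e-5, lambda2=8e-5):
    """Hinge ranking loss over the per-segment scores of one anomalous and one
    normal video; lambda1 and lambda2 weight the smoothness and sparsity terms."""
    # The best-scoring segment of the anomalous bag should outrank the
    # best-scoring segment of the normal bag by a margin of 1.
    ranking = tf.maximum(0.0, 1.0 - tf.reduce_max(anomaly_scores)
                                  + tf.reduce_max(normal_scores))
    # Temporal smoothness: adjacent segments should receive similar scores.
    smoothness = tf.reduce_sum(tf.square(anomaly_scores[1:] - anomaly_scores[:-1]))
    # Sparsity: only a few segments of an anomalous video should score highly.
    sparsity = tf.reduce_sum(anomaly_scores)
    return ranking + lambda1 * smoothness + lambda2 * sparsity

loss = mil_ranking_loss(tf.random.uniform([32]), tf.random.uniform([32]))
```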
UCF Crime dataset
Implementation details
For the second type of module, shown in Figure 4.2.2, we forwarded the extracted features into 2 different 1-D convolution layers to compute the affinity matrix. The settings of the 1-D convolution layers and batch normalization were the same as in the 1-D dependency capture module. We then flattened the affinity matrix into a 1-D feature vector of 1024 features to match the input dimension of the first layer of the MIL model.
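A minimal sketch of this variant is given below, assuming the affinity is computed across a video's 32 segments so that the flattened 32 × 32 matrix has exactly 1024 entries; the convolution width is also an assumption, and batch normalization is omitted for brevity.

```python
import tensorflow as tf

def affinity_flatten_module(segment_features):
    """segment_features: (batch, 32, 1024) -- one I3D feature per segment.
    The 32-segment count is assumed so that the flattened 32 x 32 affinity
    matrix matches the 1024 inputs expected by the MIL model."""
    conv_a = tf.keras.layers.Conv1D(256, kernel_size=1)       # width is an assumption
    conv_b = tf.keras.layers.Conv1D(256, kernel_size=1)
    fa, fb = conv_a(segment_features), conv_b(segment_features)
    affinity = tf.matmul(fa, fb, transpose_b=True)             # (batch, 32, 32)
    return tf.keras.layers.Flatten()(affinity)                 # (batch, 1024)

print(affinity_flatten_module(tf.random.normal((2, 32, 1024))).shape)  # (2, 1024)
```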
For the last module, shown in Figure 4.2.3, we simply passed the features through a 1-D convolution layer to increase their complexity. The settings of the 1-D convolution layer and batch normalization were again the same as in the 1-D dependency capture attention module. The MIL model consisted of 3 fully connected (FC) layers, with a 60% dropout layer after each FC layer.
We used only ReLU activation for layer 1 and layer 3, and sigmoid activation for the last layer to predict the instance anomaly score. For the temporal smoothness and sparsity constraint hyperparameters, we used 8 × 10⁻⁵, the same setting as [6]. The MIL model was trained with batches of 60 randomly selected video clips, 30 each from the anomaly class and the normal class, per training iteration.
The gradients of the computation graph for each forward pass were computed with the TensorFlow backend. A score was calculated for each instance, that is, for each temporal segment of the video. The loss for each batch was then obtained by applying the ranking loss function to these scores.
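A minimal sketch of the scoring network and one training step is shown below, reusing the mil_ranking_loss sketch above; the layer widths and the L2 setting follow the test and evaluation section, while the single sigmoid output unit and the Adagrad optimizer are assumptions, and a full batch would loop over 30 anomalous/normal video pairs rather than the single pair shown here.

```python
import tensorflow as tf

# Scoring network: 3 FC layers with 60% dropout and L2 = 0.001 (see the test
# and evaluation section); the single sigmoid output unit is an assumption.
reg = tf.keras.regularizers.l2(0.001)
mil_net = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu", kernel_regularizer=reg),
    tf.keras.layers.Dropout(0.6),
    tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=reg),
    tf.keras.layers.Dropout(0.6),
    tf.keras.layers.Dense(1, activation="sigmoid", kernel_regularizer=reg),
])
optimizer = tf.keras.optimizers.Adagrad()        # optimizer choice is an assumption

def train_step(anomaly_feats, normal_feats):
    """Each argument: (segments, 1024) features of one anomalous / normal video."""
    with tf.GradientTape() as tape:
        a_scores = tf.squeeze(mil_net(anomaly_feats, training=True), axis=-1)
        n_scores = tf.squeeze(mil_net(normal_feats, training=True), axis=-1)
        loss = mil_ranking_loss(a_scores, n_scores) + tf.add_n(mil_net.losses)
    grads = tape.gradient(loss, mil_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, mil_net.trainable_variables))
    return loss
```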
Test and evaluation
To further prevent overfitting, we also added L2 regularization with a weight of 0.001 to all our FC layers. The first FC layer had 1024 neurons, followed by 128 neurons in the second layer and 8 neurons in the final FC layer. We then calculated the AUC score from the ground truth and the predicted scores and plotted the ROC curve for our model.
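A minimal sketch of this evaluation step with scikit-learn is shown below; the ground truth and score arrays are random placeholders for illustration, assuming frame-level labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Frame-level ground truth (1 = anomalous frame) and predicted anomaly scores;
# random placeholders are used here purely for illustration.
y_true = np.random.randint(0, 2, size=5000)
y_score = np.random.rand(5000)

auc = roc_auc_score(y_true, y_score)
fpr, tpr, _ = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```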
Comparison with the MIL model using the C3D feature extractor
Error analysis
- Successful case
- Failure case
The re-implementation of the MIL model proposed by [6] showed that a weakly annotated video dataset, together with training on both abnormal and normal videos, can produce a well-performing crime scene detection system. To improve the performance, we hypothesized that extracting 3D features with a deeper network would increase the performance of the network in our project. In addition, we hypothesized that providing the MIL model with more descriptive and refined features to learn from could improve the performance further.
Feature extraction with a deeper network and adding an attention module on top of the extracted features yielded a higher AUC. Training the model with balanced data may improve the accuracy across all action classes. For future work, we could experiment with different types of attention modules, such as a spatial attention module.
S. Wu, B. E. Moore and M. Shah, "Chaotic Invariants of Lagrangian Particle Trajectories for Anomaly Detection in Crowded Scenes," in CVPR, 2010.
L. Kratz and K. Nishino, "Anomaly Detection in Extremely Crowded Scenes Using Spatio-Temporal Motion Pattern Models," in CVPR, 2009.
Y. Zhu, N. M. Nayak and A. K. Roy-Chowdhury, "Context-Aware Activity Recognition and Anomaly Detection in Video," in IEEE Journal of Selected Topics in Signal Processing, 2013.