V3Trans-Crowd: A Video-based Visual Transformer for Crowd Management Monitoring

Item Type Conference Paper

Authors Zuo, Yuqi; Hamrouni, Aymen; Ghazzai, Hakim; Massoud, Yehia Mahmoud

Citation Zuo, Y., Hamrouni, A., Ghazzai, H., & Massoud, Y. (2023). V3Trans-Crowd: A Video-based Visual Transformer for Crowd Management Monitoring. 2023 IEEE International Conference on Smart Mobility (SM). https://doi.org/10.1109/sm57895.2023.10112514

Eprint version Post-print

DOI 10.1109/sm57895.2023.10112514

Publisher IEEE

Rights This is an accepted manuscript version of a paper before final publisher editing and formatting. Archived with thanks to IEEE.

Download date 2023-11-29 20:30:52

Link to Item http://hdl.handle.net/10754/691568


V3Trans-Crowd: A Video-based Visual Transformer for Crowd Management Monitoring

Yuqi Zuo1,2, Aymen Hamrouni1, Hakim Ghazzai1, and Yehia Massoud1

1King Abdullah University of Science and Technology (KAUST), Thuwal, Makkah Province, Saudi Arabia

2University of Electronic Science and Technology of China, Chengdu, Sichuan, China
Email: {yuqi.zuo, aymen.hamrouni, hakim.ghazzai, yehia.massoud}@kaust.edu.sa

Abstract—Autonomously monitoring and analyzing the behavior of crowds is an open research topic in the transportation field. The real-time identification, tracking, and prediction of crowd behavior is primordial to ensure smooth crowd management operations in many public areas such as public transport stations and streets. Two main challenges arise. First, the complexity brought by the interaction and fusion from the individual to the group level needs to be assessed and analyzed. Second, these behaviors must be classified, which is useful for identifying danger and avoiding any undesired consequences. In this paper, we propose a transformer-based crowd management monitoring framework, called V3Trans-Crowd, that captures information from video data and extracts meaningful output to categorize the behavior of the crowd. We provide an improved hierarchical transformer for multi-modal tasks. Inspired by 3D visual transformers, our proposed 3D visual model, V3Trans-Crowd, achieves strong accuracy compared to state-of-the-art methods, all tested on the standard Crowd-11 dataset.

Index Terms—Crowd management, crowd behavior analysis, computer vision, visual transformer

I. INTRODUCTION

In today's society, crowd monitoring and management is a critical service to ensure the safety of the crowd, especially in large public open spaces [1]. History has shown that an uncontrolled crowd movement can easily lead to catastrophic disasters. In fact, in 2015, more than 2,177 people died and around 934 others were injured during the annual Hajj in Mecca, Saudi Arabia. In 2021, the city of Houston, TX, witnessed a crowd crush during the Astroworld Festival that caused the death of 8 people and the injury of 300 attendees. These tragic events bring about the pressing need to survey human crowd behavior and establish an efficient crowd analysis system that detects dangerous situations and reports them in real-time to the designated authorities so they can take action and prevent any undesired consequences.

Crowd management systems typically rely on CCTV cameras. In fact, they require the deployment of many of these recording devices to cover an extended geographical area and obtain a multi-view perspective. With their abundant incorporation in indoor and outdoor places, CCTV cameras are increasingly relied on for crowd analysis to better monitor the crowd, assess the situation, and predict any abnormal behavior.

In previous work, many studies have focused on different tasks in the field of crowd management and analysis, including crowd counting [2], [3], crowd localization and density estimation [4], crowd behavior analysis with transfer learning [5], and flow motion analysis [6]. In [2], the authors introduced a dilated convolution-based end-to-end crowd counting model that uses a genetic algorithm to select the dilation rates. The authors of [7] investigated the problem of visual-audio fusion and proposed a new convolution-based audio-video data fusion architecture for crowd abnormal behavior analysis. The authors of [3] proposed an optical flow-based crowd counting model with very high accuracy. In [8], the authors proposed a Swin Transformer-based crowd counting model with a dilated architecture. The authors of [5] focused on transfer learning for crowd behavior analysis.

As can be noticed, these approaches can be divided into two categories. The first category is based on Convolutional Neural Networks (CNNs), which have a smaller training overhead and are easier to deploy. The second category uses accurate models based on physical motion, such as the optical flow approach of [3], which achieve relatively high accuracy but are also computationally complex. In the literature, the core models for state-of-the-art crowd scene understanding are convolutional networks, which achieve great performance in extracting features from videos. However, as these models have a comparatively low accuracy compared with optical-flow and physics-based models, they are not efficient enough to be adopted for real-time and complex crowd monitoring services.

With the rise of transformers for visual understanding, more and more artificial intelligence studies have focused on utilizing them in computer vision applications. Unlike convolutional neural networks, transformers were first proposed to deal with language understanding tasks. Thanks to the adoption of sequence embeddings and attention mechanisms, transformers have the advantage of modeling long dependencies between embedded elements (tokens) and of processing them in parallel, which ensures fast and efficient training. Through ViLT [9] and VideoBERT [10], they have shown great adaptability in multi-modal tasks (image, text, audio, radar, and even medical problems such as vessel segmentation).

Inspired by these multi-modal adoptions and by 3D Transformers [11], [12], we propose in this paper V3Trans-Crowd, a 3D vision crowd management Transformer framework based on spatio-temporal fusion. Unlike previous crowd monitoring approaches, V3Trans-Crowd does not rely on CNNs

Fig. 1: V3Trans-Crowd Architecture for processing video.

for behavior prediction and situation assessment but rather on a tweaked version of Vision Transformers (ViT) adapted for videos. ViT exploits multi-modal learning to process and fuse different features from audio, text, etc. ViT was introduced in [13] and has shown a great ability to achieve very high performance in many visual tasks including action recognition, image understanding, and image segmentation. Since then, it has been adapted for many applications such as healthcare and computer vision embedding [14]. In this paper, we introduce a 3D Video Transformer inspired by the Video Swin Transformer [11] and ViViT [12] to identify the behavior of the crowd. We compare our framework with state-of-the-art approaches such as a fine-tuned Two-Stream Inflated 3D ConvNet (I3D) architecture pre-trained on the ImageNet dataset and a fine-tuned 3D Convolutional Networks (C3D) architecture pre-trained on the Sports-1M dataset. Simulation results show that our proposed V3Trans-Crowd 3D visual Transformer, pre-trained on the ImageNet dataset, outperforms the benchmark approaches in at least 80% of the crowd behavior categories.

II. METHODOLOGY AND ARCHITECTURE

In this section, we introduce the classic ViT architecture and explain the basic concepts of the transformer encoder. Afterward, we present our V3Trans-Crowd framework and explain the transformations needed to convert the crowd management videos into a digestible format for the transformer. The goal is to adapt the ViT model to make it suitable for video inputs, as Vision Transformers were initially developed to process 2D images.

A. System Model

Every image can be represented as an array with dimensions $(H, W, C)$, where $H$, $W$, and $C$ represent the height, the width, and the RGB image channels, respectively. This array is divided into $N$ sub-images, referred to as patches, with dimensions $(h, w, C)$. Each patch $x_i$, $i \in \{1,\cdots,N\}$, goes through a linear transformation and gets projected into a 1D token $w_i \in \mathbb{R}^d$, with $d = h \times w \times C$.

Because the Transformer self-attention operations are permutation invariant, a learned positional embedding $p_i \in \mathbb{R}^d$ is added to each token $w_i$ to conserve the order of the patches and retain such positional information. The resultant sequence of these patches is a new vector $z \in \mathbb{R}^{N \times d}$, where each of its elements $z_i$ can be expressed as follows:

$$z_i = \begin{cases} \mathrm{cls}, & \text{if } i = 0,\\ w_i + p_i, & \text{if } 1 \leq i \leq N. \end{cases} \qquad (1)$$

The value $z_0 = \mathrm{cls}$ is an added special classification token prepended at the beginning of the sequence. This token contains no information itself but serves for classifying the input sequence, because its corresponding value on the output side retains which class the input belongs to. The vector $z$ containing all these tokens is then passed through a Transformer encoder consisting of a sequence of $L$ transformer layers.
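For illustration, the minimal PyTorch sketch below reproduces this tokenization step (patch extraction, linear projection, prepended cls token, and learned positional embeddings); the image size, patch size, and token dimensionality are illustrative assumptions rather than values taken from our implementation.

```python
# Hypothetical sketch of the ViT-style tokenization described above.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, H=224, W=224, C=3, h=16, w=16, d=768):
        super().__init__()
        self.N = (H // h) * (W // w)                            # number of patches
        self.proj = nn.Linear(h * w * C, d)                     # w_i = linear projection of a patch
        self.cls = nn.Parameter(torch.zeros(1, 1, d))           # z_0 = cls token
        self.pos = nn.Parameter(torch.zeros(1, self.N + 1, d))  # learned positions p_i
        self.h, self.w = h, w

    def forward(self, img):                                     # img: (B, C, H, W)
        B = img.shape[0]
        # split the image into non-overlapping (h, w, C) patches and flatten each
        patches = img.unfold(2, self.h, self.h).unfold(3, self.w, self.w)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, self.N, -1)
        tokens = self.proj(patches)                             # (B, N, d)
        cls = self.cls.expand(B, -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos          # prepend cls, add positions
        return z                                                # (B, N + 1, d)
```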

As Fig. 1 shows, the transformer encoder is composed of three components: Multi-Head Self-Attention (MHSA), a Feed-Forward Network (FFN), and Layer Normalization (LN). We implement the FFN as a Multi-Layer Perceptron (MLP) block consisting of two linear projections separated by a GELU non-linearity, with the token dimensionality preserved at the output. In this way, each layer $l \in \{1,\cdots,L\}$ comprises MHSA layers,


Fig. 2: Detailed view of the MHSA block for spatio-temporal information processing and fusion.

LN layers, and MLP layers. Furthermore, if $z^l$ is the token vector $z$ at the current layer $l$, then $z^{l+1}$ can be written as follows:

$$y^l = \mathrm{MHSA}\left(\mathrm{LN}\left(z^l\right)\right) + z^l,$$
$$z^{l+1} = \mathrm{MLP}\left(\mathrm{LN}\left(y^l\right)\right) + y^l,$$

where $y^l$ is the intermediate representation after the self-attention of layer $l$ and $z^{l+1}$ is the output of layer $l$. At the end of the architecture (i.e., at $z^{L-1}$), a linear classifier is applied to classify the encoded input. The decision is made based on the information captured in the last layer, and more specifically in $z_{\mathrm{cls}}^{L}$, the value of the special token $z_{\mathrm{cls}}$ in the last layer $L$. As the transformer, which forms the basis of ViT, is a flexible architecture that can operate on any sequence of input tokens $z \in \mathbb{R}^{N \times d}$, we describe the strategies for video token embedding in detail in the sequel.
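A minimal pre-norm encoder layer implementing these two update rules can be sketched as follows; the embedding dimension, number of heads, and MLP expansion ratio are assumed values.

```python
# Minimal sketch of one pre-norm encoder layer: y = MHSA(LN(z)) + z, z' = MLP(LN(y)) + y.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(                 # two linear projections separated by a GELU
            nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d))

    def forward(self, z):                         # z: (B, N + 1, d)
        zn = self.ln1(z)
        y = self.mhsa(zn, zn, zn, need_weights=False)[0] + z
        return self.mlp(self.ln2(y)) + y
```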

B. V3Trans-Crowd Architecture

As ViT initially operates on images, a spatio-temporal information fusion is needed in order to process videos and capture both space and time information.

The first step consists of the video embedding, which converts the time-series information of the video into tokens digestible by the transformer input. To achieve this, every video $V$ is represented as $V \in \mathbb{R}^{T \times H \times W \times C}$, where $T$ is the number of frames in the video. In order to extract the time and space tokens from the videos, we propose to employ spatio-temporal tubes instead of convolution layers since, unlike images, the input is now a 4D volume. These tubes have dimensions $t \times h \times w \times C$ and operate on the total 4D input volume, producing $n_t \times n_h \times n_w$ tokens, where $n_t = T/t$, $n_h = H/h$, and $n_w = W/w$ are the numbers of tokens extracted along the temporal, height, and width dimensions, respectively. The result corresponds to a sequence of tokens $\tilde{z}$, each covering a tube in $\mathbb{R}^{t \times h \times w \times C}$. This sequence is then linearly projected into $\mathbb{R}^{N \times d}$, where $d$ is now $d = t \times h \times w \times C$. This 3D ViT embedding corresponds to a 3D convolution operation.
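As a possible illustration of this equivalence, the sketch below implements the tubelet embedding as a strided 3D convolution whose kernel and stride both equal the tube size; the tube size $(t, h, w) = (2, 16, 16)$ and the token dimension are assumed values.

```python
# Possible tubelet-embedding sketch: tube extraction + linear projection as a strided Conv3d.
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    def __init__(self, C=3, t=2, h=16, w=16, d=768):
        super().__init__()
        # kernel == stride, so each output element covers one non-overlapping tube
        self.proj = nn.Conv3d(C, d, kernel_size=(t, h, w), stride=(t, h, w))

    def forward(self, video):                     # video: (B, C, T, H, W)
        x = self.proj(video)                      # (B, d, T/t, H/h, W/w)
        n_t, n_h, n_w = x.shape[2:]
        tokens = x.flatten(2).transpose(1, 2)     # (B, N, d) with N = n_t * n_h * n_w
        return tokens, (n_t, n_h, n_w)
```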

After that, we introduce positional embeddings for the $N$ patch tokens. However, unlike in ViT, these positional encodings do not only capture the position of the patches within a frame, but also the order of the frames in the video sequence.

Now, after transforming the video into tokens, we move on to the transformer encoding architecture. Instead of computing attention over all pairs of tokens in the MHSA block, we propose to compute the spatial self-attention block and the temporal self-attention block separately. As Fig. 2 shows, the total number of heads is divided in two: the first half attends over one dimension of the input tokens, namely the spatial dimension (i.e., left side), while the second half attends over the other dimension, namely the temporal dimension (i.e., right side). For each part, the attention operation is a parameterized function that learns the mapping between a query $q$ and the corresponding key $k$ and value $v$ representations in a sequence $\tilde{z}$.

In ViT, the attention is computed by measuring the similarity between two elements of $\tilde{z}$ and their key-value pairs, using the learnable weight matrices $W_k$, $W_q$, and $W_v$ for the keys, queries, and values, respectively, according to the following expression:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}, \qquad (2)$$

where $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are, as in ViT, the queries, keys, and values of the transformer encoder. The queries $\mathbf{Q} = \tilde{z}W_q$, the keys $\mathbf{K} = \tilde{z}W_k$, and the values $\mathbf{V} = \tilde{z}W_v$ are linear projections of the input $\tilde{z}$, with $\tilde{z}, \mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N \times d}$. However, as this formulation represents an unfactorised case in which the spatial and temporal dimensions are merged into a single dimension $d = t \times h \times w \times C$, we propose in our V3Trans-Crowd model to change this formulation and separate the temporal and spatial spaces by modifying the key and value tokens. In fact, we transform $\mathbf{K} \in \mathbb{R}^{t \times h \times w \times C}$ and


Fig. 3: Illustration of the possible crowd behaviors for each category in the Crowd-11 dataset [15].

TABLE I: Number of Crowd-11 video clips per class: comparison between the original distribution and ours after data processing.

Label Class name #clips (original) #clips (after data processing)

0 Gas Free 529 461

1 Gas Jammed 520 506

2 Laminar Flow 1304 1052

3 Turbulent Flow 892 877

4 Crossing Flows 763 705

5 Merging Flow 295 251

6 Diverging Flow 184 182

7 Static Calm 737 643

8 Static Agitated 410 327

9 Interacting Crowd 248 166

10 No Crowd 390 371

$\mathbf{V} \in \mathbb{R}^{t \times h \times w \times C}$ into $\mathbf{K}_s, \mathbf{V}_s \in \mathbb{R}^{h \times w \times C}$ and $\mathbf{K}_t, \mathbf{V}_t \in \mathbb{R}^{t}$, where $\mathbf{K}_s$, $\mathbf{V}_s$ are the spatial keys and values and $\mathbf{K}_t$, $\mathbf{V}_t$ are the temporal keys and values. Subsequently, we introduce changes to (2). In fact, for half of the attention heads, we attend over tokens from the spatial dimension by computing $\mathbf{Y}_s = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}_s, \mathbf{V}_s)$, and for the rest we attend over the temporal dimension by computing $\mathbf{Y}_t = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}_t, \mathbf{V}_t)$. Given that we are only changing the attention neighbourhood for each query, the attention operation has the same dimension as in the unfactorised case, namely $\mathbf{Y}_s, \mathbf{Y}_t \in \mathbb{R}^{N \times d}$. Consequently, we obtain the fused feature by applying a Layer Normalization to the concatenation, $\mathbf{Y} = \mathrm{LN}(\mathrm{Concat}(\mathbf{Y}_s, \mathbf{Y}_t)\,\mathbf{W}_O)$, where $\mathrm{Concat}$ represents the concatenation operation and $\mathbf{W}_O$ is a learnable weight matrix. At the end, and similarly to ViT, a linear classifier with $M$ classes is added to layer $L-1$.
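The following sketch is one possible interpretation of this factorized attention, in the spirit of the factorized dot-product attention of [12]: half of the heads restrict their attention to tokens of the same frame (spatial neighbourhood), the other half to tokens at the same spatial location across frames (temporal neighbourhood), and the two halves are concatenated, projected by $\mathbf{W}_O$, and layer-normalized. All sizes and module names are illustrative assumptions.

```python
# Hedged sketch of factorized spatial/temporal multi-head attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedSpaceTimeAttention(nn.Module):
    def __init__(self, d=768, heads=12):
        super().__init__()
        assert heads % 2 == 0
        self.heads, self.dh = heads, d // heads
        self.qkv = nn.Linear(d, 3 * d)
        self.w_o = nn.Linear(d, d)
        self.ln = nn.LayerNorm(d)

    def forward(self, z, n_t, n_s):               # z: (B, n_t * n_s, d)
        B, N, d = z.shape
        q, k, v = self.qkv(z).chunk(3, dim=-1)

        def split(x):                             # -> (B, heads, n_t, n_s, dh)
            return x.view(B, n_t, n_s, self.heads, self.dh).permute(0, 3, 1, 2, 4)

        q, k, v = split(q), split(k), split(v)
        half = self.heads // 2
        # spatial heads: attention over tokens of the same frame (n_s axis)
        ys = F.scaled_dot_product_attention(q[:, :half], k[:, :half], v[:, :half])
        # temporal heads: attention over the same location across frames (n_t axis)
        qt, kt, vt = (x[:, half:].transpose(2, 3) for x in (q, k, v))
        yt = F.scaled_dot_product_attention(qt, kt, vt).transpose(2, 3)
        y = torch.cat([ys, yt], dim=1)            # concatenate the two head groups
        y = y.permute(0, 2, 3, 1, 4).reshape(B, N, d)
        return self.ln(self.w_o(y))               # LN(Concat(Y_s, Y_t) W_O)
```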

III. EXPERIMENTS AND DISCUSSIONS

In this section, we conduct several experiments to evaluate the performance of the proposed V3Trans-Crowd framework.

We pick state-of-the-art 3D processing models, mainly C3D, MLP, I3D, and ResNet 3D, for comparison and benchmarking.

We choose to train V3Trans-Crowd using Stochastic Gradient Descent (SGD). Given a video with true category $y$ and predicted category $\hat{y}$, the chosen cross-entropy loss function is defined as:

$$\mathcal{L} = -\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{m} y_{ij}\log\left(\hat{y}_{ij}\right), \qquad (3)$$

where $M$ is the batch size and $m$ is the number of categories.
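In practice, this loss reduces to the standard mean cross-entropy over a batch; a small sketch with illustrative tensor shapes is given below.

```python
# Sketch of Eq. (3): mean cross-entropy over a batch of M clips with m = 11 categories.
import torch
import torch.nn.functional as F

logits = torch.randn(32, 11)              # batch of M = 32 clips, m = 11 classes
targets = torch.randint(0, 11, (32,))     # true category y for each clip
loss = F.cross_entropy(logits, targets)   # = -(1/M) sum_i sum_j y_ij log(y_hat_ij)
```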

We test our model on a large-scale dataset, namely Crowd-11 [15]. Crowd-11 consists of more than 6000 fully annotated clips containing real videos of typical crowd behaviors. The video clips are classified into the 11 categories depicted in Fig. 3.

We compute some statistics on this dataset, such as the number of videos per class. The results are shown in Table I.

For the dataset partitioning, for each class $y_i$ with $N_{y_i}$ video clips, we use $0.7 \times N_{y_i}$ clips for the training dataset, $0.2 \times N_{y_i}$ for the test dataset, and $0.1 \times N_{y_i}$ for the validation dataset.
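A possible sketch of this per-class 70/20/10 partitioning is shown below, assuming a list of (clip, label) pairs; the helper name is hypothetical.

```python
# Hedged sketch of a stratified 70/20/10 split over (clip, label) pairs.
import random
from collections import defaultdict

def stratified_split(samples, seed=0):
    by_class = defaultdict(list)
    for clip, label in samples:
        by_class[label].append((clip, label))
    train, test, val = [], [], []
    rng = random.Random(seed)
    for label, clips in by_class.items():
        rng.shuffle(clips)
        n = len(clips)
        n_train, n_test = int(0.7 * n), int(0.2 * n)
        train += clips[:n_train]                     # 70% of each class for training
        test += clips[n_train:n_train + n_test]      # 20% for testing
        val += clips[n_train + n_test:]              # remaining ~10% for validation
    return train, test, val
```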

For the SGD settings, we set the momentum of the optimizer to 0.85 and the weight decay to 0.005. All of the algorithms were trained using a Tesla V100 GPU. For the training process, we train these models for 150 epochs with a batch size of 32. For our transformer model, we choose $h = 16$ and $w = 16$ for our 3D embedding, and we set the parameters of the transformer (i.e., number of layers and hidden dimension) according to Google Research's work [12]. The benchmark C3D model was pre-trained on the Sports-1M dataset. The I3D model was pre-trained on ImageNet and then on the RGB version


TABLE II: Top-1 accuracy results on the Crowd-11 dataset with multiple models.

Class MLP for video C3D pre-trained I3D pre-trained ResNet 3D pre-trained V3Trans-Crowd pre-trained

Gas Free 77.41% 86.71% 85.56% 84.32% 87.41%

Gas Jammed 75.06% 84.32% 84.72% 84.46% 85.52%

Laminar Flow 55.33% 57.21% 57.47% 63.12% 61.39%

Turbulent Flow 35.61% 34.12% 35.51% 37.12% 42.71%

Crossing Flows 56.41% 57.32% 59.31% 57.45% 60.33%

Merging Flow 31.17% 36.51% 37.14% 34.12% 36.68%

Diverging Flow 33.51% 31.33% 35.31% 34.52% 38.16%

Static Calm 79.42% 88.42% 89.42% 89.67% 90.12%

Static Agitated 53.51% 53.16% 57.31% 57.56% 61.31%

Interacting Crowd 80.41% 80.13% 81.15% 84.42% 83.52%

No Crowd 90.56% 90.97% 92.14% 91.15% 93.31%

Average Acc per clip 56.81% 60.58% 61.12% 61.58% 63.32%

Fig. 4: Confusion Matrix using C3D Sport-1M pretrained Model on the Crowd-11 dataset.

of Kinetics. Our approach was pre-trained on the ImageNet dataset.
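A plausible training loop matching the reported hyperparameters (SGD with momentum 0.85, weight decay 0.005, 150 epochs, batch size 32) is sketched below; the learning rate, model, and data loader are placeholders, since they are not specified above.

```python
# Hedged training-loop sketch; `model` and `train_loader` are assumed to exist.
import torch

def train(model, train_loader, device="cuda", epochs=150):
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,   # lr not reported; assumed
                                momentum=0.85, weight_decay=0.005)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for clips, labels in train_loader:        # clips: (32, C, T, H, W) per batch
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
```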

As crowd behavior analysis can be viewed as a video classification task, it is appropriate to rely on classification metrics to assess the performance of V3Trans-Crowd. We use the top-1 accuracy and the confusion matrix as the evaluation metrics for our prediction model. The top-1 accuracy can be defined as:

$$\mathrm{Acc}_i = \frac{1}{N}\sum_{j=1}^{M} T\{y_j, \hat{y}_j\}, \qquad T\{y_j, \hat{y}_j\} = \begin{cases} 1 & \text{if } \hat{y}_j = y_j,\\ 0 & \text{if } \hat{y}_j \neq y_j, \end{cases} \qquad (4)$$

where the index $i$ indicates the $i$-th category, $j$ runs over the clips, and $N$ is the total number of categories in the dataset.
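The sketch below shows one way to compute the per-class top-1 accuracy of (4) and the confusion matrices of Figs. 4 and 5, assuming 1-D tensors of predicted and true class indices over the test set.

```python
# Hedged sketch of per-class top-1 accuracy and confusion-matrix computation.
import torch

def per_class_accuracy(preds, labels, num_classes=11):
    acc = torch.zeros(num_classes)
    for i in range(num_classes):
        mask = labels == i
        if mask.any():
            acc[i] = (preds[mask] == i).float().mean()   # fraction of class-i clips classified as i
    return acc

def confusion_matrix(preds, labels, num_classes=11):
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for t, p in zip(labels.tolist(), preds.tolist()):
        cm[t, p] += 1                                    # rows: true class, cols: predicted class
    return cm
```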

Fig. 4 and Fig. 5 show the confusion matrices for the C3D Sports-1M pre-trained model and for our V3Trans-Crowd model pre-trained on ImageNet, respectively. The computed confusion matrices show that V3Trans-Crowd achieves high accuracies across the classes. We notice that our V3Trans-Crowd model outperforms the Sports-1M pre-trained model on class 8 by more than 10%. Also, classes 4 and 5 achieved the lowest accuracy, given that these

Fig. 5: Confusion Matrix for the V3Trans-Crowd with ImageNet pre-trained Model on the Crowd-11 dataset.

two classes have the lowest number of samples. Overall, and despite the fact that the dataset is unbalanced, V3Trans-Crowd has been able to achieve better results in terms of detection.

In the second simulation, we compute the classification accuracy using (4). The result of this experiment is illustrated in Table II. As we can notice, our model V3Trans-Crowd outperformed the other models, achieving 63.32% accuracy. In fact, it outperformed ResNet 3D, which reached 61.58% accuracy, and the Inflated 3D ConvNet, which reached 61.12% accuracy. However, it still did not achieve the top accuracy in the detection of some classes such as Interacting Crowd and Laminar Flow.

IV. CONCLUSION

In this paper, we developed a novel visual Transformer model for deep learning-based crowd behavior analysis and monitoring. The proposed model, named V3Trans-Crowd, integrates the spatial and temporal dimensions to classify videos depicting crowd movement and to categorize the crowd behavior accordingly. Our experiments showed that our proposed model pre-trained on ImageNet outperforms other state-of-the-art deep learning models such as the pre-trained ResNet 3D,


the pre-trained I3D, and the pre-trained C3D models when applied to the Crowd-11 dataset. The proposed model has shown efficiency in detecting many classes related to crowd behavior. However, although it performs better than the benchmark models, it still does not achieve sufficiently high accuracy for some of the crowd behavior classes. As a direct ongoing work, we will improve the model to better distinguish between crowd behavior categories sharing very similar spatial and temporal features and patterns. Moreover, we will integrate the proposed architecture into a more generalized system allowing the tracking of the entire crowd or of sub-groups of people detected using the proposed V3Trans-Crowd.

REFERENCES

[1] A. Hamrouni, T. Alelyani, H. Ghazzai, and Y. Massoud, "Toward collaborative mobile crowdsourcing," IEEE Internet of Things Magazine, vol. 4, no. 2, pp. 88–94, 2021.

[2] S. Hamrouni, H. Ghazzai, H. Menouar, and Y. Massoud, "An improved dilated convolutional network for herd counting in crowded scenes," in 2020 IEEE 63rd International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 1024–1027, IEEE, 2020.

[3] W. Liu, M. Salzmann, and P. Fua, "Counting people by estimating people flows," arXiv:2012.00452, 2020.

[4] D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao, "Density map regression guided detection network for RGB-D crowd counting and localization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1821–1830, 2019.

[5] M. Bendali-Braham, J. Weber, G. Forestier, L. Idoumghar, and P.-A. Muller, "Transfer learning for the classification of video-recorded crowd movements," in 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA), pp. 271–276, IEEE, 2019.

[6] Z. Lin, J. Feng, Z. Lu, Y. Li, and D. Jin, "DeepSTN+: Context-aware spatial-temporal neural network for crowd flow prediction in metropolis," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 1020–1027, 2019.

[7] J. Gao, M. Gong, and X. Li, "Audio-visual representation learning for anomaly events detection in crowds," arXiv:2110.14862, 2021.

[8] J. Gao, M. Gong, and X. Li, "Congested crowd instance localization with dilated convolutional swin transformer," arXiv:2108.00584, 2021.

[9] W. Kim, B. Son, and I. Kim, "ViLT: Vision-and-language transformer without convolution or region supervision," arXiv:2102.03334, 2021.

[10] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, "VideoBERT: A joint model for video and language representation learning," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 7463–7472, IEEE, 2019.

[11] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, "Video swin transformer," arXiv:2106.13230, 2021.

[12] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A video vision transformer," arXiv:2103.15691, 2021.

[13] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al., "A survey on visual transformer," arXiv:2012.12556, 2020.

[14] M. Wu, Y. Qian, X. Liao, Q. Wang, and P.-A. Heng, "Hepatic vessel segmentation based on 3D swin-transformer with inductive biased multi-head self-attention," arXiv:2111.03368, 2021.

[15] C. Dupont, L. Tobias, and B. Luvison, "Crowd-11: A dataset for fine grained crowd behaviour analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–16, 2017.
