UDC 004.032.26. DOI 10.52167/1609-1817-2022-123-4-217-225. E.N. Amirgaliyev 1, I.N. Bukenova 2, G.S. Bukenov 2
1International University of Information Technologies, Kazakhstan, Almaty,
2Almaty Technological University, Kazakhstan, Almaty, Е-mail: [email protected]
STUDY OF VIOLENCE RECOGNITION FROM VIDEO BASED ON ARTIFICIAL INTELLIGENCE
Abstract. This paper discusses current approaches to recognizing violent human actions and reports the recognition accuracy of these methods. In the modern world, thanks to the development of sensor and visual technologies, systems for recognizing violent human actions have become popular. The problem of aggression, including children's aggression, has long attracted the attention of psychologists. Violence detection methods fall into several separate groups: methods based on machine learning, methods based on SVM, and methods based on deep learning. The purpose of this study is to review current software solutions for recognizing violent movements from video. In the course of this work, existing machine learning and artificial intelligence algorithms for automatic aggression detection were analyzed, and the application of the PoseNet method for aggression detection was tested and proposed. An attractive feature of the algorithm is that its performance is not affected by the number of people in the input image. Pose estimation models take the processed camera image as input and output information about key points.
Keywords: human action recognition, deep learning, support vector machine (SVM), artificial neural network (ANN), recurrent neural networks.
Introduction.
The problem of recognizing human actions from 3D skeleton data and/or RGB images is not new. However, methods for solving the problem are constantly evolving and improving.
Classical methods based on manually engineered features extracted from 3D skeleton data (such as the covariance of 3D joint positions over an action sequence, histograms of 3D skeletal joint locations (HOJ3D), mappings of skeletal configurations and actions to points and curves, skeletal joint angles, joint angular velocities, and joint velocities), combined with conventional classifiers (such as the support vector machine (SVM), hidden Markov model (HMM), and random forest), have been proposed in prior work. The development of automated psycho-emotional state detection systems has also become an urgent need, reducing wasted labor and time. Detecting aggression in video is a challenging task because the definition of aggression can be ambiguous and uncertain.
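As a minimal illustration of one of the handcrafted skeleton features listed above, the angle at a joint (for example, an elbow) can be computed directly from three 3D joint positions. The function below is an illustrative sketch, not taken from any of the cited works:

```python
import math

def joint_angle(a, b, c):
    # Angle at joint b, in degrees, formed by the segments b->a and b->c,
    # e.g. the elbow angle from (shoulder, elbow, wrist) 3D positions.
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    return math.degrees(math.acos(dot / (n1 * n2)))
```

Computed per frame, such angles (and their velocities) form a feature vector that can be fed to a conventional classifier such as an SVM or a random forest.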
This paper examines one important aspect of aggression detection, which is the recognition and detection of violence.
The purpose of this study is to examine current software solutions for recognizing violent movements from videos.
One in three Kazakhstani schoolchildren have experienced bullying and harassment.
This is the data from monitoring conducted in five schools in three regions of the country.
Over 50% of children surveyed reported bullying in their school and classroom. 33.6 percent of students said they were victims of cyberbullying. Most often it was girls who received offensive messages via phone, messenger, or social media they use [14].
Almost half of the children admitted that it is high school students who resort to bullying. In second place were peers. 10.3% of respondents identified teachers as initiators of bullying, and 5.6% identified their parents and older brothers and sisters. Students in grades 4 to 7 are more likely to be humiliated by older students.
The consequences of bullying can be serious for both victims and bullies. Victimization has been found to contribute to depression, somatic illness, dissatisfaction with life, and low rates of college enrollment for victims of bullying 3 years after the incident.
A Canadian study examined the relationship between suicidality and bullying and sexual harassment among adolescents. Repeated online and offline bullying resulting in PTSD symptoms, together with a lack of maternal support, contributed to active suicidal ideation in adolescent girls aged 12 to 18, while sexual harassment was not associated with suicidality [12].
Similarly, the relationship between female gender and suicidal ideation was confirmed in a Chinese study with a sample of 23,392 high school students that examined the relationship between aggressive behavior and suicidal ideation and attempts, as well as the influence of gender. Victims, bullies, and bully-victims had a higher risk of suicidal ideation than neutral students, and similar patterns were found for suicide attempts. Further stratification analysis showed that the associations of bullying or intimidating others with suicidal ideation and suicide attempts were slightly stronger in girls than in boys [13].
Modeling methods used to detect emotion or violent acts can be broadly classified as shallow and deep.
Shallow modeling methods are methods that cannot extract features on their own [1]. Features extracted by manual methods must be provided to a shallow model for classification. Classifiers such as the support vector machine (SVM) or an artificial neural network (ANN) with one hidden layer can be used as a shallow model. These models are best suited for supervised learning, which requires well-labeled data. Their main disadvantage is that they do not automatically adapt to changes in the patterns.
Unlike shallow models, most deep models do not require a separate feature extractor, because they rely on feature learning: they learn their own features from the given data and classify based on them. Moreover, besides end-to-end learning, the extracted features can be fed as input to SVMs and other shallow classifiers.
Another way to implement deep models is to use features from manually created feature descriptors and give them to a deep classifier. These models work with both supervised and unsupervised learning methods. Although they work with unlabeled data, they require large amounts of data and processing power.
In computer vision, action recognition is becoming an important area of research. Tasks such as detecting aggressive behavior or fights are relatively poorly studied, but they can be useful in many surveillance scenarios, such as prisons, psychiatric wards, or schools. This broad practical applicability creates interest in the development of violence and fight detectors.
Materials and methods.
Violence detection methods are classified into three categories depending on the classifier used: machine learning violence detection methods, SVM violence detection methods, and deep learning violence detection methods. SVM and deep learning are listed separately because these algorithms are especially widely used in computer vision. Table 1 lists violence detection methods based on SVMs. SVM (support vector machine) is a supervised learning algorithm for solving classification problems. The main goal of SVM as a classifier is to find the equation of a separating hyperplane ⟨w, x⟩ + b = 0 in the space R^n that separates the two classes in some optimal way.
The algorithm classifies an object x_i correctly if the margin condition y_i(⟨w, x_i⟩ + b) ≥ 1 holds, where y_i ∈ {-1, +1} is the class label. The resulting constrained optimization problem is solved analytically through the Karush-Kuhn-Tucker conditions and is equivalent to the dual problem of finding the saddle point of a Lagrange function.
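The hard-margin classification condition can be checked directly for a candidate hyperplane. The snippet below is an illustrative sketch, with w, b, and the data assumed to be plain Python lists:

```python
def margin_satisfied(w, b, X, y):
    # Checks the hard-margin condition y_i * (<w, x_i> + b) >= 1
    # for every training sample (x_i, y_i), with y_i in {-1, +1}.
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    return all(yi * (dot(w, xi) + b) >= 1 for xi, yi in zip(X, y))
```

An SVM solver searches for the (w, b) with the largest margin among all hyperplanes satisfying this condition.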
Table 1 - Methods of violence detection using SVMs

| Method | Object detection method | Feature extraction method | Scene type | Accuracy |
|---|---|---|---|---|
| Real-time detection of violence in crowded scenes [1] | ViF descriptor | Feature set | Crowded | 88% |
| Bag-of-words framework that uses acceleration to detect actions [2] | Background subtraction algorithms | Ellipse estimation over consecutive frames | Less crowded | Approximately 90% |
| Genetic algorithm framework with tracking and detection module [3] | Gaussian model | Optical flow extraction algorithm | Crowded | 82%-89% |
| Multimodal features of the subclass-based framework [4] | CNN and ImageNet | GoogLeNet for feature extraction | Less crowded | 98% |
| Determining the frequency of violent actions [5] | Spatial pyramids and grids for object detection | Spatio-temporal grid methods for feature extraction | Crowded | 96%-99% on different datasets |
| Violence detection using oriented violent flows [6] | Optical flow method | Combination of ViF and OViF descriptors | Crowded | 90% |
| Combined AEI and HOG framework for recognizing anomalous events in visual motion [7] | AEI method for background subtraction | HOG and spatio-temporal feature extraction methods | Both crowded and less crowded | 94%-95% |
| Framework with preprocessing, activity detection, and image extraction that identifies abnormal events and images in the data [8] | Optical flow and temporal difference for object detection | CBIR image extraction method; Gaussian function for video file analysis | Less crowded | 97% |
| Late fusion method with temporal perception layers for high-level activity detection, using multiple cameras (1 to N) [9] | Motion vector method for identification from multiple cameras in two dimensions | SGT MtPL method | Less crowded | 98% |
| Two-channel convolutional neural network for real-time detection [10] | ImageNet for object detection | VGG-f model for feature extraction | Crowded | 91%-94% |
| Solving the detection problem by dividing the target by depth and a clear format with Connect [11] | Motion detection and the Trof model | BoW approach | Less crowded | 96% |
The deep learning paradigm was first applied to this task using a 3D CNN that takes a full video sequence as input. However, human motion features are crucial for this task, and using the full video as input introduces noise and redundancy into the learning process. For this reason, Goyal R., Kahou S. E., Michalski V., and Materzyńska J. proposed a hybrid "handcrafted/learned" framework [15]. The method first obtains an illustrative image from the input video sequence for feature extraction, using a Hough forest as the classifier. A 2D CNN is then used to classify this image and obtain an inference for the whole sequence. This method has produced excellent results and is one of the best approaches to representing handcrafted features. However, the dual-stream architecture may not be suitable for real-time applications because of its computational complexity.
The reported results show that the proposed method outperforms various handcrafted and deep learning methods in terms of accuracy and standard deviation.
A three-stage end-to-end deep learning framework is proposed to recognize violent actions [16]. First, people are detected in surveillance video streams using a lightweight CNN model, which avoids heavy processing of unsuitable frames.
Second, sequences of about 16 frames containing detected people are passed to a 3D CNN, where the spatio-temporal characteristics of these sequences are extracted and fed into a softmax classifier. The 3D CNN model is then optimized using the neural network optimization and open visual inference tools developed by Intel. The trained model is converted into an intermediate representation and adapted to run on the target platform for final violence detection. Once violence is detected, an alarm is transmitted to a nearby security department or police station so that action can be taken.
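The clip-batching step described above, grouping roughly 16 frames per clip before the 3D CNN, can be sketched as follows. This is a simplified illustration, not the authors' implementation; the clip length is a parameter:

```python
def make_clips(frames, clip_len=16):
    # frames: an ordered list of video frames (any per-frame representation).
    # Groups them into consecutive, non-overlapping clips of clip_len frames;
    # a trailing remainder shorter than clip_len is dropped.
    clips = []
    for i in range(0, len(frames) - clip_len + 1, clip_len):
        clips.append(frames[i:i + clip_len])
    return clips
```

Each clip is then stacked along the temporal axis to form the spatio-temporal input tensor of the 3D CNN.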
Body action is also a critical modality that has attracted increasing attention in recent years.
Recognizing and detecting violence is becoming an important topic for outdoor surveillance video. The main goal is to determine whether violence is occurring. First, an extension of improved Fisher vectors (IFV) is proposed for video clips [11]. Local features and their spatio-temporal positions are used to represent the video. The popular sliding-window approach is then explored to detect violence. To speed up the approach, a summed-area-table data structure is used and the IFV formulas are reformulated. Second, local spatio-temporal features are extracted from the video using improved dense trajectories (IDT). The video representation for each descriptor is then computed independently as a HOG and encoded with IFV. A linear SVM classifier is then used to recognize violence, and finally violence is detected using a fast sliding-window approach. Extensive evaluation is performed on four state-of-the-art datasets of violent crowd scenes, movies, and hockey games; the Violence-Flow dataset is used for the violence detection task. The results show that the proposed approaches outperform existing ones.
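The fast sliding-window scan can be sketched in one dimension using prefix sums, the 1D analogue of the summed-area table mentioned above. The per-frame scores, window length, and threshold here are hypothetical:

```python
def detect_windows(scores, win=8, thresh=0.5):
    # scores: per-frame violence scores. Prefix sums (a 1D "summed-area
    # table") make each window's mean an O(1) lookup instead of an O(win) sum.
    prefix = [0.0]
    for s in scores:
        prefix.append(prefix[-1] + s)
    hits = []
    for i in range(len(scores) - win + 1):
        mean = (prefix[i + win] - prefix[i]) / win
        if mean > thresh:
            hits.append((i, i + win))  # half-open frame interval [i, i + win)
    return hits
```

The same trick extends to 2D spatial windows, which is what makes the dense sliding-window detection in [11] tractable.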
Visual action recognition is the primary method for recognizing bullying and aggression.
Applying this method, the aggression of an observed person can be estimated using two supervised multiple-timescale recurrent neural networks (MTRNN) [10]. The first layer recognizes human actions, and the second layer predicts human intentions based on the results of the first layer.
In addition, action features were studied in a hierarchical way to understand spontaneous emotions. This approach applied a multichannel convolutional neural network (CCNN) to integrate multimodal recognition of nonverbal emotions, including facial expressions and body movements.
PoseNet model.
We tested the PoseNet method, which returns a confidence value for each detected person and the key points of each detected pose.
PoseNet can be used to evaluate a single pose or multiple poses, which means that one version of the algorithm can detect only one person in the image/video, while another version can detect multiple people in the image/video.
Software implementation:
- Skeletons of human figures are detected in the video using the PoseNet model;
- then the keypoint coordinates are used as features for a classifier over aggressive-action classes.
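A minimal sketch of the second step, assuming each pose is a list of 17 (x, y) pixel keypoints from the pose model: the coordinates are flattened and normalized by the frame size before being handed to a classifier. This is an illustrative preprocessing step, not the exact pipeline used in the study:

```python
def pose_to_features(keypoints, width, height):
    # keypoints: list of 17 (x, y) pixel coordinates from the pose model.
    # Flattens them into a 34-dimensional vector, normalized by the frame
    # size so the classifier does not depend on the camera resolution.
    feats = []
    for x, y in keypoints:
        feats.append(x / width)
        feats.append(y / height)
    return feats
```

Any conventional classifier (an SVM, a small ANN) can then be trained on these vectors with aggressive-action class labels.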
The pose evaluation models take the processed camera image as input and output the keypoint information. The detected key points are indexed by part ID with a confidence score ranging from 0.0 to 1.0. The confidence score indicates the probability that a key point exists at that position. The highlights of this method are described below.
Pose confidence score - determines the overall confidence in the pose estimate.
It is a value between 0.0 and 1.0 and can be used to hide poses that are not estimated confidently enough.
Pose - at the top level, PoseNet returns a pose object that contains a list of key points for each detected person and an instance-level confidence score.
Keypoint confidence score - determines the confidence that the estimated keypoint location is accurate. It is a value between 0.0 and 1.0 and can be used to hide keypoints with insufficient confidence.
Keypoint location - the two-dimensional x and y coordinates of a detected keypoint in the original input image.
A key point is a part of a person's estimated pose, such as the nose, right ear, left knee, or right foot. It carries both a position and a confidence estimate. PoseNet can currently detect 17 key points.
The different body joints detected by the pose assessment model are shown in Table 2.
Table 2 - Body joints detected by the PoseNet model

| Id | Part |
|---|---|
| 0 | nose |
| 1 | left eye |
| 2 | right eye |
| 3 | left ear |
| 4 | right ear |
| 5 | left shoulder |
| 6 | right shoulder |
| 7 | left elbow |
| 8 | right elbow |
| 9 | left wrist |
| 10 | right wrist |
| 11 | left hip |
| 12 | right hip |
| 13 | left knee |
| 14 | right knee |
| 15 | left ankle |
| 16 | right ankle |
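Table 2 can be encoded directly as a lookup from part id to body-part name; the mapping below simply transcribes the table:

```python
# Part names indexed by the PoseNet part id from Table 2.
POSENET_PARTS = [
    "nose", "left eye", "right eye", "left ear", "right ear",
    "left shoulder", "right shoulder", "left elbow", "right elbow",
    "left wrist", "right wrist", "left hip", "right hip",
    "left knee", "right knee", "left ankle", "right ankle",
]

def part_name(part_id):
    # Maps a detected keypoint's part id (0-16) to its body-part name.
    return POSENET_PARTS[part_id]
```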
Results.
The result of PoseNet is a representation of the human figure with 17 key points, each with coordinates and a confidence score. These 17 points are: the nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. An example of PoseNet identifying the 17 key points is shown in Figure 1.
Figure 1 - Identification of the 17 key points by the PoseNet network
PoseNet returns a confidence value for each detected person and the key points of each detected pose.
From the top level, pose evaluation occurs in two steps:
1) The RGB image is input into the convolutional neural network.
2) Single-pose or multi-pose decoding algorithms are used to decode poses, pose confidence scores, keypoint positions, and keypoint confidence scores from the model outputs.
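The second step can be illustrated with a simplified single-pose decoder, assuming the network outputs per-keypoint heatmaps plus offset vectors. The tensor layout here (nested lists, (dy, dx) pairs per cell) is an assumption for illustration; the real PoseNet model uses a different layout, but the heatmap-argmax-plus-offset idea is the same:

```python
def decode_single_pose(heatmaps, offsets, output_stride=32):
    # heatmaps: [K][H][W] keypoint score maps on the coarse output grid.
    # offsets:  [K][H][W] of (dy, dx) refinements in input-image pixels.
    # For each keypoint, take the highest-scoring grid cell, map it back
    # to input coordinates via the output stride, and add its offset.
    keypoints = []
    for k in range(len(heatmaps)):
        best = (-1.0, 0, 0)
        for y, row in enumerate(heatmaps[k]):
            for x, s in enumerate(row):
                if s > best[0]:
                    best = (s, y, x)
        score, y, x = best
        dy, dx = offsets[k][y][x]
        keypoints.append({
            "id": k,
            "position": (x * output_stride + dx, y * output_stride + dy),
            "score": score,
        })
    return keypoints
```

The multi-pose decoder is more involved: it must also group keypoints into per-person poses using displacement fields between neighboring joints.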
Discussion.
The single-pose estimation algorithm is simpler and faster. Its ideal scenario is a single person centered in the input image or video. The disadvantage is that if more than one person is present in the image, key points from two people may be assigned by the algorithm to the same pose, i.e. the poses can be merged. If the input image may contain multiple people, the multi-pose estimation algorithm should be used.
The multi-person pose estimation algorithm can estimate multiple poses (people) in the image. It is more complex and slightly slower than the single-pose algorithm, but it has the advantage that when multiple people appear in the image, the detected key points are unlikely to be associated with the wrong pose. For this reason, this algorithm may be preferable even when the application scenario is designed to detect the pose of a single person.
In addition, an attractive feature of the algorithm is that its performance is not affected by the number of people in the input image. Whether it is 15 or 5 people, the computation time is the same.
Conclusion.
The use of automatic violence detection in video footage is essential for surveillance camera analysis and law enforcement to maintain public safety. It also helps protect children from exposure to inappropriate content and helps parents make better decisions about what their children watch. In computer vision, action recognition is becoming an important area of research. Tasks such as detecting aggressive behavior or fights are relatively poorly studied, but can be useful in many video surveillance scenarios. Deep learning has become a very popular direction within machine learning, surpassing traditional approaches in many computer vision applications. A major advantage of deep learning algorithms is their ability to learn features from raw data, eliminating the need for handcrafted feature descriptors.
This paper describes the features of modern artificial neural networks for recognizing aggressive human actions. It analyzes existing machine learning and artificial intelligence algorithms for automatically detecting and classifying physical, social, and other kinds of violence and for detecting the psycho-emotional state of those being observed. A comparative analysis of these algorithms was carried out based on examples from the work and research of other experts. The application of the PoseNet method to aggression detection was tested and proposed.
The study was supported by project AP14871625 of the Ministry of Education of the Republic of Kazakhstan.
REFERENCES
[1] Marinoiu, E., Zanfir, M., Olaru, V., & Sminchisescu, C. (2018). 3d human sensing, action and emotion recognition in robot assisted therapy of children with autism. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2158- 2167).
[2] El-Ghaish, H., Hussein, M. E., Shoukry, A., & Onai, R. (2018). Human action recognition based on integrating body pose, part shape, and motion. IEEE Access, 6, 49040- 49055.
[3] Song, S., Lan, C., Xing, J., Zeng, W., & Liu, J. (2016). An end-to-end spatio- temporal attention model for human action recognition from skeleton data. arXiv preprint arXiv:1611.06067.
[4] Arici, T., Celebi, S., Aydin, A. S., & Temiz, T. T. (2014). Robust gesture recognition using feature pre-processing and weighted dynamic time warping. Multimedia Tools and Applications, 72(3), 3045-3062.
[5] Ji, Y., Cheng, H., Zheng, Y., & Li, H. (2015). Learning contrastive feature distribution model for interaction recognition. Journal of Visual Communication and Image Representation, 33, 340-349.
[6] Huynh-The, T., Le, B. V., Lee, S., & Yoon, Y. (2016). Interactive activity recognition using pose-based spatio–temporal relation features and four-level Pachinko Allocation Model. Information Sciences, 369, 317-333.
[7] Charalampous, K., Kostavelis, I., Boukas, E., Amanatiadis, A., Nalpantidis, L., Emmanouilidis, C., & Gasteratos, A. (2015). Autonomous robot path planning techniques using cellular automata. In Robots and lattice automata (pp. 175-196). Springer, Cham.
[8] H. S. Koppula, R. Gupta, and A. Saxena, “Learning human activities and object affordances from rgb-d videos,” International Journal of Robotics Research, vol. 32, no. 8, pp.
951–970, 2013.
[9] C. Granata, A. Ibanez, and P. Bidaud, “Human activity-understanding: A multilayer approach combining body movements and contextual descriptors analysis,”
International Journal of Advanced Robotic Systems, vol. 12, no. 7, 2015.
[10] V. Dutta and T. Zielinska, “Predicting the intention of human activities for real- time human-robot interaction (hri),” in ICRA, 2016.
[11] http://hacs.csail.mit.edu
[12] Zhao, H., Torralba, A., Torresani, L., & Yan, Z. (2019). HACS: Human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE International Conference on Computer Vision (pp. 8668-8678).
[13] M.Yu. Uzdyaev (2020). Raspoznavanie agressivnyh dejstvij s ispol'zovaniem nejrosetevyh arhitektur 3D-CNN [Recognition of aggressive actions using 3D-CNN neural network architectures]. Izvestiya TulGU. Tekhnicheskie nauki, no. 2.
[14] Nelli Adamenko, IA «NewTimes.kz»
[15] Goyal, R., Kahou, S. E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., & Memisevic, R. (2017). The "Something Something" video database for learning and evaluating visual common sense. In ICCV (vol. 1, no. 4, p. 5).
[16] http://activity-net.org/
Edilkhan Amirgaliyev, Doctor of Technical Sciences, Professor, International University of Information Technologies, Kazakhstan, Almaty, [email protected]
Gani Bukenov, Lecturer, Almaty Technological University, Kazakhstan, Almaty, [email protected]
Indira Bukenova, Master's degree, Lecturer, Almaty Technological University, Kazakhstan, Almaty, [email protected]