VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN MINH HOA
MOTION ANALYSIS FROM ENCODED VIDEO BITSTREAM
MASTER’S THESIS
HA NOI – 2018
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN MINH HOA
MOTION ANALYSIS FROM ENCODED VIDEO BITSTREAM
Major: Computer Science
MASTER’S THESIS
Supervisor: Dr. Do Van Nguyen
Co-Supervisor: Dr. Tran Quoc Long
HA NOI - 2018
AUTHORSHIP
“I hereby declare that the work contained in this thesis is my own and that I have not submitted this thesis to any other institution in order to obtain a degree. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person other than that listed in the bibliography and identified as references.”
Signature: ………
SUPERVISOR’S APPROVAL
“I hereby approve that the thesis in its current form is ready for committee examination as a requirement for the Master of Computer Science degree at the University of Engineering and Technology.”
Signature: ………
Signature: ………
ACKNOWLEDGMENTS
First of all, I would like to express special gratitude to my supervisors, Dr. Do Van Nguyen and Dr. Tran Quoc Long, for their enthusiastic guidance, technical explanations, and advice throughout this project.
I also want to give sincere thanks to Assoc. Prof. Dr. Ha Le Thanh and Assoc. Prof. Dr. Nguyen Thi Thuy for their instruction and for the background knowledge behind this thesis. I would also like to thank my teachers and my friends in the Human Machine Interaction Lab for their support.
I thank my friends and colleagues in the project “Nghiên Cứu Công Nghệ Tóm Tắt Video” (Research on Video Summarization Technology) and the project “Multimedia application tools for intangible cultural heritage conservation and promotion”, project number ĐTDL.CN-34/16, for their work and support.
Last but not least, I want to thank my family and all of my friends for their motivation and support. They stand by me and inspire me whenever I face tough times.
TABLE OF CONTENTS
AUTHORSHIP
SUPERVISOR’S APPROVAL
ACKNOWLEDGMENTS
TABLE OF CONTENTS
ABBREVIATIONS
List of Figures
List of Tables
INTRODUCTION
CHAPTER 1. LITERATURE REVIEW
1.1. Moving object detection in the pixel domain
1.2. Moving object detection in the compressed domain
1.2.1. Motion vector approaches
1.2.2. Size of Macroblock approaches
1.3. Chapter Summarization
CHAPTER 2. METHODOLOGY
2.1. Video compression standard H264
2.1.1. H264 file structure
2.1.2. Macroblock
2.1.3. Motion vector
2.2. Proposed method
2.2.1. Processing the video bitstream
2.2.2. Macroblock-based Segmentation
2.2.3. Object-based Segmentation
2.2.4. Object Refinement
2.3. Chapter Summarization
CHAPTER 3. RESULTS
3.1. The moving object detection application
3.1.1. The process of the application
3.1.2. The motion information
3.1.3. Synthesizing movement information
3.1.4. Storing Movement Information
3.2. Experiments
3.2.1. Dataset
3.2.2. Evaluation methods
3.2.3. Implementations
3.2.4. Experimental results
3.3. Chapter Summarization
CONCLUSIONS
List of author’s publications related to the thesis
REFERENCES
ABBREVIATIONS
MB Macroblock
MV Motion vector
NALU Network Abstraction Layer Unit
RBSP Raw Byte Sequence Payload
SODB String Of Data Bits
List of Figures
Figure 1.1. The process of moving object detection with data in the pixel domain
Figure 1.2. The process of moving object detection with data in the compressed domain
Figure 2.1. The structure of an H264 file
Figure 2.2. RBSP structure
Figure 2.3. Slice structure
Figure 2.4. Macroblock structure
Figure 2.5. The motion vector of a Macroblock
Figure 2.6. The process of the moving object detection method
Figure 2.7. Skipped Macroblock
Figure 2.8. (a) Outdoor and indoor example frames, (b) the “size-map” of the frames, (c) the “motion-map” of the frames
Figure 2.9. Example of the “consistency” of motion vectors
Figure 3.1. The implementation process of the approach
Figure 3.2. Data structure for storing motion information
Figure 3.3. Example frames of the test videos
Figure 3.4. Example frames and their ground truth
Figure 3.5. An example frame of Pedestrians (a) and its ground truth image (b)
List of Tables
Table 2.1. NALU types
Table 2.2. Slice types
Table 3.1. The information of the test videos
Table 3.2. The information of the test sequences in group 1
Table 3.3. The performance of the two approaches on Pedestrians, PETS2006, Highway, and Office
Table 3.4. The experimental results of Poppe’s approach on the 2nd group
Table 3.5. The experimental results of the proposed method on the 2nd group
INTRODUCTION
Today, video content is used extensively in many areas of life, such as indoor monitoring, traffic monitoring, etc. The number of videos shared over the Internet at any given time is also extremely large; according to statistics, hundreds of hours of video are uploaded to Youtube every minute [1]. Moreover, the general trend today is to install surveillance cameras in homes for monitoring and security purposes. These cameras normally operate and store surveillance videos automatically, and the stored video data is revisited only when some special situation or event occurs.
The problem is: how can such a large volume of video be evaluated in a short amount of time? For example, when a burglary or an intrusion occurs, we cannot spend hours checking each previously stored video. A tool that locates the moments at which an object is moving in a long video is therefore essential to reducing the time and effort of searching.
Normally, in order to reduce the size of videos for transmission or storage, a video compression procedure is performed at the surveillance camera. After that, the compressed information, in the form of a bitstream, is stored or transmitted to a server for analysis. The video analysis process needs many features to describe different aspects of the scene. Typically, these features are extracted from the pixel values of each video frame by fully decompressing the bitstream. The decompression procedure requires a device with high computational capacity. However, with the trend of the "Internet of Things", there are many devices with low processing capacity that are not capable of performing this full video decompression at high speed, so it is difficult to run an approach that requires a lot of computing power in real time.
Another way to extract features from a video is to use the data in the compressed video. These data can be transform coefficients, motion vectors, quantization steps, quantization parameters, etc. By processing and analyzing these data, we can handle some important computer vision tasks, including moving object detection, human action detection, face recognition, and moving object tracking.
This thesis proposes a new method to detect moving objects by exploring and applying some motion estimation techniques in the compressed domain of a video. The method is then used to build an application that supports searching for movement in home surveillance videos. The compression format of the videos in the thesis is the H264 standard (MPEG-4 Part 10), a popular video compression standard today.
Aims
The goal of the thesis is to propose a method for determining moving objects in the compressed domain of a video. I then build an application that uses the method to support searching for the moments that contain moving objects in a video.
Object and Scope of the study
Within the framework of the thesis, I study the algorithms related to determining moving objects in video, especially the algorithms that determine moving objects in the compressed domain. The video compression standard used in the thesis is H264/AVC.
The theory of video compression and computer vision is drawn from scientific articles related to video analysis in the compressed domain, in particular the detection of motion in the compressed domain of a video.
The videos for testing and experiments are obtained from both indoor and outdoor surveillance cameras.
Method and procedures
- Literature research: study existing systems for motion analysis and evaluation on compressed video, as well as scientific articles related to the analysis and evaluation of motion on compressed video.
- Experimental research: conduct experimental setups for each theoretical part, such as extracting video data, processing the data, and evaluating motion based on the obtained data.
- Experimental evaluation: each experiment is conducted independently on each module; the modules are then integrated and deployed.
Contributions
The thesis proposes a new moving object detection method for surveillance video encoded with the H264 compression standard, using the motion vectors and the sizes of macroblocks.
Thesis structure
Apart from the introduction, the conclusions, and the references, this thesis is organized into 3 chapters with the following main contents:
Chapter 1 is the literature review. This chapter presents the related work of the thesis, including moving object detection methods in the pixel domain and in the compressed domain.
Chapter 2 introduces the basic knowledge about the H264 video compression standard, such as the H264 file structure, macroblocks, and motion vectors, and describes the details of the moving object detection method, including processing the video bitstream, the macroblock-based segmentation phase, the object-based segmentation phase, and the object refinement phase.
Chapter 3 presents the results of the method, including an application built on the proposed method and the experimental results.
CHAPTER 1.
LITERATURE REVIEW
Today, surveillance cameras are used extensively around the world, and the volume of surveillance video has grown tremendously. Problems often encountered with video surveillance include event searching, motion tracking, abnormal behavior detection, etc. In order to handle these tasks, it is necessary to have a method that can determine the moments in each video at which movement exists.
Usually, the video is compressed for storage and transmission. Previous moving object detection methods usually use data from the pixel images, such as color values, edges, etc. To obtain images that can be displayed or processed, the system must fully decode the video, which consumes a large amount of computing resources, time, and memory. I propose a method that can quickly determine the moving objects in high-resolution videos. The data used in the method is taken from the compressed domain of the video, including the motion vectors and the sizes of the macroblocks (in bits) after encoding. This reduces the processing time considerably compared with methods that operate on pixel-domain data.
The problem of motion detection in a video has long been studied. It is the first step in a series of computer vision problems such as object tracking, object detection, abnormal movement detection, etc. There are usually two approaches to address the problem: using fully decoded video data (pixel-domain data) or using data taken directly from the undecoded video (compressed-domain data). The following sections outline the studies based on these two approaches.
1.1. Moving object detection in the pixel domain
Typically, to reduce the size of the video for transmission, a video encoding process is performed inside the surveillance camera, and the compressed information is transmitted as a bitstream to a server for video analysis. Common video compression standards in use today include MPEG-4, H264, and H265. To be viewable, these compressed videos need to be decoded into image frames. We call these image frames the pixel domain, and the data obtained from them the data in the pixel domain. Fig. 1.1 describes the process of moving object detection methods in the pixel domain. The data in the pixel domain include the color values of the pixels, the number of color channels of each pixel, the edges, etc.
Figure 1.1. The process of moving object detection with data in the pixel domain

To determine moving objects in the pixel domain, background subtraction algorithms are commonly used. Many research results have been introduced over the years. These methods usually exploit the relationship between frames in a time series.
Background subtraction is defined in [2] as: “Background subtraction is a widely used approach for detecting moving objects in videos from static cameras. The rationale in the approach is that of detecting the moving objects from the difference between the current frame and a reference frame, often called the “background image” or “background model”. As a basic, the background image must be a representation of the scene with no moving objects and must be kept regularly updated so as to adapt to the varying luminance conditions and geometry settings.”
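To make the idea concrete, the following is a minimal sketch of frame differencing against a fixed reference frame, assuming OpenCV is available; the input file name and the threshold value 25 are illustrative assumptions, and the surveyed methods below replace this naive fixed background with adaptive models.

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("surveillance.mp4");   // hypothetical input video
    cv::Mat frame, gray, background, diff, mask;

    // Naive background model: the first frame, converted to grayscale.
    cap.read(frame);
    cv::cvtColor(frame, background, cv::COLOR_BGR2GRAY);

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::absdiff(gray, background, diff);     // |current - background| per pixel
        cv::threshold(diff, mask, 25, 255, cv::THRESH_BINARY);
        // 'mask' marks candidate moving pixels; a real method keeps the
        // background model regularly updated, as the definition above requires.
        cv::imshow("motion mask", mask);
        if (cv::waitKey(30) == 27) break;        // stop on ESC
    }
    return 0;
}
```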
Representative results include the methods using a running Gaussian average, such as those of Wren et al. [3] and Koller et al. [4]; the methods using a temporal median filter, such as those of Lo and Velastin [5] and Cucchiara et al. [6]; and the methods using a mixture of Gaussians, such as those of Stauffer and Grimson [7] and of Power and Schoonees [8].
The above methods share a common characteristic: the data they process is obtained by fully decompressing the compressed bitstream, and this decompression procedure requires a highly computational device. However, with the trend of the "Internet of Things", most low-end devices are not capable of performing high-speed decompression. Therefore, there should be a video analysis mechanism that works on the compressed video only.
1.2. Moving object detection in the compressed domain
Normally, videos are encoded using some compression standard. Each compression standard specifies how to shrink the video size through a certain structure, and the compressed videos contain less data. For example, with the H264 compression standard, the data contained in the compressed video includes information about macroblocks, motion vectors, frames, etc. We call these the data in the compressed domain of the video.
Fig. 1.2 shows the process of moving object detection methods by using the data in the compressed domain.
Figure 1.2. The process of moving object detection with data in the compressed domain
In general, the amount of data in the compressed domain is much less than the data in the pixel domain. The idea of using data in the compressed domain of the H264 standard for video analysis has also been investigated by scientists around the world. To detect motion in the compressed domain, two types of data are usually used: the motion vectors and the sizes (in bits) of the macroblocks.
1.2.1. Motion vector approaches
A number of algorithms have been proposed to analyze video content in the H264 compressed domain, and good performance has been obtained [9][10]. Zeng et al. [11] proposed a method to detect moving objects in H264 compressed videos based on motion vectors. Motion vectors are extracted from the motion field and classified into several types. Then, they are grouped into blocks through the Markov Random Field (MRF) classification process. Liu et al.
[12] recognized the shape of an object by using a map for each object. This approach is based on a binary partition tree created by macroblocks. Cipres et al.
[13] presented a moving object detection approach in the H264 compressed domain based on fuzzy logic. The motion vectors are used to remove the noises that appear during the encoding process and represent the concepts that describe the detected regions. Then, the valid motion vectors are grouped into blocks. Each of them could be identified as a moving object in the video scene. The moving objects of each frame are described with common terms like shape, size, position, and velocity. Mak et al. [14] used the length, angle, and direction of motion vectors to track the objects by applying the MRF. Bruyne et al. [15] estimated the
reliability of motion vectors by comparing them with projected motion vectors from surrounding frames. Then, they combined this information with the magnitude of motion vectors to distinguish foreground objects from the background. This method can localize the noisy motion vectors and their effect during the classification can be diminished. Wang et al. [16] proposed a background modeling method using the motion vector and local binary pattern (LBP) to detect the moving object. When a background block was similar to a foreground block, a noisy motion vector would appear. To obtain a more reliable and dense motion vector field, the initial motion vector fields were preprocessed by a temporal accumulation within three inter frames and a 3×3 median filtering.
After that, the LBP feature was introduced to describe the spatial correlation among neighboring blocks. This approach can reduce the time of extracting moving objects while also performing an effective synopsis analysis. Marcus Laumer [17] proposed an approach to segment video frames into the foreground and background and, according to this segmentation, to identify regions containing moving objects. The approach uses a map to indicate the "weight" of each (sub-)macroblock for the presence of a moving object. This map is the input of a new spatiotemporal detection algorithm that is used to refine the weight that indicated the level of motion for each block. Then, quantization parameters of macroblocks are used to apply individual thresholds to the block weights to segment the video frames. The accuracy of the approach was approximately 50%.
To identify human actions, Tom et al. [18] proposed a quick action identification algorithm. The algorithm uses the quantization parameter gradient image (QGI) and motion vectors with support vector machines (SVM) to classify the types of actions. The algorithm can also handle lighting, scale, and some other environmental variables, with an accuracy rate of 85% on videos with resolution 176x144. It can identify walking, running, etc. Similarly, Tom, Rangarajan, and colleagues also used the QGI and motion vectors in a new method to classify human actions, the Projection-Based Learning of the Meta-cognitive Radial Basis Function Network (PBL-McRBFN).
For the motion tracking problem, Biswas et al. [19] proposed a method for detecting abnormal actions by analyzing motion vectors. This method mainly relies on observing the motion vectors to find the difference between abnormal actions and normal situations; the classifier used is the Gaussian Mixture Model (GMM). The approach is based on their earlier approach [20], improved by using the direction of the motion vector; in experiments it runs at about 70 fps. Thilak et al. [21] proposed a Probabilistic Data Association Filter that detects multiple target clusters. This method can handle cases in which targets split into multiple clusters or several clusters should be detected (classified) as a single target. Similarly, You et al. [22] used probabilistic spatio-temporal MB filtering to mark macroblocks as objects and then remove the noise. The algorithm can track many objects with real-time accuracy but can only be applied with a fixed camera, and objects must span at least two macroblocks. Kas et al. [23] overcame the fixed-camera problem using Global Motion Estimation and Object History Images to handle background movement. However, the number of moving objects needs to be small, and the moving objects must not occupy most of the frame area.
1.2.2. Size of Macroblock approaches
The methods mentioned above share the trait of using motion vectors to detect moving objects. However, since motion vectors are created at the video encoder to optimize the video compression ratio, they do not always represent the real motion in the video sequence. As such, due to their coding-oriented nature, the motion vector fields must be preprocessed and refined to remove noise before they can be used to detect moving objects.
So, Poppe et al. [24] proposed an approach to detect moving objects in H264 video by using the sizes of the macroblocks after encoding (in bits). To achieve sub-macroblock-level (4×4) precision, information from the transform coefficients was also utilized. The system achieved high execution speeds, up to 20 times faster than the motion vector-based related works. The analysis was restricted to Predicted (P) frames, and a simple interpolation technique was employed to handle Intra (I) frames. The whole algorithm was based on the assumption that macroblocks containing an edge of a moving object are more difficult to compress, since it is hard to find a good match for them in the reference frame(s).
Based on Poppe’s idea, Vacavant et al. [25] used the macroblock size to detect moving objects by applying the Gaussian mixture model (GMM). The approach can represent the distribution of macroblock sizes well.
Although the methods of Poppe and Vacavant are good at removing background motion noise, they cannot produce good motion detection results for videos at high spatial resolution (such as 1920 × 1080 or 1280 × 720). In cases where the moving objects are large and contain a uniform color region (such as a black car), the sizes of the macroblocks corresponding to the inner region of the moving object will be very small (normally around zero), and using a filtering threshold or parameter (however small) will not be effective. In those cases, the algorithm will classify these regions as background.
1.3. Chapter Summarization
This chapter reviewed the research on moving object detection in both the pixel domain and the compressed domain. The approaches using pixel-domain data usually have high accuracy but take a large amount of computing resources and time. The approaches using compressed-domain data have lower accuracy, because the compressed domain usually contains less information. In the next chapters, I propose a method that can efficiently detect moving objects, especially in high-spatial-resolution video streams. The method uses data taken from the compressed domain of the video: the sizes of the macroblocks to detect the skeleton of a moving object and the motion vectors to detect its details.
CHAPTER 2.
METHODOLOGY

2.1. Video compression standard H264
Before the moving object detection method is proposed, this chapter presents some information about H264, a popular video compression standard, which is used to encode and decode the surveillance videos in the thesis.
Nowadays, the installation of surveillance cameras in homes has become quite common. Video data from a surveillance camera over a long period of time is usually very large, so videos need to be preprocessed and encoded before being used and transmitted over the network. There are many recognized and widely used compression standards. One of these is H264, or MPEG-4 Part 10 [26], a compression standard recognized by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group.
2.1.1. H264 file structure
Normally, the video captured from the camera is compressed using a common video compression standard such as H261, H263, MPEG-4, H264/AVC, H265/HEVC, etc. In this thesis, I encode and decode the video using H264/AVC.
Typically, an H264 file is split into packets called Network Abstraction Layer Units (NALUs) [27], as shown in Fig. 2.1.
Figure 2.1. The structure of an H264 file
The first byte of a NALU indicates the NALU type, which shows what the NALU's structure is: it can carry a slice or a parameter set for decompression. The meanings of the NALU types are given in Table 2.1.
Table 2.1. NALU types
Type    Definition
0       Undefined
1       Slice layer without partitioning, non-IDR
2       Slice data partition A layer
3       Slice data partition B layer
4       Slice data partition C layer
5       Slice layer without partitioning, IDR
6       Additional information (SEI)
7       Sequence parameter set
8       Picture parameter set
9       Access unit delimiter
10      End of sequence
11      End of stream
12      Filler data
13..23  Reserved
24..31  Undefined
Apart from the header byte, the rest of the NALU is called the RBSP (Raw Byte Sequence Payload). The RBSP contains the data of the SODB (String Of Data Bits). According to the H264 specification (ISO/IEC 14496-10), if the SODB is empty (no bits are present), the RBSP is also empty. The first byte of the RBSP contains the first 8 bits of the SODB; each following byte of the RBSP contains up to 8 further bits of the SODB, and so on until fewer than 8 bits of the SODB remain.
Figure 2.2. RBSP structure
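As an illustration of this structure, the sketch below (my own, not code from the JM decoder) reads the nal_unit_type from the first byte of a NALU and recovers the RBSP by removing the emulation-prevention bytes that H264 inserts after each pair of zero bytes:

```cpp
#include <cstdint>
#include <vector>

// nal_unit_type is the low 5 bits of the first NALU byte
// (1 bit forbidden_zero_bit, 2 bits nal_ref_idc, 5 bits nal_unit_type).
int nalUnitType(uint8_t firstByte) { return firstByte & 0x1F; }

// Convert the bytes after the NALU header into the RBSP by dropping
// each emulation_prevention_three_byte (a 0x03 following 0x00 0x00).
std::vector<uint8_t> extractRbsp(const std::vector<uint8_t>& nalu) {
    std::vector<uint8_t> rbsp;
    int zeros = 0;
    for (size_t i = 1; i < nalu.size(); ++i) {  // skip the 1-byte NALU header
        if (zeros == 2 && nalu[i] == 0x03) {    // emulation prevention byte
            zeros = 0;
            continue;                           // do not copy it into the RBSP
        }
        zeros = (nalu[i] == 0x00) ? zeros + 1 : 0;
        rbsp.push_back(nalu[i]);
    }
    return rbsp;
}
```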
A video is normally divided into frames, and the encoder encodes them one by one. Each frame is encoded into slices, and each slice is divided into macroblocks (MBs). Typically, each frame corresponds to one slice, but sometimes a frame can be split into multiple slices. The slice categories are shown in Table 2.2. A slice consists of a header and a data section (Fig. 2.3). The header of the slice contains information about the slice type, the types of MBs in the slice, and the frame number of the slice; it also contains information about the reference frame and the quantization parameters. The data portion of the slice is the information about the macroblocks.
Table 2.2. Slice types
Type Description
0 P-slice. Consists of P-macroblocks (each macroblock is predicted using one reference frame) and/or I-macroblocks.
1 B-slice. Consists of B-macroblocks (each macroblock is predicted using one or two reference frames) and/or I-macroblocks.
2 I-slice. Contains only I-macroblocks. Each macroblock is predicted from previously coded blocks of the same slice.
3 SP-slice. Consists of P and/or I-macroblocks and lets you switch between encoded streams.
4 SI-slice. It consists of a special type of SI-macroblocks and lets you switch between encoded streams.
5 P-slice.
6 B-slice.
7 I-slice.
8 SP-slice.
9 SI-slice.
Figure 2.3. Slice structure

2.1.2. Macroblock
The basic principle of a compression standard is to split the video into groups of frames. Each frame is divided into basic processing units: in the H264/AVC standard, this unit is the macroblock (MB), a region of 16x16 pixels.
Also, in data regions carrying more detail, the MBs are subdivided into smaller sub-macroblocks (4x4 or 8x8 pixels). After compression, each MB contains the information used to recover the video later, including the motion vector, residual values, quantization parameter, etc., as in Fig. 2.4, where:
• ADDR is the position of Macroblock in a frame;
• TYPE is the Macroblock type;
• QUANT is the quantization parameter;
• VECTOR is Motion vector;
• CBP (Coded Block Pattern) shows how the MB is split into smaller blocks;
• bN is the encoded residual data of the color channels (4 Y, 1 Cr, 1 Cb).
Figure 2.4. Macroblock structure
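For illustration only, these fields can be pictured as the following hypothetical C++ struct; the real syntax elements are variable-length coded and are defined by the H264 specification:

```cpp
#include <cstdint>

// Illustrative container for the decoded macroblock fields listed above.
struct MotionVector {
    int16_t x;  // horizontal displacement (column)
    int16_t y;  // vertical displacement (row)
};

struct Macroblock {
    uint32_t     addr;      // ADDR: position of the MB in the frame
    uint8_t      type;      // TYPE: MB type (I, P, B, skipped, ...)
    uint8_t      quant;     // QUANT: quantization parameter
    MotionVector vector;    // VECTOR: motion vector (for inter-coded MBs)
    uint8_t      cbp;       // CBP: coded block pattern (sub-block layout)
    // b0..b5: encoded residual data of the color channels (4 Y, 1 Cr, 1 Cb),
    // omitted here for brevity.
    uint32_t     sizeBits;  // size of the MB after encoding, used later in this chapter
};
```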
During decompression, the video decoder receives the compressed video data as a stream of binary data, decodes the syntax elements, extracts the encoded information, including transform coefficients, MB sizes (in bits), motion prediction information, and so on, and performs the inverse transformations to restore the original image data.
2.1.3. Motion vector
With H264 compression, the macroblocks of a frame are predicted from information that has already been transferred from the encoder to the decoder. There are two kinds of prediction: intra-frame prediction and inter-frame prediction. Intra-frame prediction uses already-compressed image data in the same frame as the macroblock being compressed, while inter-frame prediction uses previously compressed frames. Inter-frame prediction is accomplished through a motion estimation and compensation process, in which the motion estimator finds the macroblock in the reference frame closest to the new macroblock and calculates the motion vector; this vector characterizes the displacement of the new macroblock to be encoded relative to the reference frame.
The referenced macroblock is sent to the subtractor together with the new macroblock to be coded to compute the prediction error, or residual signal, which characterizes the difference between the predicted macroblock and the actual macroblock. The residual signal is transformed with the Discrete Cosine Transform and quantized to reduce the number of bits to be stored or transmitted. These coefficients, together with the motion vectors, are passed to the entropy coder to form the bitstream. The binary video stream includes transform coefficients, motion prediction information, compressed data structure information, and more. To perform motion estimation, the encoder compares the values of two frames, one of which is used as a reference. When we want to compress an MB at position i of a frame, the video compression algorithm searches the reference frame for the MB with the smallest difference from the MB at position i. If such an MB is found in the reference frame at position j, the displacement between i and j is called the motion vector (MV) of the MB at position i (Fig. 2.5). Normally an MV consists of two values: x (the column displacement) and y (the row displacement).
Figure 2.5. The motion vector of a Macroblock
Note that the MV of an MB does not really describe the motion of the objects in that MB; it merely represents the displacement toward the reference pixels most similar to the pixels in the MB.
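To make the origin of an MV concrete, here is a minimal full-search block-matching sketch using the sum of absolute differences (SAD). The 16-pixel block size matches the MB size above, while the ±8 search range and the raw frame buffers are illustrative assumptions; real encoders use much faster search strategies and sub-pixel refinement.

```cpp
#include <cstdint>
#include <cstdlib>
#include <climits>

struct MV { int x, y; };

// Full-search block matching: find the displacement (within +/-range)
// that minimizes the SAD between the current 16x16 block at (bx, by)
// and a candidate block in the reference frame.
MV findMotionVector(const uint8_t* cur, const uint8_t* ref,
                    int width, int height, int bx, int by, int range = 8) {
    const int B = 16;               // macroblock size
    MV best{0, 0};
    long bestSad = LONG_MAX;
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + B > width || ry + B > height)
                continue;           // candidate block must lie inside the frame
            long sad = 0;
            for (int r = 0; r < B; ++r)
                for (int c = 0; c < B; ++c)
                    sad += std::abs(cur[(by + r) * width + bx + c] -
                                    ref[(ry + r) * width + rx + c]);
            if (sad < bestSad) { bestSad = sad; best = {dx, dy}; }
        }
    }
    return best;                    // the motion vector of the MB at (bx, by)
}
```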
2.2. Proposed method
This section describes the proposed moving object detection method. The process includes three phases: macroblock-based segmentation, object-based segmentation, and object refinement.
2.2.1. Processing the video bitstream
The video data is taken directly from the surveillance camera in the form of an H264 bitstream and transported to the processing device. To get the MV and MB information, I use the LIVE555 [28] and JM 19.0 [29] libraries. LIVE555 is a free, open-source C++ library that allows sending and receiving media streams through the RTP/RTCP, RTSP, and SIP protocols. The LIVE555 Streaming Media module is responsible for connecting, authenticating, and receiving data from the RTSP stream taken directly from the surveillance camera. In addition to receiving packets, LIVE555 Streaming Media also removes the packet headers, so the results from this module are NALUs (refer to ISO/IEC 14496-10 [26]).
The NALUs are then transferred to JM 19.0, a free H264 reference decoder commonly used in study and research, for processing. The original input of the JM 19.0 decoder module is a compressed video file in the H264 format (described in Annex B of ISO/IEC 14496-10), and the original output is the decompressed video in YUV format. However, in order to reduce the time and volume of computation, I modified this library so that it stops after extracting the required information, without fully decoding the video.
Then, the MVs and MB sizes are used to detect the moving objects. I propose a method that uses a combination of both MVs and MB sizes to determine the motion in the video. This method can be applied to both indoor and outdoor video. Because it uses data from the compressed domain, its processing time is much lower than that of methods using pixel-domain data. The moving object detection method consists of 3 phases: macroblock-based segmentation, object-based segmentation, and object refinement, as shown in Fig. 2.6.
Figure 2.6. The process of the moving object detection method

2.2.2. Macroblock-based Segmentation
This phase is based on Poppe's approach [24]. I use the storage size of each MB after encoding (in bits) to determine which MBs contain movement. This works because MBs containing moving objects are often more detailed than others; therefore, the compression ratio of these MBs is usually lower, making their sizes much higher than those of the MBs in the background. Fig. 2.8 shows an example of an outdoor frame and an indoor frame and illustrates the correlation between the motion information and the sizes of the MBs. Fig. 2.8 (a) shows the original frames (the first outdoor, the second indoor), and Fig. 2.8 (b) shows the map of the MB sizes in those frames. Each square in Fig. 2.8 (b) represents the size of one MB: the larger the size, the whiter the square. As we can see, the MB sizes are larger in the moving regions (e.g., the vehicles, the shaking leaves).
I use the MB size to classify MBs into 2 types: MBs that can belong to a moving object and MBs that can belong to the background. To do this, I compare the size of each MB with a threshold Ts. If the size of an MB is greater than Ts, I mark the MB as "can be a moving object"; otherwise, I mark it as "can be the background".
Figure 2.7. Skipped Macroblock
However, Poppe's approach [24] still has an important constraint: "A general conclusion is that MBs corresponding to (the edges of) moving objects will typically contain more bits in the bitstream than those representing BG". This means the algorithm works well only on the MBs that contain the edges of moving objects. The reason is that H264 provides a "skip mode" for some special MBs. If a region has a uniform color, such as a shirt, a wall, or a car door, the encoder does not need to send the information of the MBs in that region to the decoder. The decoder estimates a value for each skipped MB from the neighboring coded MBs and uses it to calculate a motion-compensated prediction for the skipped MB. Since there is no residual information, the motion-compensated prediction is directly inserted into the decoded frame or field. As a result, some MBs that should be considered moving objects have a size equal to zero. To solve this problem, we apply a preprocessing step that recalculates the size of every skipped MB as the average of the sizes of the MBs to its left, above it, and above-right of it (Fig. 2.7). All MBs considered to be moving objects are then merged using the 8-neighbor algorithm to yield segments before the next, object-based segmentation phase.
Figure 2.8. (a) Outdoor and indoor example frames, (b) the "size-map" of the frames, (c) the "motion-map" of the frames
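A minimal sketch of this phase is given below, assuming the MB sizes of one frame have already been extracted into a 2-D array; the boundary handling for skipped MBs is a simplification of my own, and the 8-neighbor merging into segments is omitted.

```cpp
#include <vector>

// mbSize[r][c] = encoded size (in bits) of the MB at row r, column c.
// Returns a binary map: true = "can be a moving object", false = "can be the background".
std::vector<std::vector<bool>>
segmentByMbSize(std::vector<std::vector<int>> mbSize, int Ts) {
    int rows = mbSize.size(), cols = mbSize[0].size();

    // Preprocessing: a skipped MB (size 0) gets the average size of the
    // MBs to its left, above it, and above-right of it (Fig. 2.7).
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (mbSize[r][c] == 0) {
                int sum = 0, n = 0;
                if (c > 0)                 { sum += mbSize[r][c - 1];     ++n; }
                if (r > 0)                 { sum += mbSize[r - 1][c];     ++n; }
                if (r > 0 && c + 1 < cols) { sum += mbSize[r - 1][c + 1]; ++n; }
                if (n > 0) mbSize[r][c] = sum / n;
            }

    // Thresholding against Ts.
    std::vector<std::vector<bool>> moving(rows, std::vector<bool>(cols, false));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            moving[r][c] = mbSize[r][c] > Ts;
    return moving;  // segments are then formed with 8-neighbor merging
}
```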
2.2.3. Object-based Segmentation
It is desirable that the background model can adapt to gradual changes in the appearance of the scene. For example, in an outdoor environment, or through the view of the window of a house, the light intensity typically varies during the day, and a dynamic background (rain, moving clouds, swaying tree branches, etc.) can be seen anywhere. Observations of the motion vector field have shown that the motion vectors of rigid moving objects usually have similar direction and length, while the motion vectors of uninteresting moving objects (in the following sections we consider them noise motion), such as swaying tree branches, usually have various directions and lengths. Noise motion like leaves or tree branches still yields large MB sizes, but its segments usually contain holes, whereas the movement of a human sometimes has various directions and lengths but leaves no holes in the segment. As shown in Fig. 2.9, the car, motorbike, and human (in the rectangles) are the interesting moving objects, while the waves of water and the lights (in the circles) are the uninteresting moving objects, or noise.
The level of consistency of the MV field and the density are exploited to identify the interesting motions, such as the movements of humans and vehicles, and to remove noise motion, especially swaying branches. I define a segment as having a "consistent" MV field if its MV directions and MV lengths are "consistent". The motion vector directions are "consistent" if there exist TC (90%) of the motion vectors such that the angle between any two of them stays smaller than TA (10°), and the motion vector lengths are "consistent" if there exist TC (90%) of the motion vectors such that the length difference between any two of them stays smaller than TL (20). The density of a segment relates the number of MBs in the segment to the number of MBs on its margin. The object-based segmentation process consists of two steps: checking the level of consistency of the motion vector field and checking the level of the segment's density.
Figure 2.9. Example of the "consistency" of motion vectors
For the level of consistency of the motion vectors, because the MV fields of segments have various directions and lengths with small differences, we first normalize MV directions to the angle (in degrees) between the MV and the positive X-axis (in Cartesian coordinates) and MV lengths to integer values. Specifically, an MV $(x, y)$ with direction $M_d$ and length $M_l$ is normalized as follows:

$$M_d = \begin{cases} \mathrm{round}\left(\dfrac{\arctan(y/x)}{\pi} \cdot 180\right) & \text{with } x, y \neq 0,\\ 90 & \text{with } x = 0,\ y \neq 0,\\ 0 & \text{with } x \neq 0,\ y = 0, \end{cases} \tag{1}$$

$$M_l = \mathrm{round}\left(\sqrt{x^2 + y^2}\right). \tag{2}$$

After that, Chebyshev's inequality is applied to assess the consistency of the MV field:

$$p\left(|X - \mu_A| \geq k\sigma_A\right) \leq \frac{1}{k^2} \tag{3}$$

where $X$ is a random variable that represents the direction of a motion vector, and $\mu_A$ and $\sigma_A$ are the mean and standard deviation of the distribution of motion vector directions. From equation (3), in order to confirm that the MV directions are consistent, we take $k = \frac{T_A}{2\sigma_A}$ and require $\frac{1}{k^2} \leq 1 - T_C$; therefore $\sigma_A \leq \sqrt{\frac{T_A^2(1 - T_C)}{4}}$. So, if $\sigma_A \leq \sqrt{\frac{T_A^2(1 - T_C)}{4}}$, the segment is considered to have consistent MV directions. The same condition is applied to the MV lengths: if $\sigma_L \leq \sqrt{\frac{T_L^2(1 - T_C)}{4}}$, where $\mu_L$ and $\sigma_L$ are the mean and standard deviation of the distribution of motion vector lengths, the MV lengths are consistent. However, we normalize MV directions to angles between the MV and the positive X-axis, which means that 0° and 359° are next to each other. Thus, we need to check $\sigma_A$ of the distribution when each angle $i°$ from 0° to 359° is used as the middle of the axis; if there exists any $\sigma_{A,i°}$ that satisfies $\sigma_{A,i°} \leq \sqrt{\frac{T_A^2(1 - T_C)}{4}}$, the MV directions of the segment are considered consistent.
For the level of density of a segment, we calculate the ratio between the number of MBs lying on the margin of the segment, $M_{S_i}$, and the total number of MBs in the segment, $N_{S_i}$. As discussed above, if a segment is noise (does not contain true motion), then it usually contains holes (MBs with no information), so its margin-to-area ratio is much larger than that of a true motion segment. Therefore, we can classify noise and true motion segments based on this density level (or ratio):

$$\mathrm{Density} = \frac{M_{S_i}}{N_{S_i}} \leq T_{Density}$$

Finally, a segment is considered an interesting moving object when it satisfies:

$$\left(\sigma_L \leq \sqrt{\frac{T_L^2(1 - T_C)}{4}} \ \wedge\ \sigma_A \leq \sqrt{\frac{T_A^2(1 - T_C)}{4}}\right) \ \vee\ \frac{M_{S_i}}{N_{S_i}} \leq T_{Density}. \tag{4}$$
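The following sketch summarizes the normalization (1)-(2) and the decision (4) for one segment, assuming its MVs and MB counts are already available; for brevity it checks the direction spread around a single centering only, rather than around every angle i° as described above.

```cpp
#include <cmath>
#include <vector>

struct MV { double x, y; };

const double PI = 3.14159265358979323846;

// Normalization (1) and (2).
static int mvDirection(const MV& v) {
    if (v.x == 0) return (v.y != 0) ? 90 : 0;   // (0,0) not covered by (1); treat as 0
    if (v.y == 0) return 0;
    return (int)std::lround(std::atan(v.y / v.x) / PI * 180.0);
}
static int mvLength(const MV& v) {
    return (int)std::lround(std::sqrt(v.x * v.x + v.y * v.y));
}

// Population standard deviation of a non-empty sample.
static double stddev(const std::vector<int>& xs) {
    double mean = 0, var = 0;
    for (int x : xs) mean += x;
    mean /= xs.size();
    for (int x : xs) var += (x - mean) * (x - mean);
    return std::sqrt(var / xs.size());
}

// Decision (4): consistent MV field OR dense segment.
bool isInterestingObject(const std::vector<MV>& mvs,
                         int marginMBs, int totalMBs,
                         double TA = 10.0, double TL = 20.0,
                         double TC = 0.9, double TDensity = 0.8) {
    std::vector<int> dirs, lens;
    for (const MV& v : mvs) { dirs.push_back(mvDirection(v)); lens.push_back(mvLength(v)); }

    double boundA = std::sqrt(TA * TA * (1.0 - TC) / 4.0);  // threshold on sigma_A
    double boundL = std::sqrt(TL * TL * (1.0 - TC) / 4.0);  // threshold on sigma_L
    bool consistent = stddev(dirs) <= boundA && stddev(lens) <= boundL;

    bool dense = (double)marginMBs / totalMBs <= TDensity;  // margin-to-area ratio
    return consistent || dense;
}
```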
2.2.4. Object Refinement
As discussed above, MBs containing moving objects often carry more detail than others, which means that the block sizes of moving objects or of motion noise (excepting "skip mode") are larger than those of the background. However, when a moving object contains flat regions, these regions can be predicted accurately; their block sizes therefore become small, and the macroblock-based segmentation process removes these parts of the object. In this step, we aim to recover them. Furthermore, observation has shown that this case only occurs in objects that have consistent motion. Based on an analysis of the motion directions and motion lengths, we can check the MBs around the object to see whether they are part of it. Starting from the MBs marked as a moving object, we use the breadth-first search algorithm to examine each layer around the segment from near to far and check each MB with a hypothesis test: an MB with motion direction A and motion length L is considered to belong to the moving object if

$$\left(\mu_A - \sigma_A \leq A \leq \mu_A + \sigma_A\right) \ \wedge\ \left(\mu_L - \sigma_L \leq L \leq \mu_L + \sigma_L\right). \tag{5}$$

2.3. Chapter Summarization
This chapter described some basic information about the H264 video compression standard; the details of the standard can be found in the document of the ISO/IEC Moving Picture Experts Group [26]. In the thesis, to receive the video bitstream from the camera and parse it into NALUs, I use LIVE555, an open-source, free library for processing H264 bitstreams. After that, JM 19.0 is used to handle the received video and extract the MVs and MB sizes of each frame. The MVs and MB sizes are the inputs of the moving object detection method described above.
This chapter also proposed a new moving object detection method using MB sizes and MVs. The method includes three phases. The first phase, macroblock-based segmentation, detects the "skeleton" of the movement regions by comparing MB sizes with a threshold Ts. After that, in the object-based segmentation phase, I determine which moving regions belong to interesting moving objects and which belong to noise. Finally, in the object refinement phase, the missing moving MBs of flat regions are recovered. In the next chapter, I present the experimental results and an application built using the method.
CHAPTER 3.
RESULTS
The thesis was done within the framework of the research project "Nghiên Cứu Công Nghệ Tóm Tắt Video" (Research on Video Summarization Technology), a cooperation between the University of Engineering and Technology (UET) and VP9 Vietnam. Therefore, apart from the experimental results, my team and I built an application using the proposed method. This application was handed over to and approved by VP9 Vietnam. In the application, in order to aid quick searching for the moments that contain movement in a video, we provide a suitable data structure to store the motion information. With this data structure, instead of having to search for motion over the whole frame, users can search for motion in a region of interest to get better results.
3.1. The moving object detection application
In the process of using surveillance cameras, the need to store and search for the moments at which movement happens is very important. When there is movement, the moving image area is the area of interest; the other, static regions are called the background. When the background is static (almost no change in the pixel values), motion detection can be performed simply by subtracting the current frame from a reference frame (as in the pixel-domain methods mentioned above). However, in reality, the background often changes due to noise or unwanted movements (such as camera noise, shaking leaves, or stray light). Thus, detecting motion in video frames in real time, and from there detecting and locating events in a specific segment of a long video file, is a challenge.
The problem of searching for events in large volumes of video, especially long-duration video surveillance, is a time-consuming and laborious task for users and processors. In published studies, there are several solutions for automatically detecting and locating the time in a video at which an event occurred. However, the fast and efficient search for the video segment containing an event has not been satisfactorily resolved, and the process of finding where an event occurred in a video is still limited. Therefore, the problem of analyzing and summarizing video data so that search is convenient and effective still requires better solutions. Several related results on video storage have been applied in industry.
The invention US6697523, "Method for summarizing a video using motion and color descriptors", relates to a method of extracting the motion information of a video for the purpose of automatic summarization. The method of this invention uses a partially compressed video data stream but also image information (full decompression), which consumes the computing resources of the device. A video summary can be made simply by retrieving a frame that represents a video clip, or by color analysis; this loses information and decreases the accuracy of the search results, or requires complex computation in the image domain. Furthermore, the invention does not propose an effective storage solution for the synthesized information.
The invention US5956026A, "Method for hierarchical summarization and browsing of digital video", relates to summarization and browsing by creating a simplified hierarchical representation of the video using representative frames. Each image represents a video shot, and the system must determine the scene and frame number; browsing is done through the representative frames. The invention also uses extracted audio data to compute the video summary. The invention does not offer a simplified method for storing the information of a video.
The invention US7751632B2, "Intelligent, dynamic, long-term digital surveillance media storage system", provides a method of analyzing multimedia data streams for encoding and indexing the stored data according to the real requirements of the monitoring system. In particular, the video content analysis is done based on the classification of the motion data of each frame; from there, the system chooses the optimal encoding technique for each frame and creates descriptors for efficient storage. After the analysis, the video segments chosen for optimal encoding are deleted from the original file, and only the descriptors are saved. The invention does not propose an integrated analysis of frame-to-frame motion information, does not support frame-based motion search, and does not have a hierarchical storage system for the motion information of a video.
In the following sections, I describe the application built using the proposed method.
3.1.1. The process of the application
The process of the application is shown in Fig. 3.1. As mentioned above, the video data is taken directly from the surveillance camera in the form of an H264 bitstream; basically, this is a real-time H264 file. The LIVE555 and JM 19.0 libraries are used to implement step (1), entropy decoding. LIVE555 is a free, open-source C++ library that allows sending and receiving media streams through the RTP/RTCP, RTSP, and SIP protocols. The LIVE555 Streaming Media module is responsible for connecting, authenticating, and receiving data from the RTSP stream taken directly from the surveillance camera.
In addition to receiving packets, LIVE555 Streaming Media also removes the packet headers, so the results from this module are NALUs (refer to ISO/IEC 14496-10). The NALUs are then transferred to JM 19.0, a free H264 reference decoder commonly used in study and research, for processing. The original input of the JM 19.0 decoder module is a compressed video file in the H264 format (described in Annex B of ISO/IEC 14496-10), and the original output is the decoded video in YUV format. However, in order to reduce the time and volume of computation, I modified this library so that it stops after extracting the required information, without fully decoding the video.
This information is then used to perform process (2), moving object detection, using the method proposed in Chapter 2. The result of this process is a matrix that describes the positions of motion in each frame, called the movement map. A position with motion has the value 1; otherwise, the value is 0.
Figure 3.1. The implementation process of the approach
The information in the movement map is used to perform (3), synthesizing movement. This process evaluates and classifies motion into varying degrees, depending on the frequency and appearance of the motion, to obtain the motion information.
The motion description information obtained from the above steps is then reshaped and stored in a data structure convenient for later retrieval and use, in (4), storing movement information.
The details of steps (3) and (4) are described below.
3.1.2. The motion information
The motion information in the thesis is understood as a value representing the level of motion of the objects in the video. To obtain information describing the motion, we first classify the motion in the video into real motion (caused by objects such as human beings, vehicles, etc.) and motion due to noise. The types of noise that can be observed are:
• Noise due to camera shake: this noise is characterized by large motion over the entire frame, with a cycle.
• Noise due to camera quality: this is caused by low light intensity; the noise is usually small and acyclic, but fairly evenly distributed.
• Noise due to light: blinking lights (cyclic noise), tube lights, etc. These types of noise are cyclic, large, and hard to determine.
• Noise due to weather factors such as rain, clouds, etc.
Real motion can be divided into two types: normal movement and meaningful movement. What counts as normal or meaningful depends on the circumstances of the video. For example, with home video, shaking curtains cause visible movement, but the meaningful movement is human movement in the scene; with motion on the road, the types of motion are more difficult to define. In general, we can distinguish the following types of motion:
• Movement of cyclically moving equipment (such as rotor blades or rotating wheels).
• Motion caused by the wind (leaves, curtain fabric). These movements are usually large and can be cyclic.
• Movements of external light such as sunshine or lamps (motorcycle and automobile headlights from afar). These movements are often difficult to determine; they usually appear in night-time video.
• Lastly, real motion such as people or vehicles moving in the observation area.
3.1.3. Synthesizing movement information
The synthesis and classification of motion begins with calculating a motion weight for each position in the frame (each position corresponds to one MB) over a time interval T. For each position, we weight the motion of the MB at each moment (frame) during T as follows:
• If the MB is moving at the moment under review, the weight of motion at that moment equals the count of the immediately preceding consecutive moments of motion.
• Otherwise, if the MB has no motion at the moment under review, the weight is zero.
Then, the moving weight of each position in the synthesized frame after the time T equals the sum of the weights over all moments in the period T.
After calculating the moving weight, we evaluate the motion level to classify the motion of each position in the synthesized frame after the time T, based on the weight calculated in the previous step. The level of motion is divided into four levels denoted by two binary digits: no movement (00), little movement or noise (01), movement (10), and much movement (11). The movement level values are then saved into a two-dimensional array.
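A sketch of this weighting for a single MB position follows, assuming motion[t] holds that position's movement-map bit at frame t over the interval T; the level cut-offs t1 and t2 are illustrative parameters, since the exact boundaries between the four levels are not fixed above.

```cpp
#include <vector>

// motion[t] == true if this MB position is moving at frame t (from the movement map).
// The weight at a moving frame is the number of immediately preceding
// consecutive moving frames; the weight of a non-moving frame is zero.
int movementWeight(const std::vector<bool>& motion) {
    int run = 0, total = 0;
    for (bool moving : motion) {
        if (moving) { total += run; ++run; }   // weight = preceding consecutive motion
        else        { run = 0; }               // no motion: weight 0, run resets
    }
    return total;                              // summed over the whole interval T
}

// Map a summed weight to a two-bit motion level: 00, 01, 10, 11.
// The cut-offs t1 < t2 are illustrative parameters.
int motionLevel(int weight, int t1, int t2) {
    if (weight == 0) return 0b00;   // no movement
    if (weight < t1) return 0b01;   // little movement or noise
    if (weight < t2) return 0b10;   // movement
    return 0b11;                    // much movement
}
```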
Figure 3.2. Data structure for storing motion information
3.1.4. Storing Movement Information
This step stores the movement information obtained from the synthesis step described above. The movement information is stored according to the spatial and temporal hierarchy of the video. The structure that stores the motion description information is depicted in Fig. 3.2.
Where:
• Level 1 is a folder that contains the aggregated data of each video, organized period by period.
• Level 2 is the folder that contains the files holding the information data along the horizontal dimension of the frame in the temporal dimension.
• Level 3 is the set of files that contain the movement information data of the blocks in the columns of the frame in the temporal dimension.
• Level 4 is the contents of the files in level 3. These files contain two-bit values from 0 to 3: the value is the level of motion of a block in a time interval T (which may be 1 second, 2 seconds, 3 seconds, 10 seconds, etc.). The user can modify T through a parameter.
The advantage of this data structure is that when you want to search for the moments at which movement happens, you can choose an area (corresponding to some MBs). In that case, the search time is shorter, because the application searches only the files corresponding to the chosen MBs. Moreover, predefining the search region (region of interest) makes the results more accurate than searching over the full frame.
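As an illustration of level 4, the sketch below packs four two-bit motion levels into each byte of a level-3 column file, so one byte covers four consecutive intervals T; the file layout and naming scheme are hypothetical, since the thesis does not fix them.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Pack two-bit motion levels (0..3), one per time interval T, four per byte.
std::vector<uint8_t> packLevels(const std::vector<int>& levels) {
    std::vector<uint8_t> bytes((levels.size() + 3) / 4, 0);
    for (size_t i = 0; i < levels.size(); ++i)
        bytes[i / 4] |= (levels[i] & 0x3) << ((i % 4) * 2);
    return bytes;
}

// Write one level-3 file: the motion levels of the blocks in one column
// of the frame over time. The path scheme is hypothetical.
void writeColumnFile(const std::string& videoDir, int column,
                     const std::vector<int>& levels) {
    std::ofstream out(videoDir + "/col_" + std::to_string(column) + ".bin",
                      std::ios::binary);
    auto bytes = packLevels(levels);
    out.write(reinterpret_cast<const char*>(bytes.data()), bytes.size());
}
```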
3.2. Experiments
3.2.1. Dataset
The proposed method is designed to operate with a fixed, downward-facing camera. The maximum resolution of the videos is 1920x1080 pixels. The program can be installed directly on a device attached to the camera, such as a Raspberry Pi running the Linux operating system, while guaranteeing real-time processing.
The experimental data was provided by the VP9 Vietnam company and processed by the HMI laboratory, University of Engineering and Technology. The data set includes 43 videos with resolutions of 1280x720 and 1920x1080. In addition, the method was run on live data from more than 100 cameras installed in Hanoi and Da Nang, provided by VP9, including both indoor and outdoor data. The videos cover various lighting and environmental conditions, including outdoor light (sunlight, low sun), artificial light (tube, LED), wind, rain, etc. It can be said that the data set supplies a wide range of situations and environments for the moving object detection problem.
Figure 3.3. Example frames of test videos
For gathering the statistics for the report, I made the ground truth for 7 videos with resolutions of 1280x720 and 1920x1080 and used these videos for the experiments. Table 3.1 describes the information about the videos used for the experimental results. Fig. 3.3 shows some example frames of the test videos (Fig. 3.3a is a frame of TrongNha_02, Fig. 3.3b of DNG8_1708, Fig. 3.3c of NEM1_131, Fig. 3.3d of HMI_WetRoad, Fig. 3.3e of CuaHang_01, and Fig. 3.3f of HMI_OutDoor). These videos were captured in different environments and circumstances. Fig. 3.4 depicts some of their frames and the corresponding ground truth.
Table 3.1. The information of test videos
Video          Resolution    Place
HMI_WetRoad    1920 × 1080   Outdoor
HMI_OutDoor    1280 × 720    Outdoor
GVO2_0308      1280 × 720    Outdoor
NEM1_131       1920 × 1080   Indoor
DNG8_1708      1920 × 1080   Outdoor
CuaHang_01     1280 × 720    Indoor
TrongNha_02    1280 × 720    Indoor
In addition, to compare with the approach of Poppe [24], on which the macroblock-based segmentation phase is based, we use a second dataset from the IEEE Change Detection Workshop 2014 [30]. The experiments are thus carried out on 2 datasets comprising 11 test sequences, divided into 2 groups. The first group consists of 4 test sequences, PETS2006, Pedestrians, Highway, and Office, from the baseline profile of the IEEE Change Detection Workshop 2014. Both the video frames and the motion ground truth can be downloaded from the Changedetection homepage. We use ffmpeg [31] to create compressed video from the given frames, with all encoding parameters set to their defaults. Fig. 3.5 shows an example frame of the Pedestrians test sequence (a) and its motion ground truth (b). Table 3.2 shows the information of the four videos: the 1st column is the name of the video, and the next three columns are the resolution, frame rate, and quantization parameter (qp) of each video. As we can see, the videos in the 1st group have different resolutions, but they are all low-resolution videos. The frame rate of the videos is 25 fps, and the qp value depends on the video. These videos are quite similar to the videos in Poppe's experiments.
Figure 3.4. Example frames and their ground truth

Table 3.2. The information of test sequences in group 1

Video        Resolution   fps   qp
Pedestrians  360 × 240    25    25
PETS2006     720 × 576    25    27
Highway      320 × 240    25    23
Office       360 × 240    25    23
The videos in the 2nd group are the 7 videos mentioned above. These videos come from actual indoor and outdoor surveillance cameras, without scripting or prior arrangement. The motion ground truth was made by ourselves by investigating the videos frame by frame. They are all high-spatial-resolution videos.
Figure 3.5. An example frame of Pedestrians (a) and ground truth image (b)

3.2.2. Evaluation methods
The efficiency of the method is evaluated by the precision, the recall, and the F1 score. The precision is calculated by:

$$\mathrm{Precision} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalsePositive}}$$

the recall is calculated by:

$$\mathrm{Recall} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalseNegative}}$$

and the F1 score is calculated by:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where:
• TruePositive: the total number of macroblocks correctly detected as a moving object
• FalsePositive: the total number of macroblocks that are background but are detected as a moving object
• FalseNegative: the total number of macroblocks that are a moving object but are not detected
High precision means that the accuracy of the method is good. High recall means that the percentage of missed moving objects is low. A perfect system would have both precision and recall at 100%; however, this is impossible in practice. Normally, tuning the system to prioritize precision reduces recall, and vice versa. In that case, we can use the F1 score, which balances precision and recall.
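For completeness, a small sketch computing the three measures from the macroblock counts defined above (it assumes the counts are non-zero):

```cpp
struct Metrics { double precision, recall, f1; };

// tp, fp, fn are macroblock counts accumulated over all test frames.
Metrics evaluate(long tp, long fp, long fn) {
    double precision = (double)tp / (tp + fp);
    double recall    = (double)tp / (tp + fn);
    double f1        = 2.0 * precision * recall / (precision + recall);
    return {precision, recall, f1};
}
```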
3.2.3. Implementations
The proposed method in this thesis is implemented in C++. Our experiments were done on a Windows PC with an Intel Core i5-3337U at 1.8 GHz and 8 GB of RAM.
Based on observation, we have seen that Ts should be chosen empirically for each test video. The other parameters are set to TC = 90%, TA = 10°, TL = 20, and TDensity = 80%.
3.2.4. Experimental results
The experiments on the videos in the 1st group were run many times and the best results were selected. Table 3.3 compares the experimental results of the 2 approaches on these videos. With the proposed method, the average precision over the four videos is 80%, the average recall is 84%, and the F1 score is 81.95. With Poppe's method, the average precision is 81%, the average recall is 83%, and the F1 score is 81.99. We can see that the performance of our method is equivalent to that of Poppe's method when applied to low-resolution videos.
Table 3.3. The performance of the two approaches on Pedestrians, PETS2006, Highway, and Office

                     Our approach                       Poppe’s approach
Video        Precision (%)  Recall (%)  F1        Precision (%)  Recall (%)  F1
pedestrians  84             95          89.16201  80             90          84.70588
PETS2006     87             80          83.35329  88             78          82.69880
Highway      77             81          78.94937  78             80          78.98734
Office       72             82          76.67532  75             83          78.79747
Average      80             84          81.95122  81             83          81.98780
For the 2nd video group, the high-resolution videos, the proposed method was run many times with different Ts parameters, and the 4 best results were selected. Table 3.4 shows the experimental results of Poppe's approach, and Table 3.5 shows the experimental results of the proposed method on these videos. The results show that the recall values of Poppe's approach are usually smaller than those of the proposed method, meaning that Poppe's approach misses more moving objects than the proposed method. This happens because there are many "skip mode" MBs in a frame of a high-resolution video.
Table 3.4. The experimental results of Poppe’s approach on the 2nd group

Video        Precision  Recall  F1
HMI_WetRoad  0.4954     0.8943  0.6376
HMI_OutDoor  0.5145     0.7711  0.6172
GVO2_0308    0.6821     0.6016  0.6393
NEM1_131     0.6055     0.7602  0.6741
DNG8_1708    0.8777     0.7489  0.8082
CuaHang_01   0.7468     0.8339  0.7880
TrongNha_02  0.8341     0.7247  0.7756
In addition, the experimental results in Table 3.5 show that the videos with good results are those with less noise and a clear distinction between the background and the moving objects, and the results do not depend on whether the videos were captured by outdoor or indoor cameras. As shown in the results table, the best result is on the TrongNha_02 video (Fig. 3.3a), with an F1 score of 0.8771. This is a video obtained in a working room (namely a police station), with good environmental conditions and low noise. The moving object is a person who clearly