
978-1-5386-7167-2/18/$31.00 ©2018 IEEE

Convolutional Neural Network for Pornographic Images Classification

I Made Artha Agastya1,2, Arief Setyanto2, Kusrini2, Dini Oktarina Dwi Handayani1

1School of Computing and IT, Taylor’s University, Subang Jaya, Malaysia

2AMIKOM Yogyakarta, Yogyakarta, Indonesia


Abstract—Pornographic content that spreads across the internet can harm underage users. Consequently, pornographic image detection has become a popular research topic. In previous studies on image content detection, features were selected by an expert, and in some cases these hand-picked features fail to capture the attributes that matter. Convolutional Neural Networks (CNNs) are among the deep learning approaches that have been applied to tackle this problem. In this research, a modification of the last layer is proposed to fit pornographic image detection. In our experiments, the pornographic image classification accuracy reached 93.8%.

Keywords—CNN, Pornographic, Image Detection, ConvNet

I. INTRODUCTION

Negative content such as pornography [1] spreading over the internet has increased significantly in the past ten years. This rapid rise of indecent images and videos is driven in part by people sharing their lives online without considering that the material may be regarded as pornography. The distribution of pornographic content on the internet has disturbed social life in general. To illustrate, many underage children can access obscene images or videos without parental control. Exposure to such content can harm their mental well-being and moral values.

To filter out pornographic content, earlier work proposed detecting skin in images [2]. Although the reported accuracy of this method is high, skin detection alone is not suitable for detecting pornographic content: images taken at the beach or a swimming pool can be falsely detected as pornographic, because the people in them expose a similar amount of skin to nude content. The classifier therefore produces false positives in such cases. Another method mimics a text-based classification approach, the Bag of Words (BoW) model. It was reported that this approach outperforms methods that rely on skin color. In addition, combining the skin-based and BoW-inspired methods does not produce an improvement.

The convolutional neural network (CNN) [3], one of the methods in deep learning, has shown superiority in image classification. CNNs can beat human classification accuracy in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Although CNNs excel at image classification in general, few researchers have used them to filter pornographic content.

The aim of this study is to apply a CNN to classify pornographic images extracted from 800 videos, a total of 16,727 keyframe images. We slightly modified the final layer of the CNN to match our two-class problem. The result was satisfactory: the classifier had an average error rate of 6.2%.

II. RELATED WORK

There are many definitions of pornography [4], and the choice depends on the focus of the research. The definition is fundamental because it determines which extracted features are effective. For example, pornography can be defined as exposed genitals or a naked man or woman. Other definitions, however, require some erotic or intimate act for content to be considered pornographic. M. B. Short et al. (2012) define pornography as “any sexually explicit material with the aim of sexual arousal or fantasy” [5]. This is the definition adopted by many researchers [6].

Before image-based methods came into use, the first approach to recognizing pornographic content [7] was to detect particular sexual keywords or terms, such as sex, porn, nude, and hardcore. Strings in the content are matched against a list of keywords considered pornographic. This string-matching method cannot filter image-based pornography.

The fundamental image-based methods [8], [9] detect skin color to distinguish nude images from normal images. The skin is detected automatically using color matching. More advanced pixel-based approaches [10], [11], [12], [13], [14] search for a Region of Interest (ROI) using skin detection and then extract features from the ROI, such as color moments, histogram, energy, and contrast. The features are then fed to Support Vector Machine (SVM) or Neural Network (NN) based classifiers to predict the label. However, the essential features are hard for human experts to identify, so a model built this way does not transfer well to a different pornographic dataset.

Another pornographic image detection approach [15] uses a BoW model based on SIFT descriptors. A similar technique is proposed in [16], which performs description with Hue-SIFT, dimensionality reduction using Principal Component Analysis (PCA), and histogram quantization using K-Means. The procedure finishes with an SVM classifier. It is stated that the performance is better than that of previous methods. The main drawback of this approach is feature redundancy when the image has a dominant background, such as a panorama image.

As mentioned previously, hand-engineered features have been used extensively in image-based pornographic detection. Researchers struggle to develop an effective method because each image category has different significant attributes. The difficulty grows as the size of the region of interest or the variety of the images increases.

Current pornographic image detection methods use deep learning. There are two basic types of deep learning approach, namely convolutional and recurrent. The convolutional method is commonly used in computer vision; for example, a CNN [17] can surpass humans on image classification tasks. The recurrent method, on the other hand, is used for sequential data such as time series and 1D signals (e.g., voice). In [18], the CNN is altered to model the fusion of sensitive objects in the images: local and global context are considered and fused in a hierarchical deep CNN. Another CNN method [19] combines AlexNet [3] and GoogLeNet [20]. It is reported that this approach provides better results than other state-of-the-art pornographic image detection methods.

III. METHOD

A. NPDI dataset

NPDI is the most popular pornographic dataset. It is available in a public repository and contains almost 80 hours of video. The videos consist of 400 non-pornographic videos and 400 pornographic videos. The non-pornographic videos are of two types: easy-to-classify and difficult-to-classify. The difficult type was collected by searching the web with keywords such as wrestling, swimming, and beach. The pornographic videos were collected from a pornographic website. The keyframes were extracted using STOIK Video Converter, an industry-standard segmentation software. A total of 16,727 keyframes were obtained. The average numbers of keyframes per video for porn, non-porn ("easy"), and non-porn ("difficult") videos are 15.6, 33.8, and 17.5, respectively. The dataset covers several ethnicities: Asian, Black, White, and multi-ethnic. The NPDI dataset can be obtained from [21].

B. VGG-16

One of the CNN architectures is VGG-16 [22]. This architecture was developed by the Visual Geometry Group (VGG) at the University of Oxford. The VGG-16 model consists of 16 weight layers, as shown in Fig. 1. The model was trained on the ImageNet database, which includes ~1.2 million training images, with another 50,000 images for validation and 100,000 images for testing. The predictions cover 1000 class labels. VGG-16 has shown excellent performance across different image classification and image recognition datasets. VGG-16 can be implemented in the Keras framework, a high-level library that supports the TensorFlow and Theano backends.

The VGG-16 model is formed from 13 convolutional layers and three dense (fully connected) layers. Each of the 16 layers contains weights, and activations are passed layer by layer up to the prediction dense layer. VGG-16 has 138,357,544 parameters. The convolutional layers use a 3x3 window, and the pooling layers use a 2x2 window. In the prediction dense layer (the final layer), the softmax function is applied. In each hidden layer, the rectified linear unit (ReLU) is used as the activation function. The fully connected layers take the bottleneck features from the last activation layer of the convolutional part. The bottleneck features can be used as pre-trained weights to reduce training time. Transfer learning occurs in the fully connected layers.

Fig. 1. The schematic of the VGG-16: Input Image → 2 × Convolutional 2D → Max Pooling 2D → 2 × Convolutional 2D → Max Pooling 2D → 3 × Convolutional 2D → Max Pooling 2D → 3 × Convolutional 2D → Max Pooling 2D → 3 × Convolutional 2D → Max Pooling 2D → Flatten → Dense → Dense → Prediction Dense.
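The layer layout described above can be checked with a short parameter count. The sketch below is only illustrative: the 224x224 RGB input size, the per-block channel widths, and the 4096-unit hidden dense layers are assumptions taken from the standard ImageNet configuration of VGG-16 and are not stated in this paper. Under those assumptions the standard 1000-class network comes to 138,357,544 parameters.

```python
# Channel widths of the 13 convolutional layers, grouped into five blocks;
# a 2x2 max-pooling layer follows each block.
# Assumption: standard ImageNet VGG-16 widths, not given in the text.
CONV_BLOCKS = [[64, 64], [128, 128], [256, 256, 256],
               [512, 512, 512], [512, 512, 512]]
INPUT_SIZE = 224            # assumption: standard ImageNet input resolution
INPUT_CHANNELS = 3          # RGB input
DENSE_UNITS = [4096, 4096]  # assumption: widths of the two hidden dense layers


def conv_base_params():
    """Count weights and biases of all 3x3 convolutional layers."""
    total, in_ch = 0, INPUT_CHANNELS
    for block in CONV_BLOCKS:
        for out_ch in block:
            total += (3 * 3 * in_ch + 1) * out_ch  # kernel weights + one bias per filter
            in_ch = out_ch
    return total


def dense_params(num_classes):
    """Count weights and biases of the dense layers, prediction layer included."""
    side = INPUT_SIZE // 2 ** len(CONV_BLOCKS)    # five 2x2 poolings: 224 -> 7
    in_units = side * side * CONV_BLOCKS[-1][-1]  # size of the flattened feature map
    total = 0
    for out_units in DENSE_UNITS + [num_classes]:
        total += (in_units + 1) * out_units
        in_units = out_units
    return total


total_1000 = conv_base_params() + dense_params(1000)
print("convolutional layers:", conv_base_params())
print("dense layers:        ", dense_params(1000))
print("total:               ", total_1000)
```

Most of the parameters sit in the dense layers, which is consistent with training only that part of the network while the convolutional layers are reused.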

In our experiment, we made the following modifications to the VGG-16 model:

1. Remove the prediction dense layer, which uses softmax over 1000 classes.

2. Substitute it with a two-class softmax prediction dense layer.

3. Train only the fully connected layers of the modified VGG-16; all layers before the fully connected layers are frozen.

4. Use the Adam [23] learning algorithm to increase the efficiency of training.

5. Set the Adam learning rate to 0.0001 and use categorical cross-entropy as the loss function.
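To make the last two items concrete, the following minimal sketch computes a two-class softmax, the categorical cross-entropy loss, and a single Adam update with the stated learning rate of 0.0001. It is plain Python rather than the Keras code used in the experiments. The Adam constants other than the learning rate are assumed to be the defaults from [23], and the logits, the one-hot label, and the class order are hypothetical values chosen for illustration.

```python
import math

LEARNING_RATE = 1e-4                  # Adam learning rate stated in the text
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8  # assumption: default constants from [23]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


def adam_step(param, grad, m, v, t):
    """One Adam update of a single scalar parameter at time step t (t >= 1)."""
    m = BETA1 * m + (1 - BETA1) * grad         # first-moment estimate
    v = BETA2 * v + (1 - BETA2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)               # bias correction
    v_hat = v / (1 - BETA2 ** t)
    param -= LEARNING_RATE * m_hat / (math.sqrt(v_hat) + EPS)
    return param, m, v


# Hypothetical example: the two biases of the prediction layer start at zero and
# the incoming features are zero, so the logits equal the biases.
biases = [0.0, 0.0]
target = [0.0, 1.0]  # hypothetical one-hot label; the class order is an assumption
probs = softmax(biases)
loss = categorical_cross_entropy(target, probs)

# For softmax followed by cross-entropy, d(loss)/d(logit_i) = probs_i - target_i.
grads = [p - t for p, t in zip(probs, target)]
new_bias, m, v = adam_step(biases[1], grads[1], m=0.0, v=0.0, t=1)
print(f"loss={loss:.4f}, updated bias={new_bias:.6f}")
```

On the first step Adam moves the parameter by roughly the learning rate regardless of the gradient magnitude, which is why a small rate such as 0.0001 gives gradual changes to the dense-layer weights.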

IV. RESULT AND DISCUSSION

The experiment was conducted on individual image classification only. The dataset was split into 5 folds, following the original paper [24], which was the first to use the NPDI dataset. In each of five scenarios, one fold was held out as testing data and the remaining four were used as training data, so that every fold served as the testing set once. This procedure is called 5-fold cross-validation. The average over all folds is taken as the final testing accuracy.
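The evaluation protocol can be sketched as follows. This is an illustration only: the round-robin fold assignment stands in for the fold lists that accompany the dataset, and `k_fold_splits`, `cross_validate`, and the `evaluate` callback are hypothetical names, not part of any library.

```python
def k_fold_splits(num_items, k=5):
    """Yield (test_fold_number, train_indices, test_indices) for each of k folds."""
    # Assumption: item i goes to fold i % k; the real folds come from the dataset.
    folds = [list(range(f, num_items, k)) for f in range(k)]
    for test_fold in range(k):
        train = [i for f in range(k) if f != test_fold for i in folds[f]]
        yield test_fold + 1, train, folds[test_fold]


def cross_validate(num_items, evaluate, k=5):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test)
              for _, train, test in k_fold_splits(num_items, k)]
    return sum(scores) / k, scores


# Show which items are held out in each scenario for a toy set of 10 items.
for fold_number, train, test in k_fold_splits(10):
    print(f"testing fold {fold_number}: test={test}, train size={len(train)}")
```

In an actual run, `evaluate` would train the model on the training indices and return the accuracy measured on the held-out fold.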

In Fig. 2, the training accuracy starts below 0.82 and rises sharply up to epoch 5. The testing accuracy has a head start at 0.85. It fluctuates but shows an overall upward trend; compared with testing, the training accuracy is more stable. Both training and testing accuracy reach a similar peak of 0.951. Fig. 3 shows a similar tendency to Fig. 2. The difference is in the final accuracy: the testing accuracy is slightly higher than the training accuracy. Testing reaches 95.4% accuracy, and training reaches 95.2%.

Fig. 4 and Fig. 5 show contrasting behavior. The 3rd fold yields lower testing accuracy than training accuracy, whereas the 4th fold yields higher testing accuracy than training accuracy. All accuracies (i.e., training and testing) are above 0.94. The peak is obtained when testing on the 4th fold, at 96.1% accuracy.

The curve in Fig. 6 is interesting because the testing accuracy does not improve as the number of epochs increases. This happens in the 5th fold. This unusual behavior is related to the composition of the testing data: the non-pornographic images are dominated by easy-to-classify images. The unbalanced distribution of non-porn images has a negative influence on the testing results.

Fig. 2. Training: 2nd fold, 3rd fold, 4th fold, 5th fold, and Testing: 1st fold

Fig. 3. Training: 1st fold, 3rd fold, 4th fold, 5th fold, and Testing: 2nd fold

Fig. 4. Training: 1st fold, 2nd fold, 4th fold, 5th fold, and Testing: 3rd fold

An overview of the experimental results is shown in Table I. The average training accuracy is higher than the average testing accuracy. This is expected because the model developed during training is more familiar with the training data. The standard deviation of the training accuracy is smaller than that of the testing accuracy, which means that the training outcome is more consistent across folds than the testing outcome.

Compared with the experiment in previous work [19], our method achieves better precision, as shown in Table II. Moustafa uses a combination of AlexNet and GoogLeNet, called AGNet. The scores from AlexNet and GoogLeNet are combined, and the class is determined using a specific threshold.

TABLE I. PORNOGRAPHIC CLASSIFICATION ACCURACY

Testing Fold    Training Accuracy    Testing Accuracy
1st fold        0.951                0.951
2nd fold        0.952                0.954
3rd fold        0.958                0.949
4th fold        0.954                0.961
5th fold        0.954                0.876
Average         0.954                0.938
Std. Dev.       0.002                0.031
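The summary rows of Table I can be reproduced from the per-fold values. The short check below uses the population standard deviation, which matches the reported 0.002 and 0.031; the sample standard deviation would give about 0.035 for testing. Which estimator was used is not stated in the text, so this is an inference from the numbers.

```python
from statistics import mean, pstdev

# Per-fold accuracies copied from Table I (1st to 5th fold).
train_acc = [0.951, 0.952, 0.958, 0.954, 0.954]
test_acc = [0.951, 0.954, 0.949, 0.961, 0.876]

for name, acc in (("training", train_acc), ("testing", test_acc)):
    # pstdev is the population standard deviation (divides by n, not n - 1).
    print(f"{name}: mean={mean(acc):.3f}, std={pstdev(acc):.3f}")
```

The testing standard deviation is dominated by the 5th fold, whose accuracy of 0.876 lies well below the other four.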

TABLE II. PORNOGRAPHIC CLASSIFICATION COMPARISON

Description          Moustafa [19]    Proposed Method
Average Precision    0.910            0.938

V. CONCLUSION

To wrap up our experiment, the CNN, specifically the VGG-16 network, has shown strong performance, as supported by the average testing accuracy of 0.938. Compared with the state-of-the-art method, our proposed method has better precision, by about 2.8 percentage points (0.938 versus 0.910). It is therefore a candidate for image filtering at government proxies such as "Internet Positif" in Indonesia. To improve the accuracy, we could change the learning rate or learning algorithm, change the structure of the fully connected layers, or combine VGG-16 with another deep learning variant such as a Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM). In the future, pornographic content should be greatly reduced as deep learning technology matures.

ACKNOWLEDGMENT

We gratefully acknowledge the support of Google for providing free access to a Tesla K80 GPU via Google Colaboratory. Without this hardware, the research could not have been done because of the computational cost.

Fig. 5. Training: 1st fold, 2nd fold, 3rd fold, 5th fold, and Testing: 4th fold

Fig. 6. Training: 1st fold, 2nd fold, 3rd fold, 4th fold, and Testing: 5th fold

REFERENCES

[1] F. Nian, T. Li, Y. Wang, M. Xu, and J. Wu, “Pornographic image detection utilizing deep convolutional neural networks,” Neurocomputing, vol. 210, pp. 283–293, 2016.

[2] S. Nuraisha, F. I. Pratama, A. Budianita, and M. A. Soeleman, “Histogram at Image Recognition for Pornography Detection,” in 2017 International Seminar on Application for Technology of Information and Communication (iSemantic), 2017, pp. 5–10.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.

[4] M. L. Perez, “Video pornography detection through deep learning techniques and motion information,” 2016.

[5] M. B. Short, L. Black, A. H. Smith, C. T. Wetterneck, and D. E. Wells, “A Review of Internet Pornography Use Research: Methodology and Content from the Past 10 Years,” Cyberpsychology, Behav. Soc. Netw., vol. 15, no. 1, pp. 13–23, 2012.

[6] C. Caetano, S. Avila, W. R. Schwartz, S. J. F. Guimarães, and A. de A. Araújo, “A mid-level video representation based on binary descriptors: A case study for pornography detection,” Neurocomputing, vol. 213, pp. 102–114, 2016.

[7] W. Hu, O. Wu, Z. Chen, Z. Fu, and S. Maybank, “Recognition of pornographic Web pages by classifying texts and images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1019–1034, 2007.

[8] L. Duan, G. Cui, W. Gao, and H. Zhang, “Adult Image Detection Method Base-on Skin Color Model and Support Vector Machine,” Comput. Vision, 5th Asian Conf. on. ACCV 2002. Proceedings., no. January, pp. 1–4, 2002.

[9] C. Y. Jeong, J. S. Kim, and K. S. Hong, “Appearance-based nude image detection,” Proc. - Int. Conf. Pattern Recognit., vol. 4, pp. 467–470, 2004.

[10] J. S. Lee, Y. M. Kuo, P. C. Chung, and E. L. Chen, “Naked image detection based on adaptive and extensible skin color model,” Pattern Recognit., vol. 40, no. 8, pp. 2261–2270, 2007.

[11] H. A. Rowley, “Large Scale Image-Based Adult-Content Filtering,” Proc. First Int. Conf. Comput. Vis. Theory Appl., pp. 290–296, 2006.

[12] A. N. Ghomsheh and A. Talebpour, “A new skin detection approach for adult image identification,” Res. J. Appl. Sci. Eng. Technol., vol. 4, no. 21, pp. 4535–4545, 2012.

[13] H. Yin, X. Xu, and L. Ye, “Big skin regions detection for adult image identification,” Proc. - Work. Digit. Media Digit. Content Manag. DMDCM 2011, pp. 242–247, 2011.

[14] H. Zhu, S. Zhou, J. Wang, and Z. Yin, “An algorithm of pornographic image detection,” Proc. 4th Int. Conf. Image Graph. ICIG 2007, pp. 801–804, 2007.

[15] T. Deselaers, L. Pimenidis, and H. Ney, “Bag-of-visual-words models for adult image classification and filtering,” 19th Int. Conf. Pattern Recognit. 2008, pp. 1–4, 2008.

[16] A. P. B. Lopes, S. E. F. De Avila, A. N. A. Peixoto, R. S. Oliveira, and A. De A. Araújo, “A bag-of-features approach based on Hue-SIFT descriptor for nude detection,” Eur. Signal Process. Conf., no. Eusipco, pp. 1552–1556, 2009.

[17] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2015 Inter, pp. 1026–1034, 2015.

[18] X. Ou, H. Ling, H. Yu, P. Li, F. Zou, and S. Liu, “Adult Image and Video Recognition by a Deep Multicontext Network and Fine-to-Coarse Strategy,” ACM Trans. Intell. Syst. Technol., vol. 8, no. 5, pp. 1–25, 2017.

[19] M. Moustafa, “Applying deep learning to classify pornographic images and videos,” Comput. Vis. Pattern Recognit., 2015.

[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 07–12–June, pp. 1–9, 2015.

[21] S. Avila, E. Valle, and A. de A. Araújo, “NPDI Porn Dataset,” the Institute of Computing at UNICAMP, 2018.

[22] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” Int. Conf. Learn. Represent., pp. 1–14, 2015.

[23] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 3rd Int. Conf. Learn. Represent., pp. 1–15, 2015.

[24] S. Avila, N. Thome, M. Cord, E. Valle, and A. De A. Araújo, “Pooling in image representation: The visual codeword point of view,” Comput. Vis. Image Underst., vol. 117, no. 5, pp. 453–465, 2013.
