An architectural framework for information integration using machine learning approaches for smart city security profiling


International Journal of Distributed Sensor Networks

2020, Vol. 16(10) © The Author(s) 2020 DOI: 10.1177/1550147720965473 journals.sagepub.com/home/dsn


Adnan Abid¹, Ansar Abbas¹, Adel Khelifi², Muhammad Shoaib Farooq¹, Razi Iqbal³ and Uzma Farooq¹

Abstract

In the past few decades, the whole world has been badly affected by terrorism and other law-and-order situations. Newspapers have been covering terrorism and other law-and-order issues in relevant detail. However, to the best of our knowledge, there is no existing information system capable of accumulating and analyzing these events to help devise strategies to avoid and minimize such incidents in the future. This research aims to provide a generic architectural framework to semi-automatically accumulate law-and-order-related news from different news portals and classify it using machine learning approaches. The proposed architectural framework discusses all the important components, which include data ingestion, preprocessor, reporting and visualization, and pattern recognition. The information extractor and news classifier have been implemented, whereby the classification sub-component employs widely used text classifiers on a data set of almost 5000 news reports manually compiled for this purpose. The results reveal that both support vector machine and multinomial Naïve Bayes classifiers exhibit almost 90% accuracy. Finally, a generic method for calculating the security profile of a city or a region has been developed, which is augmented by visualization and reporting components that map this information onto maps using a geographical information system.

Keywords

Human loss news, news classification, security profiling, machine learning, geo mapping

Date received: 24 April 2020; accepted: 17 September 2020 Handling Editor: Yanjiao Chen

Introduction

Maintenance of law and order is an important issue for every country. Such issues take several different forms, including crime, terrorism, and accidents.¹ Furthermore, natural disasters also result in human loss. Apart from this, counter-terrorism operations also affect the security situation of a place. Different strategies are being devised and applied globally to counter these different types of menace. One important mechanism to curb this issue is to keep a record of all such incidents, possibly with the help of news reports, devise proactive strategies,² and perform security profiling of different locations. The news reports related to the law-and-order situation or the ones involving human loss, that is, involving injury or death of people, are a very

¹Department of Computer Science, University of Management and Technology, Lahore, Pakistan

²Abu Dhabi University, Abu Dhabi, United Arab Emirates

³Dundas Data Visualization, Toronto, ON, Canada

Corresponding author:

Adnan Abid, Department of Computer Science, University of Management and Technology, C II, Johar Town, Lahore 54000, Pakistan.

Email: [email protected]

Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).


crucial source to serve this purpose. The statistics gathered from such news can be used for pre-emptive measures to avoid such incidents; similarly, the data collected from these news reports can be used in computing the security factor of a city, country, or region. Likewise, timely reporting and analysis of different patterns of such events, and forwarding this information to the concerned field formations, is crucial so that they can take appropriate measures to avoid such incidents in the future. To make all of this effective, this information must be regularly updated and made readily available to the involved stakeholders.³

Motivation

The motivation behind this work is to come up with a system that collects human loss news from authentic resources, builds a repository, and uses it for several purposes, including extraction of useful information from each news report, identification of patterns in the occurrence of events, and maintenance of security profiles for different cities and regions. To illustrate the idea, a sample layered map of Pakistan, a country enormously hit by terrorism, shows the districts affected by the law-and-order situation (Figure 1). In the figure, a darker color reflects a higher number of incidents in a given area, while a lighter color shows a lower number of incidents. The figure shows that mostly the western region of the country, which is near the Afghanistan border, has been affected by incidents involving human loss.

To the best of our knowledge, no such architectural framework exists that can help. The proposed framework involves a module for data acquisition using a web crawler. Other components perform preprocessing on the collected news, which is then passed to a classifier module that classifies the news into predefined news categories. Each news report, along with its meta-information, is saved into a news repository. Finally, separate components are used for security profiling using statistical analysis, data visualization, and pattern recognition from the news stored in the repository.

Three important components of the proposed architectural framework are discussed in detail. The first concerns data acquisition and discusses the news crawler that collects the relevant news. The challenges and process of developing such a news crawler are discussed, followed by the statistics of the news repository accumulated by implementing this crawler.

The second component automatically categorizes the news in the repository using widely used text classifiers. The third perspective, covering security profiling, reporting, and data visualization, is also presented.

Contribution

This research work provides the following contributions:

Figure 1. A sample heat-map for terrorism and other human loss incidents in Pakistan.


Defines an architectural framework for human loss news data integration that involves collection, processing, analysis, and visualization of human loss news involving terrorism and law-and-order situations.

Provides detailed design and implementation issues and solutions for its three major components, namely, information extractor, news classifier, and visualization and reporting.

Presents a data set of human loss news gathered and classified manually for the evaluation of the classification component.

Empirically identifies a suitable news classifier for the categorization of human loss news.

Proposes a mechanism for security profiling of a city based on the statistics gathered from the proposed data integration framework.

The rest of the article is structured as follows. The relevant literature is discussed in section “Related work.” The proposed architectural framework and some details of its components are discussed in section “Architectural framework,” which is followed by the section “News repository” that explains the process and relevant details of building the human loss news repository. Section “Classification methodology” presents an empirical comparison of widely used text classifiers to automatically categorize the obtained news into the appropriate human loss news category. Section “Experiments and results” presents the experimental setup, evaluation measures, and results and discussion. Finally, security profiling and reporting and visualization of news are discussed in section “Security profiling and data visualization.” The conclusion and future directions are presented in section “Conclusion.”

Related work

Several data services and systems exist that capitalize on the acquisition and processing of data from these services.⁴ A web crawler is a well-suited tool for news extraction. Guo et al.⁵ provide an effective and easy way, Extract COntent from web News (ECON), to automatically extract content from any news web page written in any language. It exploits the document object model (DOM) tree structure of a news web page and uses features of the DOM tree to do its job.

Organization and management of large volumes of electronic text information is a great challenge.⁶ Text classification can be used as an essential technique to handle this issue. Categorization of textual data into predefined categories is known as text classification. Several text classification applications, such as prediction of user preferences, news filtering, email filtering, and many more, exist⁷⁻⁹ and are used for different purposes. A number of machine learning techniques have been used to classify texts, including rule induction, Naïve Bayes (NB),¹⁰ decision tree induction, K-nearest neighbors (KNN), random forest (RF),¹¹,¹² and support vector machine (SVM).⁹,¹³

Yang and Liu⁷ studied five classifiers with statistical significance tests and show that SVM, KNN, and linear least squares fit (LLSF) perform significantly better than neural networks (NNet) and NB for categories having fewer than ten instances, and that all the classifiers perform equally well when the instances per category exceed 300. Lewis and Ringuette¹⁴ present a study of the performance of an NB and a decision tree classifier on two textual data sets. They show that both classifiers performed reasonably. They also demonstrate the effects of the temporal nature of category definitions.

Moral et al.¹⁵ and Hull and Grefenstette¹⁶ discussed the pros and cons of stemming. They show that the benefits of stemming are specific to context. The nature of the language also influences the performance of a stemmer. Furthermore, the effects of stemming are not always positive, which serves as a warning that stemming words can have mixed effects on classification performance in information retrieval.

Ting et al.¹⁷ aim at highlighting the NB classifier’s performance for document classification. They show that NB is an accurate and computationally efficient classifier for document classification due to its simplicity. Rish¹⁰ discusses the assumptions made by the NB classifier about features, such as class-conditional independence. It shows that NB is accurate both for independent features and for functionally dependent features. Kibriya et al.¹⁸ demonstrate that the performance of the multinomial NB classifier can be improved using term frequency–inverse document frequency (TF–IDF) conversion and normalization of document length. Frank and Bouckaert¹⁹ identified a deficiency of multinomial Naïve Bayes (MNB) when the data set is unbalanced and show that this deficiency can be removed by employing classwise word vector normalization.

Dumais⁸ shows that accurate text classifiers can be learned by training linear SVMs on examples. It reports that SVMs are robust with respect to preprocessing, and that the sequential minimal optimization (SMO) method is quite efficient for learning SVMs even for large textual data sets. Joachims⁹ explores the use of SVMs for learning classifiers from text examples. It analyzes the properties of textual data and identifies why SVMs are applicable to this kind of task. It shows that there is no need for manual tuning of parameters for SVMs. Hsu et al.¹³ provide a guide to the SVM classification technique. They propose a simple procedure that usually gives reasonable results and discuss the use of appropriate kernels in different situations.

Breiman¹¹ proposed RFs. In an RF, each node is split using the best among a subset of predictors randomly chosen at that node. These predictors are chosen with replacement. Results of RFs are comparable to


other classifiers like SVM and NNet. Amaratunga et al.²⁰ elaborate that when the number of features in a data set is huge and the number of truly informative features is small, the performance of RF degrades significantly. In such situations, results can be improved by decreasing the number of trees generated for non-informative features. Xu et al.²¹ developed an improved version of the RF classifier for the classification of textual data. It is designed for high-dimensional data with multiple classes. A feature weighting procedure and a tree selection procedure are implemented for creating an RF suited to the classification of text documents. Biau¹² offers an improvement to the RFs suggested by Breiman. This procedure is consistent and adapts to sparsity. Xu et al.²² propose a model selection method that aims to optimize the tree selection process so that only good trees are included in an RF.

Recently, some work has been published related to visualization and reporting of news. For instance, Watanabe²³ presents a semi-supervised approach to geographical news classification. Similarly, other work has addressed the classification of UN news.²⁴ Another work related to adding semantics to news data has been reported in Rodosthenous and Michael.²⁵ The work done in this research presents a complete architectural framework for maintaining security profiles of cities while compiling data from news reports. It also implements and presents the results of its three important components.

Architectural framework

The proposed system intends to collect news from famous news websites and store this information to generate different useful reports for relevant agencies. It further intends to identify different crime pockets, patterns, and possible connections between different events. The overall architecture of the proposed system is shown in Figure 2.

News crawler

The news will be automatically collected through a crawler that will extract crime- and terrorism-related news from different famous news portals. This component is responsible for browsing the news portals efficiently so as to collect news stories.

Preprocessor

The collected news requires certain preprocessing. The preprocessor in turn comprises three main sub-components, namely, the duplicate detector, the information extractor, and the news classifier.

Duplicate detector

The extraction of information from multiple sources raises two concerns. First, the system may encounter the same news from different news portals on the same day. Second, major incidents are followed for many days by follow-up news, which generally involves updates in statistics, condemnation of events by different people, and so on. This sets up another important requirement of detecting follow-up news stories. Duplicate and follow-up news, if not identified properly, may result in the same incident being added multiple times, hence affecting the credibility of the gathered statistics. Therefore, a separate component needs to be designed and developed to process input news so as to detect duplicate and follow-up news.

Figure 2. Overall architecture of the proposed system.
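The duplicate detector is described here only as a requirement, so the following is a minimal sketch of one common heuristic rather than the component itself: two stories are compared by the Jaccard similarity of their word shingles. The function names, the shingle size `k=3`, and the `0.5` threshold are all assumptions made for illustration.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles of a lower-cased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short texts yield a single shingle (or none if there are no words).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_probable_duplicate(story_a, story_b, threshold=0.5):
    """Flag two stories as probable duplicates (threshold is an assumed value)."""
    return jaccard(shingles(story_a), shingles(story_b)) >= threshold
```

A heuristic of this kind would only surface candidates for review. Follow-up news, which shares an incident but not necessarily its wording, would likely need additional cues such as location and date.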

Information extractor

Apart from the categorization of the news story, useful information shall be extracted from the news story, including the date and time of the event, the location of the event, the number of injured people, and the number of casualties in the incident. Part of this information shall be used to build the context of the news, which can be useful in analyzing it.
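As a rough illustration of the kind of extraction described above, the sketch below pulls casualty and injury counts out of a story with regular expressions. The patterns and the `extract_counts` name are hypothetical; they handle only digit counts followed within a few words by a small set of verbs, and would miss spelled-out numbers or figures such as "1,200".

```python
import re

# Hypothetical patterns: a number, up to three intervening words, then a cue word.
KILLED = re.compile(r"(\d+)\s+(?:\w+\s+){0,3}?(?:killed|dead|died)", re.IGNORECASE)
INJURED = re.compile(r"(\d+)\s+(?:\w+\s+){0,3}?(?:injured|wounded)", re.IGNORECASE)

def extract_counts(story):
    """Return the first killed/injured counts found in the story, or None."""
    killed = KILLED.search(story)
    injured = INJURED.search(story)
    return {
        "killed": int(killed.group(1)) if killed else None,
        "injured": int(injured.group(1)) if injured else None,
    }
```

In a semi-automatic setting, values extracted this way would pre-fill the entry form and still be reviewed by a human.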

News classifier

The news will be further classified based on the text of the news story. To this end, an appropriate text classification algorithm shall be used to automatically classify the news report.

News input manager

The news input manager is responsible for manipulating the gathered news so as to store a single news report and its variants, along with the source of information, in an efficient storage system. Similarly, it also manages the linking of follow-up news with the principal news already present in the system.

Furthermore, this news classification and extraction of certain useful information can lead to an interface that supports a semi-automatic news input system, in which most of the information is furnished by the news preprocessing system and is then reviewed and endorsed by a human. This semi-automatic system will help maintain the credibility of the news repository.

Figure 3 shows the steps involved in preprocessing and news classification, while a sample input screen is shown in Figure 4.

Figure 3. Algorithm for news processing.

Figure 4. Semi-automatic data entry screen.


Reports and graphs

The collection of such news will help us create different useful reports showing the type and number of incidents in different time intervals in specific locations. A sample heat-map report showing the frequency of incidents in different districts of Pakistan is shown in Figure 1.

Pattern recognition

Once the system has accumulated sufficient news, it will provide a platform for analyzing the stored information so as to identify crime pockets and crime patterns.

News repository

This research requires a repository data set comprising human loss news reports. Such work cannot be accomplished in an effective manner in the absence of real and correct data. Thus, a real news repository has been created from scratch by taking advantage of the World Wide Web. News reports are gathered and processed from the websites of famous and top-ranked newspapers of Pakistan to generate this repository.

News categories: A specialized set of news categories is defined on the recommendations of a group of experts from private security agencies of Pakistan (https://hesecurity.com.pk/). A brief description of the news in each category is presented in Table 1.

Automatic news repository generation

The general process for the creation of this purposeful news repository is presented in Figure 5. It comprises a news crawler that extracts the news from the websites of news agencies. It preprocesses each news page to extract the main story from it and then checks for its conformance to human loss news. If a news report falls under the category of human loss news, it is added to the repository; otherwise, it is discarded.

News extractor. A crawler is developed that collects human loss–related news stories from the news websites and saves them into a database. It consists of two modules. The first one visits the home page of the newspaper’s website and extracts all URLs present on it along with their text. These URLs are the links to the individual news reports. It then checks these URLs against a predefined list of words called keywords. If any of the keywords is present in the text or source of a URL, the URL is saved; otherwise, it is dropped. The second module visits the web page of each saved URL and extracts the story presented on it. It stores both the story and the URL in the database for future reference. Thus, the crawler application explores the websites using a breadth-first graph traversal algorithm.
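The first crawler module described above can be sketched as follows, using only the Python standard library and operating on an HTML string rather than a live website. The keyword list, the class name `LinkCollector`, and the function `candidate_links` are assumptions for illustration; the keyword list actually used is not given in this section.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; the list used for the repository is not reproduced here.
KEYWORDS = {"killed", "injured", "blast", "dead", "accident"}

class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside an <a href=...> element is recorded.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None

def candidate_links(home_page_html, keywords=KEYWORDS):
    """Keep the URLs whose address or anchor text contains any keyword."""
    parser = LinkCollector()
    parser.feed(home_page_html)
    parser.close()
    kept = []
    for href, text in parser.links:
        haystack = (href + " " + text).lower()
        if any(word in haystack for word in keywords):
            kept.append(href)
    return kept
```

The match here is a plain substring test, so a keyword such as "dead" would also match "deadline". A word-boundary match might be preferable in practice. The second module, which fetches each kept URL and extracts the story, is not covered by this sketch.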

The extraction of the actual news story from the web page is not trivial. The issue with the extracted story data is that, apart from the actual content of the news story, it also contains advertisements, links to related and recent news stories, menu bars, headers and footers of the web page, and other irrelevant data. Figure 6 shows a sample news story web page. The actual content of the story is in the rectangle, and the ellipses show noisy areas. Almost 70% of the data on every news story web page is irrelevant material. In order to separate and extract the actual content of the story from a news story web page, the technique used in Guo et al.⁵ has been employed. Thus, the crawler application uses breadth-first search to explore all the news while ignoring the noisy data.

Table 1. Categories of human loss news.

| Category | Abbreviation | Description |
| --- | --- | --- |
| Accident | A | All reports of road accidents, plane crashes, train accidents, incidents of drowning, and so on are placed in the accident category |
| Crime | C | All reports of suicides, rivalry-related killings, honor killings, political clashes resulting in casualties, and so on are placed in the crime category |
| Disaster | D | Rains, floods, earthquakes, storms, lightning, and so on are placed in the disaster category |
| Operation | O | All kinds of operations, raids, and encounters conducted by law enforcement agencies fall under the operation category |
| Terrorism | T | Any activity that spreads terror in the society, for example, suicide bombing, bomb blast, or any movement conducted by a declared terrorist organization, cross-border firing (state terrorism), any use of force by the government considered unnecessary by human rights activists (state terrorism), and so on fall in this category |

Figure 5. Process for generating human loss news repository.

News report preprocessing. More than 60,000 news reports were collected by this web crawler application for the period between 1 January 2010 and 31 March 2017. These news reports are passed through different filters, as shown in Figure 7, because not all of these reports are true reports from the perspective of human loss. Certain keywords used in human loss–related news have been used to apply an initial filter so as to collect the human loss news. All the collected news reports are then processed manually, and the reports that were extracted but are not required are discarded from the repository.

Since news reports have been collected from five different sources, a lot of duplicated reports have been found in the repository, as the same incident is reported by all these news agencies. Similarly, some high-impact news is reported continuously for several days and thus needs to be identified as another type of duplicate news, which we refer to as follow-up news. This requires the detection and removal of duplicate reports, so that only one instance of a news story is retained in the final repository. It is pertinent to mention that duplicate news detection is itself a complete research problem; therefore, in this research, duplicate news has been detected manually, and only a single copy of each duplicate news report has been used for visualization and security profiling to avoid any overstatement of facts. We intend to address this problem as one of the promising future directions. After the initial scrutiny, nearly 5000 unique reports are left. All these news reports have been manually categorized into the relevant classes.

Figure 6. A news story web page showing news story and noisy data.


Statistics of the repository. The news repository is composed of nearly 5000 news stories related to human loss (dead and/or injured subjects). Out of these, the most frequent news stories belong to the category crime (C), which accounts for almost 30% of the news stories. This is quite natural, as crime-related incidents in any society outnumber other human loss–related incidents. Interestingly, the next two most frequent categories are terrorism (T), nearly 29%, and operations (O), about 24%. This is because Pakistan has been facing terrorism for more than a decade, and terrorism incidents happened quite frequently in this news collection period. Similarly, the military and other supporting forces conducted several operations against the terrorists during this period. News stories belonging to accidents (A) make up 12% and those of natural disaster (D) about 5% of the collected news. These statistics are summarized in Table 2.

These statistics reflect the peculiar situation that Pakistan has been facing during these years and make this repository special and different from other such news data sets.

Classification methodology

An important objective of this research is to build a classifier that automatically assigns each news story the appropriate class tag. To this end, a conventional process has been adopted, as shown in Figure 8. The whole news repository is divided into two different sets, one for training the classifiers and the other for testing. It is pertinent to mention that a large number of text classifiers have been used for many different purposes. Therefore, it is not trivial to pick any one of them as a suitable classifier for the categorization of human loss news. The most prominent and effective classification techniques for text data are NB, SVM, and RF. Therefore, this research involves all these classifiers in order to choose the best one for the categorization of human loss news.

Multinomial Naïve Bayes

This is a probabilistic model based on Bayes’ theorem that assumes strong independence among features. It makes use of prior and posterior probabilities. The NB classifier calculates the probability with which a news report may fall in a given category and assigns it the category that has the highest probability. The strength of this approach is that it is simple and efficient, as it consumes less computational time and does not require large memory for execution. Every news story consists of English language text, and every distinct word in the text represents a feature. MNB uses these features to calculate the probabilities of words or terms for all classes using the training data set. These probabilities are in turn used to compute the probability scores for a newly encountered news story.

Figure 7. Preprocessing steps for a news.

Table 2. Categories of human loss news.

| Category | Abbreviation | No. of stories | Percentage |
| --- | --- | --- | --- |
| Accident | A | 597 | 12 |
| Crime | C | 1511 | 30 |
| Disaster | D | 237 | 5 |
| Operation | O | 1246 | 24 |
| Terrorism | T | 1432 | 29 |

Figure 8. Process for categorization of news reports.
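The multinomial Naïve Bayes procedure described in this subsection can be sketched in a few lines of standard-library Python. This is an illustrative toy, not the implementation used in the experiments: the function names are hypothetical, and the add-one (Laplace) smoothing controlled by `alpha` is an assumption that the text does not state.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class.

    docs is a list of token lists; alpha is an assumed smoothing constant.
    """
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_like = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_mnb(model, words):
    """Return the class with the highest log posterior score."""
    def score(label):
        log_prior, log_like = model[label]
        # Words never seen in training are skipped.
        return log_prior + sum(log_like[w] for w in words if w in log_like)
    return max(model, key=score)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, which matters for full-length news stories.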

Support vector machine

SVM is a discriminative classifier that makes use of a separating hyperplane. Given labeled training news reports as input, the algorithm outputs an optimal hyperplane that is used to categorize future or test news reports. News reports may or may not be linearly separable. If they are not, the algorithm transforms them to a higher-dimensional space and tries to find a hyperplane in that space. Different kernels are used for this transformation. This research employs the linear kernel and the radial basis function (RBF) kernel. The linear kernel is used if the data are linearly separable among classes, whereas the RBF (Gaussian) kernel is used if the data are not linearly separable. SVM is a computationally complex method compared to NB, but is considered to be a more accurate classification method.
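For reference, the two kernels named above have the following standard textbook forms, where x and x' are feature vectors of two news reports. The width parameter γ is part of the standard definition; its value in the experiments is not reported in this section.

```latex
K_{\text{linear}}(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\top}\mathbf{x}',
\qquad
K_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right),
\quad \gamma > 0 .
```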

Random forest

RF has been reported as a successful method for text classification in recent literature. It is an ensemble learning method for text and other kinds of classification, which bases its decision on several individual decision trees. From the training set, multiple decision trees are constructed, and the output class is the mode of the classes predicted by the trees. These decision trees are constructed based upon randomly selected features (terms/words) from news reports. These features are selected with replacement. For a given news story, the class most frequently predicted by the trees in the forest is considered to be the most appropriate class.
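The voting step described above, in which the forest outputs the mode of the individual tree predictions, can be sketched as follows. Tree construction is deliberately left out: each "tree" here is simply any callable that maps a set of words to a class label, which is an assumption made to keep the example short.

```python
from collections import Counter

def forest_predict(trees, features):
    """Return the class predicted most often by the individual trees."""
    votes = Counter(tree(features) for tree in trees)
    # most_common breaks ties in favor of the class that was voted first.
    return votes.most_common(1)[0][0]
```

For instance, with three stub trees that vote "T", "T", and "C" for a story, the forest returns "T".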

Variants of selected classifiers

In general, text classification problems are of an empirical nature, and most text classifiers apply certain preprocessing techniques to the data set before actually applying the classifiers to the considered data elements. Therefore, this research also involves commonly used natural language processing (NLP) techniques, namely, stemming¹⁵ and TF–IDF weighting.²⁶ Stemming reduces a word to its base form, whereas TF–IDF associates a weight with a given word based on its frequency of occurrence in a given document and in the whole corpus. Thus, for each approach, four different variants are considered: the simple version, stemmed words (represented by a superscripted S), TF–IDF weighted words (represented by a superscripted T), and both stemmed and TF–IDF weighted words (represented by a superscripted TS).
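To make the TF–IDF weighting concrete, the sketch below computes one common variant: term frequency normalized by document length, multiplied by the logarithm of the inverse document frequency. Several TF–IDF formulations exist, and the exact one used in the experiments is not specified here, so this choice and the name `tf_idf` are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted
```

Under this variant, a word that occurs in every document receives a weight of zero, which illustrates how TF–IDF suppresses words that carry no discriminating information.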

Experiments and results

The principal objective of this research is to find the most accurate text classifier for human loss news reports. In order to find the best classifier among the 16 variants of the three classification techniques, rigorous experiments have been conducted and evaluated using appropriate evaluation measures.

Experimental setup

Tenfold cross-validation has been employed to conduct the experiments. That is, every experiment involves a training set consisting of 90% of the instances in the data set, whereas the test set consists of the remaining 10%. It is ensured that this division is stratified, that is, the percentage of instances of each category remains the same in the training and test sets. This whole process is repeated 10 times, and the average results of these 10 iterations are reported. In the end, a discussion of the obtained results provides a synthesis of all the experiments.
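The stratified tenfold split described above can be sketched as follows. This is a minimal illustration under assumptions: the function names and the fixed random seed are hypothetical, and instances of each class are simply shuffled and dealt round-robin into the folds.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each instance index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Dealing round-robin keeps class proportions nearly equal across folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]
```

Each instance appears in exactly one test fold, so averaging a measure over the ten folds uses every news report once for testing.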

Evaluation measures

The evaluation measures used to evaluate this work are presented in Table 3. They include the accuracy, the time to train and classify the instances, the receiver operating characteristic (ROC) curve, and the area under the ROC curve.

Accuracy. Accuracy is a measure of how correctly a classifier classifies the news reports. It is the percentage of reports correctly predicted out of the total number of reports in the news repository. It is a simple measure but reflects the overall correctness of the proposed method. In this article, accuracy is denoted by a.
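The accuracy measure defined above can be computed as in the sketch below. The classwise variant is an assumption about how per-class accuracy is obtained: here it is the percentage of reports of a given true class that were predicted correctly.

```python
from collections import Counter

def overall_accuracy(true_labels, predicted_labels):
    """Percentage of reports whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

def classwise_accuracy(true_labels, predicted_labels):
    """Per-class percentage of correctly predicted reports (assumed definition)."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}
```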

Table 3. Evaluation measures.

| Evaluation measure | Abbreviation |
| --- | --- |
| Accuracy | a |
| Time | t |
| Receiver operating characteristic curve | ROC curve |
| Area under ROC | AUROC |


Time. Time is another measure to judge the performance of a classification algorithm. It is measured in seconds. In this work, the reported time consists of the time taken by each classifier to train a model and to make all the predictions in the aforementioned settings. Time taken is denoted by t.

Receiver operating characteristic (ROC) curve. The ROC curve is a good way to compare the performance of two competing classifiers. In an ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 − specificity). The curve that is closer to the ideal is considered better than the other.

Area under ROC. The area under the ROC curve tells how close a classifier is to perfection. It ranges from 0 to 1. An area of 1 represents perfection, an area of 0.5 corresponds to a random classifier, and an area below 0.5 is worse than random.
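The ROC curve and the area under it can be computed from classifier scores as sketched below. The sketch assumes a binary setting (one class against the rest), with at least one positive and one negative instance; how the multi-class problem was reduced to ROC curves is not stated in this section, so the one-vs-rest framing and the function names are assumptions.

```python
def roc_points(scores, is_positive):
    """Return (false positive rate, true positive rate) points.

    The decision threshold is swept from the highest score to the lowest.
    """
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, positive) in enumerate(ranked):
        if positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every positive instance above every negative one yields an area of 1 under this computation.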

Results and discussion

Variants of multinomial NB. Overall and classwise accuracies of all four variants of MNB are presented in Table 4. It is clear from Table 4 that all these variants performed equally well on the news reports in the considered repository. The overall accuracy of all four is almost 90% and equal at the 1% level of statistical significance. Classwise accuracies are almost the same at the 3% level of statistical significance for each class for all variants. Also, the accuracies of all classes within each variant do not differ by more than 5%, which shows that MNB is not biased toward classes having a greater number of news reports. Furthermore, the results show that the use of weighted terms and stemming has no significant impact on the accuracy of the results. Thus, all the variants seem to be equally good. Hence, if one of the variants needs to be picked from these results, it should be the simplest variant, as it does not involve preprocessing overheads. Thus, simple MNB is considered to be the overall best variant from these experiments.

Variants of SVM. Overall and classwise accuracies of all eight variants of SVM are shown in Table 5. It is clear from overall accuracy that all variants of SVM performed equally well. Accuracy of all eight is almost 90% and equal at 2% level of statistical significance.

Classwise accuracies of the four variants of the RBF kernel are significantly lower than those of the four variants of the linear kernel for classes A and D. For the rest of the classes, they are almost similar at 5% level of statistical significance.

The difference between the accuracies of all classes within each variant is significant and more than 5%, which shows that SVM is biased toward those classes that have a greater number of news reports in the repository.
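The classwise comparison can be sketched as follows, reading classwise accuracy as the percentage of reports of a class that received the correct label (an assumption about the exact definition). The labels are invented so that class C has many reports and class D has few.

```python
from collections import Counter

def classwise_accuracy(true_labels, predicted_labels):
    """Return ({class: accuracy in %}, overall accuracy in %)."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    per_class = {c: 100.0 * correct[c] / totals[c] for c in totals}
    overall = 100.0 * sum(correct.values()) / len(true_labels)
    return per_class, overall

def accuracy_spread(per_class):
    """Gap in percentage points between the best and the worst class."""
    return max(per_class.values()) - min(per_class.values())

# Invented labels: eight reports of class C and two of class D.
y_true = ["C"] * 8 + ["D"] * 2
y_pred = ["C"] * 8 + ["C", "D"]
per_class, overall = classwise_accuracy(y_true, y_pred)
spread = accuracy_spread(per_class)
```

In this toy case the overall accuracy is 90% although only half of the class D reports are labeled correctly, which is the kind of bias toward larger classes that a high overall accuracy can hide.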

SVM^{T(RBF)} is picked as the best variant as its overall accuracy is the best among all the variants. The TF–IDF representation helps assign appropriate weights to the terms that are meaningful for classification and also reduces the impact of stop-words. Therefore, SVM^{T(RBF)} performs better than the other variants.

Variants of RF. The results of the experiments conducted with the variants of RF are shown in Table 6. It is clear from the overall accuracies that all variants of RF performed equally well. The accuracy of all four is almost 86% and equal at 1% level of statistical significance. Classwise accuracies are almost the same at 5% level of statistical significance for each class for all variants except D. The difference between the accuracies of all classes within each variant is significant and more than 5%, which means RF is biased toward those classes having a greater number of news reports. RF^T is selected as the best variant as its overall accuracy is better than that of the rest of the variants.

Table 4. Classwise and overall accuracy a of all preprocessing variants for MNB.

| Class | MNB | MNB^S | MNB^T | MNB^TS |
|---|---|---|---|---|
| A | 85.6 | 84.5 | 87.1 | 86.1 |
| C | 89.0 | 88.5 | 89.9 | 88.8 |
| D | 89.0 | 91.8 | 90.4 | 91.8 |
| O | 90.9 | 90.3 | 90.9 | 88.3 |
| T | 89.3 | 89.3 | 88.4 | 88.4 |
| Overall | 89.2 | 88.9 | 89.3 | 88.3 |

MNB: multinomial Naïve Bayes.

Table 5. Classwise and overall accuracy a of all variants of SVM.

| Class | SVM^{(R)} | SVM^{S(R)} | SVM^{T(R)} | SVM^{TS(R)} | SVM^{(L)} | SVM^{S(L)} | SVM^{T(L)} | SVM^{TS(L)} |
|---|---|---|---|---|---|---|---|---|
| A | 80.4 | 79.4 | 82.5 | 82.5 | 87.1 | 87.6 | 88.1 | 88.1 |
| C | 92.3 | 92.5 | 92.7 | 93.4 | 92.1 | 89.9 | 92.1 | 91.2 |
| D | 67.1 | 68.5 | 72.6 | 74 | 71.2 | 75.3 | 75.3 | 79.5 |
| O | 91.6 | 91.4 | 93 | 93 | 92.4 | 91.4 | 93.5 | 93 |
| T | 91.7 | 89.5 | 93 | 90.1 | 91.3 | 88.4 | 89 | 88.6 |
| Overall | 89.4 | 88.6 | 90.7 | 90.1 | 90.4 | 88.8 | 90.2 | 89.9 |

SVM: support vector machine.

Comparison for best variant

The results of all variants of each approach have shown that simple MNB, SVM^{T(RBF)}, and RF^T are the best variants of their respective classifiers. In order to choose the overall best classifier, a detailed comparative analysis of the selected variants has been conducted. It is pertinent to mention that TF–IDF-based preprocessing performs better than stemming and simple preprocessing. The reason is that TF–IDF assigns meaningful weights to the terms that are meaningful for classification. Similarly, it also tends to assign negligible weights to the stop-words.
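The effect of TF–IDF on stop-words can be illustrated with a minimal sketch. It uses raw term frequency and idf = log(N/df); other TF–IDF formulations exist, and this choice, the function name `tf_idf`, and the three sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term frequency and idf = log(N / df); this particular
    formulation is an assumption of the sketch.
    """
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights

# Invented sentences; "the" and "in" occur in every one of them.
docs = [
    "the blast injured ten people in the market",
    "the flood displaced people in the village",
    "the police arrested the suspect in the city",
]
w = tf_idf(docs)
```

A term that occurs in every document, such as "the", receives a weight of zero, whereas a term confined to one document, such as "blast", receives the largest weight.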

Figure 9 shows the classwise accuracy of all three best variants. It is very clear that these three variants perform equally well at less than 5% level of statistical significance against the C, O, and T classes. However, the performance of RF^T is degraded as compared to MNB and SVM^{T(RBF)} (also shown as SVM^{T(R)} due to lack of space) against the A and D classes. It means that RF^T performs poorly against classes having fewer news reports. A comparison of MNB and SVM^{T(RBF)} shows that both perform equally well against the A class at 5% level of statistical significance, but MNB outperforms SVM^{T(RBF)} against the D class quite significantly. Hence, keeping all this in view, it can be claimed that MNB outperforms the other two best competitors in terms of classwise accuracy.

Table 7 shows the comparison of all best variants in terms of accuracy, time, and area under the ROC curve.

It shows that the accuracy of all three selected variants, MNB, SVM^{T(RBF)}, and RF^T, is comparable and almost the same. MNB and SVM^{T(RBF)} are equal at 1% level of statistical significance, whereas all three are equal at 3% level of statistical significance. It can be inferred that this small difference is by chance. Hence, all three are equally good as far as accuracy is concerned.

Similarly, in terms of the time taken to train and predict news reports, the statistics show that MNB outperforms both SVM^{T(RBF)} and RF^T by a great margin. MNB is almost three times faster than SVM^{T(RBF)} and almost 27 times faster than RF^T. This difference is huge and it will grow as the number of news reports increases. The reason is that MNB, by nature, involves only simple arithmetic and no complex procedure, which makes it computationally less complex and less expensive than SVM^{T(RBF)} and RF^T.

Table 7 shows the AUROC comparison for the selected variants. It is evident that all three best variants are performing equally well. The value of 0.9 or above is considered a very good result, and it is clear that all the selected variants have higher values than 0.9.

Table 8 shows the classwise area under ROC of all three best variants, whereas Figure 10 shows their classwise ROC curves. The performance of all three is close to perfection as per the statistics listed in Table 8. All values are above 0.9, which is considered excellent. The ROC curves shown in Figure 10 also testify to this, as all curves are away from the diagonal line (random classifier line) and near the top left corner (0, 1) of the chart, which corresponds to perfection.

Table 6. Classwise and overall accuracy a of all variants of random forest.

| Class | RF | RF^S | RF^T | RF^TS |
|---|---|---|---|---|
| A | 66 | 65.5 | 68.6 | 64.4 |
| C | 90.5 | 91 | 90.3 | 90.3 |
| D | 61.6 | 65.8 | 68.5 | 63 |
| O | 90.6 | 89 | 89 | 89 |
| T | 91.9 | 90.9 | 92.4 | 91.7 |
| Overall | 86.7 | 86.3 | 87.1 | 86.1 |

RF: random forest.

Table 7. Comparison of a, t, and AUROC for all best variants.

| Measure | MNB | SVM^{T(RBF)} | RF^T |
|---|---|---|---|
| a | 89.1% | 89.9% | 87.1% |
| t | 3 s | 10 s | 82 s |
| AUROC | 0.979 | 0.960 | 0.981 |

AUROC: area under receiver operating characteristic; MNB: multinomial Naïve Bayes; SVM: support vector machine; RF: random forest.

Figure 9. Comparison of classwise accuracy of all best variants.

Table 8. Classwise comparison (area under ROC) of all best variants.

| Variant | A | C | D | O | T |
|---|---|---|---|---|---|
| MNB | 0.993 | 0.973 | 0.998 | 0.982 | 0.973 |
| SVM^{T(RBF)} | 0.975 | 0.954 | 0.989 | 0.981 | 0.939 |
| RF^T | 0.984 | 0.974 | 0.998 | 0.99 | 0.976 |

MNB: multinomial Naïve Bayes; SVM: support vector machine; RF: random forest.

After analyzing accuracy, time, areas under ROC, and ROC curves, it can be claimed that MNB is accurate, close to perfection, and more efficient in terms of time than the other two selected variants. Therefore, it can be concluded from this research that MNB is overall the best of all the considered variants for the classification of human loss news.

Security profiling and data visualization

The input data compile a useful repository for further analysis. The analyzers can then generate expedient reports from this repository reflecting the security situation in a given geographical or political unit (city, district, division, province, and so on) in a given period of time.

News classifier in data ingestion

In the proposed system, the news crawler fetches news stories; among these stories, the human loss news items are retained, while the others are discarded. The retained items are passed to the news classifier module discussed in this section, which automatically assigns a category to each of them.

Figure 11 shows how this classifier is integrated with the data ingestion system. It can be observed that the data ingestion system also records other meta-information related to the extracted news, including date, Islamic month and date, source of data, and location of data. It also extracts statistics with the help of the preprocessor and fills the fields showing the number of casualties and injured people during an incident. All this information is shown to a data entry operator, whose job is just to quickly vet the extracted information, thus resulting in a semi-automatic data acquisition system. The data entry operator may further enrich or correct the extracted information where required. For instance, the operator may add some missing information, such as responsibility claim, target, and so on. The data entry operator may also correct the outcome of the classifier whenever required.
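The record that reaches the data entry operator can be sketched as a small data structure. The class name `NewsRecord`, its field names, the helper `vet`, and the sample values are all hypothetical; they only mirror the kinds of information listed above and do not describe the actual schema of the system.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class NewsRecord:
    """One ingested human loss news item (field names are illustrative)."""
    headline: str
    published: date
    islamic_date: str
    source: str
    location: str
    deaths: int
    injured: int
    predicted_category: str                    # output of the news classifier
    vetted_category: Optional[str] = None      # set by the data entry operator
    extra: dict = field(default_factory=dict)  # e.g. responsibility claim, target

    @property
    def category(self) -> str:
        """The operator's correction wins over the classifier's prediction."""
        return self.vetted_category or self.predicted_category

def vet(record: NewsRecord, category: Optional[str] = None, **extra) -> NewsRecord:
    """Apply the operator's corrections and enrichments to a record."""
    if category is not None:
        record.vetted_category = category
    record.extra.update(extra)
    return record

# Invented example: the classifier predicted T, the operator corrects it to C.
record = NewsRecord(
    headline="Shopkeeper killed during robbery",
    published=date(2014, 3, 1),
    islamic_date="<Hijri date as extracted>",
    source="example-news-portal",
    location="Karachi",
    deaths=1,
    injured=0,
    predicted_category="T",
)
vet(record, category="C", target="shopkeeper")
```

Keeping the predicted and the vetted category in separate fields preserves the classifier's original output, so operator corrections could later be compared with the predictions.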

Security profiling based on gathered statistics

Based on the raw data, security profiles can be created for different geographical regions. The data shown in Table 9 are a comparison report showing the number of different types of incidents that happened in the provinces of Pakistan during 2013 and 2014. The number of different types of incidents is a significant parameter for security profiling. At the same time, the meta-information extracted from the news, including number of deaths, injuries, type of event, and so on, helps enrich these parameters further. The whole information is subsequently graded based on the number and types of the incidents and the number of injuries and deaths in those incidents in a given period, thus generating a security profile score for each city in a specific period.

Figure 10. Comparison of ROC curves for best variants. ROC: receiver operating characteristic.

Thus, by applying different filters, the user can create a report that shows events of a specific type, in a given period of time, having more than 10 casualties, in a given region. These reports can be materialized and archived in different suitable and useful file formats.
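Such a filtered report can be sketched as follows. The `Incident` type, the function `filter_incidents`, and the sample incidents are hypothetical, and "casualties" is read as deaths, following the usage in this article.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Incident:
    category: str   # A, C, D, O, or T
    region: str
    occurred: date
    deaths: int
    injured: int

def filter_incidents(incidents, category=None, region=None,
                     start=None, end=None, more_than=None):
    """Return the incidents that satisfy every filter that is not None.

    more_than keeps incidents with strictly more deaths than the given value.
    """
    selected = []
    for item in incidents:
        if category is not None and item.category != category:
            continue
        if region is not None and item.region != region:
            continue
        if start is not None and item.occurred < start:
            continue
        if end is not None and item.occurred > end:
            continue
        if more_than is not None and item.deaths <= more_than:
            continue
        selected.append(item)
    return selected

# Invented incidents used only to exercise the filters.
incidents = [
    Incident("T", "Sindh", date(2014, 1, 5), 14, 30),
    Incident("T", "Sindh", date(2014, 6, 9), 4, 11),
    Incident("C", "Sindh", date(2014, 2, 2), 12, 0),
    Incident("T", "Punjab", date(2014, 3, 3), 20, 45),
    Incident("T", "Sindh", date(2013, 12, 30), 25, 40),
]
report = filter_incidents(
    incidents, category="T", region="Sindh",
    start=date(2014, 1, 1), end=date(2014, 12, 31), more_than=10,
)
```

Each filter is optional, so the same function covers reports restricted by type, region, period, or casualty count in any combination.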

Computing the security profile score involves assigning a weightage to each incident category and to the number of dead and injured persons. This score can be used to draw a layered map showing the peaceful areas and the crime and terrorism pockets in the country. It is computed as

$$
t_r = \left[ \sum_{c=0}^{n} v_c \cdot r^{c}_{r} \right] + v_d \cdot d^{d}_{r} + v_i \cdot d^{i}_{r} \qquad (1)
$$

where v_c represents the weightage of incidents of a given type and r^c_r represents the number of events of category “c” in region “r” in a given period of time. Similarly, v_d represents the weightage for a single casualty and v_i represents the weightage for a single injury in a given event, whereas d^d_r and d^i_r represent the total number of deaths and injuries, respectively, in region “r” in a given period of time. A higher score reflects that there are numerous incidents in a given city and that its security profile is weak, whereas a low score reflects peace.

Figure 11. Sample heat-map for incidents in Pakistan.

Table 9. Statistics of events that took place in different provinces.

Crime Disaster Operation Terrorism Total

Year 13 14 13 14 13 14 13 14 13 14

AJK 1 1 2

Balochistan 26 28 11 4 3 1 2 1 3 79

Capital 2 1 3

Fata 12 2 14

KPK 25 4 12 6 2 7 12 2 5 75

Punjab 16 6 3 1 3 1 30

Sindh 17 5 1 8 1 39 24 1 2 24 122

AJK: Azad Jammu Kashmir; KPK: Khyber Pakhtunkhwa.
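Equation (1) can be evaluated directly once the weightages are chosen. In the sketch below, the function name and all weights and counts are hypothetical; the values of the weightages used in this work are not listed in the text.

```python
def security_profile_score(event_counts, deaths, injuries,
                           category_weights, death_weight, injury_weight):
    """Evaluate equation (1) for one region and one period."""
    incident_term = sum(
        category_weights[c] * n for c, n in event_counts.items()
    )
    return incident_term + death_weight * deaths + injury_weight * injuries

# Hypothetical weightages per category (v_c), per death (v_d), per injury (v_i).
weights = {"A": 1.0, "C": 2.0, "D": 1.0, "O": 2.0, "T": 5.0}
# Invented event counts per category for one region and period.
counts = {"A": 3, "C": 4, "D": 0, "O": 1, "T": 2}
score = security_profile_score(
    counts, deaths=12, injuries=30,
    category_weights=weights, death_weight=3.0, injury_weight=1.0,
)
```

With these invented inputs, the incident term is 3 + 8 + 0 + 2 + 10 = 23, the death term is 36, and the injury term is 30, giving a score of 89.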

Visualization and reporting

The visualization and reporting module represents the data in tabular and graphical formats. It is supported by some customized reports that are frequently required. Furthermore, a flexible data extractor is also part of this module; it allows the users to apply filters on any type of attribute shown in the visualization of the number of incidents and the injured and dead people in those incidents in a given period of time in the whole country. Similarly, Figure 12 shows another graph presenting the percentage of people who got injured in accidents that happened in different provinces. It shows that nearly 82% of the injuries happened in Sindh, which is the second largest province in the country in terms of population, while nearly 6% of the injuries happened in Punjab, the largest province in terms of population.

Similarly, the visualization component helps generate the graphical form of many useful reports. For instance, Figure 13 shows the different types of events that occurred during a specified period, while Figure 14 shows the heat-map of the incidents that took place, where the radius of each circle reflects the human loss in the form of deaths and injuries in the incident. It also shows the details of the incident pointed to by the mouse.

Conclusion

This research has proposed an architectural framework for accumulating human loss news from news portals, so as to process them and build a security index for different cities and regions. To this end, all the components have been discussed, while the data accumulation, classification, and data visualization components have been developed. To implement the classification component, widely used text classifiers have been employed to categorize human loss news. A repository of such news has been generated by designing and developing a customized web crawler that crawled five top news websites of Pakistan. All the collected news items were manually checked and categorized into appropriate classes that include accident, crime, disaster, operation, and terrorism. These classes were provided by the domain experts. This repository served as a gold standard to empirically evaluate some widely used text classifiers including MNB, SVM, and RF. Based on certain preprocessing techniques, 16 variants of these classifiers have been tested and evaluated using appropriate evaluation measures. The study concludes that MNB is the best of them, as it achieved 89% overall accuracy and more than 85% accuracy against all classes. Similarly, in terms of time taken, it is the most efficient classification approach among the considered classifiers. Its overall area under ROC was greater than 0.9, which is considered excellent. The classwise area under ROC for all classes was also greater than 0.9.

Figure 12. Pie-chart showing percentage of injured persons per province for accident category.

Figure 13. Different types of incidents occurred during a specific period.

Although the number of news reports in the repository was sufficient for this study, more reports will be collected in the near future to extend the size of this set and verify the results on a bigger set. In this research, news reports were assigned to only one category, but there is a need to classify some news reports into more than one category. Multi-label classification of such reports is a possible extension to this work. In this study, the duplicate news reports were detected manually. There is a need to devise an automated mechanism for duplicate detection, that is, an automatic duplicate detector for human loss news.

Similarly, another possible extension to this work involves the summarization of these news reports and the extraction of useful statistical and contextual information from the news report.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research work was supported by Abu Dhabi Faculty Research grant.

ORCID iDs

Adnan Abid https://orcid.org/0000-0003-2602-2876
Muhammad Shoaib Farooq https://orcid.org/0000-0002-4095-8868

Figure 14. Heat-map of different incidents.