International Journal of Distributed Sensor Networks
2020, Vol. 16(10) ÓThe Author(s) 2020 DOI: 10.1177/1550147720965473 journals.sagepub.com/home/dsn
An architectural framework for
information integration using machine learning approaches for smart city
security profiling
Adnan Abid1 , Ansar Abbas1, Adel Khelifi2,
Muhammad Shoaib Farooq1 , Razi Iqbal3and Uzma Farooq1
Abstract
In the past few decades, the whole world has been badly affected by terrorism and other law-and-order situations. The newspapers have been covering terrorism and other law-and-order issues with relevant details. However, to the best of our knowledge, there is no existing information system that is capable of accumulating and analyzing these events to help in devising strategies to avoid and minimize such incidents in future. This research aims to provide a generic architectural framework to semi-automatically accumulate law-and-order-related news through different news portals and classify them using machine learning approaches. The proposed architectural framework discusses all the important components that include data ingestion, preprocessor, reporting and visualization, and pattern recognition. The information extractor and news classifier have been implemented, whereby the classification sub-component employs widely used text classi- fiers for a news data set comprising almost 5000 news manually compiled for this purpose. The results reveal that both support vector machine and multinomial Naı¨ve Bayes classifiers exhibit almost 90% accuracy. Finally, a generic method for calculating security profile of a city or a region has been developed, which is augmented by visualization and reporting components that maps this information onto maps using geographical information system.
Keywords
Human loss news, news classification, security profiling, machine learning, geo mapping
Date received: 24 April 2020; accepted: 17 September 2020 Handling Editor: Yanjiao Chen
Introduction
Maintenance of law-and-order is an important issue for every country. There are several different forms of such issues including crime, terrorism, and accidents.1 Furthermore, natural disasters also result into human loss. Apart from this, counter terrorism operations also affect the security situation of a place. Different strate- gies are being devised and applied globally to counter these different types of menace. One important mechan- ism to curb this issue is to keep a record of all such inci- dents, possibly with the help of news reports and devise pro-active strategies2and perform security profiling of
different locations. The news reports related to the law- and-order situation or the ones involving human loss, that is, involving injury or death of people, are very
1Department of Computer Science, University of Management and Technology, Lahore, Pakistan
2Abu Dhabi University, Abu Dhabi, United Arab Emirates
3Dundas Data Visualization, Toronto, ON, Canada
Corresponding author:
Adnan Abid, Department of Computer Science, University of Management and Technology, C II, Johar Town, Lahore 54000, Pakistan.
Email: [email protected]
Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages
(https://us.sagepub.com/en-us/nam/open-access-at-sage).
crucial source to serve this purpose. The statistics gath- ered from such news can be used for pre-emptive mea- sures to avoid such incidents; similarly, the data collected from these news can be used in computing security factor of a city, country, or a region. Likewise, timely reporting and analysis for different patterns of such events and then forwarding this information to take decisions is very crucial to the concerned field for- mations so that they can take appropriate measures to avoid such incidents in future. In order to make all these effective, this information must be regularly updated and made readily available to the involved stakeholders.3
Motivation
The motivation behind this work is to come up with a system to collect human loss news from authentic resources, build a repository, and use it for several pur- poses including extraction of useful information from each news, identifying the patterns in the occurrence of events, maintenance of security profiles for different cities and regions, and so on. In order to understand the idea, an sample layered map of Pakistan, a country enormously hit by terrorism, shows the districts affected by law-and-order situation, as shown in Figure 1. In the figure, the darker color reflects higher number of incidents in a given area, while the lighter color shows lesser number of incidents. The figure shows that mostly the western region of the country
which is near Afghanistan border has been affected by incidents involving human loss.
To the best of our knowledge, no such architectural framework exists that can help. The proposed frame- work involves a module for data acquisition using a web crawler, other components perform preprocessing on the collected news, which are then passed to a classi- fier module that classifies these news into predefined news categories. Each news along with its meta- information is saved into a news repository. Finally, separate components are used for security profiling using statistical analysis, data visualization, and pattern recognition from the news stored in the repository.
Three important components of the proposed architec- tural framework have been discussed in detail. The first one is about data acquisition that discusses the news crawler to collect the relevant news. The challenges and process of developing such news crawler have been dis- cussed, which are followed by the statistics of the news repository accumulated by implementing this crawler.
The second component is the one that automatically categorizes the news in the repository using widely used text classifiers. While another perspective of security profiling, reporting, and data visualization has also been presented.
Contribution
This research work provides the following contributions:
Figure 1. A sample heat-map for terrorism and other human loss incidents in Pakistan.
Defines an architectural framework for human loss news data integration that involves collec- tion, processing, analyzing, and visualization of human loss news involving terrorism and law- and-order situations.
Provides detailed design and implementation issues and solution for its three major compo- nents, namely, information extractor, news clas- sifier, and visualization and reporting.
Presents a data set of human loss new gathered and classified manually for the evaluation of classification component.
Empirically identifies a suitable news classifier for the categorization of human loss news.
Proposes a mechanism for security profiling of a city based on the statistics gathered from the pro- posed data integration framework.
The rest of the article is structured in the following manner. The relevant literature is discussed in section
‘‘Related work.’’ While, the proposed architectural framework and some details of its components are dis- cussed in section ‘‘Architectural framework,’’ which is followed by the section ‘‘News repository’’ that explains the process and relevant details of building human loss news repository. Section ‘‘Classification methodology’’
presents an empirical comparison of widely used text classifiers to automatically categorize obtained news into appropriate human loss news category. Section
‘‘Experiments and results’’ presents experimental setup, evaluation measures, and result and discussion. Finally, security profiling and reporting and visualization of news is discussed in section ‘‘Security profiling and data visualization.’’ The conclusion and future directions is presented in section ‘‘Conclusion.’’
Related work
There exist several data services and systems which capitalize on acquisition and processing of data from these services.4 Web crawler is the best tool for news extraction. Guo et al.5 provide an effective and easy way, Extract COtent from web News (ECON), to extract content from any news web page written in any language automatically. It exploits document object model (DOM) tree structure of news web page and uses features of DOM tree to do its job.
Organization and management of large volumes of electronic text information are a great challenge.6Text classification could be used as an essential technique to handle this issue. Categorization of textual data to pre- defined categories is known as text classification.
Several text classification applications, such as predic- tion of user preferences, news filtering, email filtering, and many more, exist7–9and are used for different pur- poses. A number of machine learning techniques have
been used to classify texts including rule induction, Naı¨ve Bayes (NB),10decision tree induction, K-nearest neighbors (KNN), random forest (RF),11,12 and sup- port vector machine (SVM).9,13
Yang and Liu7studied five classifiers with statistical significance and show that SVM, KNN, and linear least square fit (LLSF) perform significantly better than NNet and NB against categories having less than ten instances and all the classifiers perform equally well when instances per category are more than 300. Lewis and Ringuette14present a study regarding performance of an NB and a decision tree classifier on two textual data sets. They show that both classifiers performed reasonably. They also demonstrate the effects of tem- poral nature of definitions of categories.
Moral et al.15and Hull and Grefenstette16discussed pros and cons of stemming. They show that benefits of stemming are specific to context. Nature of language also influences the performance of a stemmer.
Furthermore, effects of stemming are not always posi- tive, a warning note on the exercise of stemming words, which can have mixed effects on classification perfor- mance in information retrieval.
Ting et al.17aim at highlighting NB classifier’s per- formance for document classification. It shows that NB is accurate and computationally efficient classifier for document classification due to its simplicity. Rish10dis- cusses assumptions made by NB classifier about fea- tures such as class independence. It shows that NB is accurate because of independent features and function- ally dependent features. Kibriya et al.18 demonstrate that performance of multinomial NB classifier can be improved using term frequency–inverse document fre- quency (TF–IDF) conversion and normalization of document length. Frank and Bouckaert19 identified a deficiency of multinomial Naı¨ve Bayes (MNB) when data set is unbalanced and shows that this deficiency can be removed by employing classwise word vector normalization.
Dumais8shows that using linear SVMs against train- ing examples, an accurate text classifiers can be learned.
It reports that SVMs are robust for preprocessing, and sequential minimal optimization (SMO) method is quite efficient for learning SVMs even for large textual data sets. Joachims9explores usage of SVMs for classification from text examples. It analyzes properties of textual data and identifies applicability of SVMs for this kind of task.
It shows that there is no need for manual tuning of para- meters for SVMs. Hsu et al.13provide a guide for SVM classification technique. They propose a simple proce- dure which usually gives reasonable results. It discusses usage of appropriate kernels in different situations.
Breiman11 proposed RFs. In an RF, each node is split using the best among a subset of predictors ran- domly chosen at that node. These predictors are chosen with replacement. Results of RFs are comparable to
other classifiers like SVM and NNet. Amaratunga et al.20elaborate that when number of features in a data set are huge and truly informative features is small, then performance of RF degrades significantly. In such situations, results can be improved by decreasing the number of trees generated for non-informative features.
Xu et al.21developed an improved version of RF classi- fier for classification of textual data. It is designed for high dimensional data with multiple classes. A feature weighting procedure and tree selection procedure are implemented for creating RF suited to text documents’
classification. Biau12 offers an improvement to RFs suggested by Breiman. This procedure is consistent and adapts to sparsity. Xu et al.22propose a model selection method which aims to optimize the tree selection pro- cess so that only good trees are included in an RF.
Recently, some work has been published related to visualization and reporting of news. For instance, in Watanabe,23 the authors present a semi-supervised approach to geographical news classification. Similarly, some other work has been accomplished on classifica- tion of UN news.24 Another work related to adding semantics to news data has been reported in Rodosthenous and Michael.25 The work done in this research presents a complete architectural framework for maintaining security profiles of cities while compil- ing data from news reports. It also implements and pre- sents the results of its three important components.
Architectural framework
The proposed system intends to collect the news from famous news websites and stores this information to
generate different useful reports for relevant agencies.
It further intends to identify different crime pockets, the patterns, and possible connections between differ- ent events. The overall architecture of the proposed sys- tem is shown in Figure 2.
News crawler
The news will be automatically collected through a crawler that will extract crime- and terrorism-related news from different famous news portals. This compo- nent is responsible for browsing the news portals effi- ciently so as to collect news stories.
Preprocessor
The collection of news invites certain preprocessing of the collected news. The preprocess in turn comprises three main sub-components, namely, duplicate detec- tor, information extractor, and the news classifier.
Duplicate detector
This extraction of information from multiple sources involves two types of concerns: first, system may encounter same news from different news portals on the same day. Second, there are follow-up news for major incidents for many days which generally involve updates in statistics, condemnation of events by differ- ent people, and so on. This sets up another important requirement of detecting follow-up news stories. The duplicate and follow-up news, if not identified prop- erly, may result into adding the same incident multiple times, hence affecting the credibility of the gathered Figure 2. Overall architecture of the proposed system.
statistics. Therefore, it requires the design and develop- ment of a separate component to process input news so as to detect duplicate and follow-up news.
Information extractor
Apart from the categorization of the news story, useful information shall be extracted from the news story including date and time of the event, location of the event, number of injured people, and number of casual- ties in the incident. Part of this information shall be
used to build the context of news, which can be useful in analyzing the news.
News classifier
This news will be further classified based on the text in the news story. To this end, appropriate text classifica- tion algorithm shall be used to automatically classify the news report.
News input manager
The news input manager is responsible for manipulat- ing the gathered news so as to store a single news and its variants along with the source of information in an efficient storage system. Similarly, it also manages the linking of follow-up news with the principal news already present in the system.
Furthermore, this news classification and extraction of certain useful information can lead to an interface that would help semi-automatic news input system.
Where, most of the information is furnished by the news preprocessing system and is then reviewed and endorsed by a human. This semi-automatic system will help maintaining the credibility of the news repository.
Figure 3 shows the steps involved in preprocessing and news classification. While, a sample input screen is shown in Figure 4.
Figure 3. Algorithm for news processing.
Figure 4. Semi-automatic data entry screen.
Reports and graphs
The collection of such news will help us creating differ- ent useful reports showing the type and number of inci- dents in different time intervals in specific locations. A sample heat-map report showing frequency of incidents in different districts in Pakistan is shown in Figure 1.
Pattern recognition
Once the system will have sufficient news, it would set the platform to analyze the stored information so as to identify crime pockets and crime patterns using the stored information.
News repository
This research requires a repository data set comprising human loss news reports. Such work cannot be accom- plished in an effective manner in the absence of real and correct data. Thus, a real news repository has been cre- ated from scratch by taking advantage of the World Wide Web. News reports are gathered and processed from the websites of famous and top-ranked newspa- pers of Pakistan to generate this repository.
News categories: A specialized set of news categories are defined on the recommendations of a group of experts from private security agencies of Pakistan (https://hesecurity.com.pk/). A brief description for the news of each category is presented in Table 1.
Automatic news repository generation
The general process for the creation of this purposeful news repository is presented in Figure 5. It comprises a news crawler that extracts the news from the websites of news agencies. It preprocesses each news page to extract the main story from it and then checks for its conformance to human loss news. If a news falls under
the category of human loss news, then it is added to the repository and is discarded otherwise.
News extractor. A crawler is developed that collects human loss–related news stories from the news websites and saves them into a database. It consists of two mod- ules, first one visits home page of the newspaper’s web- site and extracts all URLs present on it along with their text. These URLs are actually the links to the individual news reports. Then, it checks these URLs against a predefined list of words called keywords. If any of the keywords is present in the text or source of URL, it is saved otherwise dropped. Whereas, the sec- ond module visits web page of each link of these saved URLs in the list and extracts story presented on it. It stores both story and URL into database for future ref- erence. Thus, the crawler application explores the web sites using breadth first graph traversal algorithm.
The extraction of actual news story from the web page is not trivial. As the issue with the extracted story data is that apart from the actual content of news Table 1. Categories of human loss news.
Category Abbreviation Description
Accident A All reports in which road accidents, plane crashes, train accidents, incidents of drowning, and so on are reported and placed in accident category
Crime C All suicidal, rivalry related, honor killing related, political clashes resulting in causalities, and so on are reported in crime category
Disaster D Rains, floods, earthquakes, storms, lightening, and so on are placed in disaster category
Operation O All kinds of operations, raids, encounters conducted by law and enforcement agencies are under operation category
Terrorism T Any activity that spreads terror in the society, for example, suicide bombing, bomb blast, or any movement conducted by declared terrorist organization, cross-border firing (state terrorism), any kind of use of force by government considered by human activists unnecessary (state terrorism), and so on fall in this category
Figure 5. Process for generating human loss news repository.
story, it also constrains some advertisements, links to related and recent news stories, menu bars, headers and footers of web page, and other irrelevant data. Figure 6 shows a sample of a news story web page. Actual con- tent of story is in rectangle and ellipses show noisy areas. Almost 70% data of every news story web page are occupied with irrelevant material. In order to sepa- rate and extract actual content of story from a news story web page, the technique used in Guo et al.5has been employed. Thus, the crawler application uses breadth first search to explore all the news while ignor- ing the noisy data.
News report preprocessing. More than 60,000 news reports are collected by this web crawler application for the period between 1 January 2010 and 31 March 2017.
These news reports are passed through different filters, as shown in Figure 7, as all of these reports are not true reports in perspective of human loss. Certain keywords used in human loss–related news have been used to apply an initial filter so as to collect the human loss news. All the news collected are then processed
manually and the reports which are extracted but are not required are discarded from the repository manually.
Since news reports have been collected from five dif- ferent sources, therefore, a lot of duplicated reports have been found in the repository, as same incident is reported by all these news agencies. Similarly, some high-impact news are continuously reported for several days, and thus need to be identified as another type of duplicate news, which we refer to as follow-up news.
This requires the detection and removal of duplicate reports. Thus, only one instance of a news story is retained in the final repository. It is pertinent to men- tion that duplicate news detection is itself a complete research problem; therefore, in this research, duplicate news have been manually detected, and only a single copy of each duplicate news has been used for visualiza- tion and security profiling to avoid any over statement of facts. However, we intend to address this problem as one of the promising future directions. After the initial scrutiny, nearly 5000 unique reports are left. All these news are already manually categorized into the relevant classes.
Figure 6. A news story web page showing news story and noisy data.
Statistics of the repository. The news repository is com- posed of nearly 5000 news stories related to human loss (dead and/or injured subjects). Out of these, the most frequent news stories belong to the categorycrime (C) and almost 30% of the news stories are from this cate- gory. It is quite natural as crime-related incidents in any society are greater than any other human loss–
related incidents. Interestingly, the next two categories of most frequent stories are of terrorism (T), nearly 29%, and operations(O), nearly 23%. This is because of the fact that Pakistan is facing terrorism issue for more than a decade, and terrorism incidents have hap- pened quite frequently in this news collection period.
Similarly, during this period, the military and other supporting forces have conducted several operations against the terrorists during this time period. Number of news stories belonging toaccidents(A) are 12% and those ofnaturaldisaster(D) are almost 4% of the col- lected news. These statistics are summarized in Table 2.
These statistics reflect peculiar situation that Pakistan has been facing during these years and makes this repo- sitory special and different from any such news data sets.
Classification methodology
An important objective of this research is to build a classifier to automatically categorize the news stories into appropriate class tag. To this end, a conventional
process has been adopted, as shown in Figure 8. The whole news repository is divided into two different sets, one for training the classifiers and second for testing purpose. It is pertinent to mention that a large number of text classifiers have been used for many different pur- poses. Therefore, it is not trivial to take any of them as suitable classifier for the categorization of human loss news. Most prominent and effective classification tech- niques for text data are NB, SVM, and RFs. Therefore, this research involves all these classifiers to choose the best one for the categorization of human loss news.
Multinomial naive bayes
This is a feature independent and probabilistic model based on Bayes’ theorem that assumes strong indepen- dence. It makes use of prior and posterior probabilities.
NB classifier calculates the probability with which a news report may fall in a given category and assigns it Figure 7. Preprocessing steps for a news.
Table 2. Categories of human loss news.
Category Abbreviation No. of stories Percentage
Accident A 597 12
Crime C 1511 30
Disaster D 237 5
Operation O 1246 24
Terrorism T 1432 29
Figure 8. Process for categorization of news reports.
the category that has the highest probability. The strength of this approach is that it is simple, efficient as it consumes less computational time and does not require large memory for execution. Every news story consists of English language text, and every distinct word in text represents a feature. MNB uses these fea- tures to calculate probabilities of words or terms for all classes using the training data set. These probabilities are in turn used to compute the probability scores for a newly encountered news story.
Support vector machine
SVM is a discriminative classifier that makes use of separating hyperplane. By providing labeled trained news reports as input to the algorithm, it outputs an optimal hyperplane that is used to categorize future or test news reports. News reports may or may not be lin- early separable. If news reports are not separable, then algorithms transforms it to higher dimension and try to find hyperplane in that dimension. Different kernels are used for transformation. This research employs lin- ear kernel and radial basis function (RBF) kernel for transformation. Linear kernel is used if data are line- arly separable among classes, whereas RBF (Gaussian) kernel is used if data are not linearly separable. SVM is computationally complex method as compared to NB, but is considered to be more accurate classification method.
Random forest
RF has been reported as successful method for text classification in recent literature. It is an ensemble learning method for text and other kind of classifica- tion, which compiles its decision based on involved sev- eral individual decision trees. From training set, more than one decision trees are constructed and then class is output as mode of the classes. These decision trees are constructed based upon randomly selected features (terms/words) from news reports. These features are selected with replacement. For a given news story, the class which is most frequently resulted by the trees in the forest is considered to be the most appropriate class.
Variants of selected classifiers
In general, the text classification problems are of empirical nature and most of the text classifiers apply certain preprocessing techniques on the data set before actually applying the classifiers on the considered data elements. Therefore, this research also involves com- monly used natural language processing (NLP) tech- niques, namely, stemming15 and TF–IDF weighting.26 Stemming reduces a word to its base word, whereas
TF–IDF associates a weight to a given word based on its frequency of occurrence in a given document and in the whole corpus of data. Thus, for each approach, four different variants are considered, that is, simple version, stemmed words are represented by super- scriptedS, TF–IDF weighted words are represented by superscripted T, and both stemming and TF–IDF weighted words are represented by superscriptedTS.
Experiments and results
The principal objective of this research is to find the most accurate text classifier for human loss new reports.
In order to find the best classifier among 16 variants of three classification techniques, rigorous experiments have been conducted and evaluated using appropriate evaluation measures.
Experimental setup
Tenfold cross-validation has been employed to conduct experiments. That is, every experiment involves a train- ing set consisting of 90% of instances in the data set, whereas, the test set consists of the rest of the 10% of instances in the data set. It is ensured that this division is stratified, that is, the percentage of instances for each category in the training and test sets remains the same.
This whole process is repeated 10 times. Average results of these 10 iterations have been reported. In the end, a discussion on the obtained results provides a synthesis of all the experiments.
Evaluation measures
The evaluation measures used to evaluate this work are presented in Table 3. They include the accuracy, time to train and classify the instances, receiver operating curve, and area under receiver operating curve.
Accuracy. Accuracy is a measure to evaluate how cor- rectly a classifier is classifying the news reports. It is the percentage of number of reports correctly predicted over the total number of reports in the news repository.
It is a simple measure but reflects the overall correct- ness of the proposed method. In this article, accuracy is denoted bya.
Table 3. Evaluation measures.
Evaluation measure Abbreviation
Accuracy a
Time t
Receiver operating characteristic curve ROC curve
Area under ROC AUROC
Time. Time is another measure to judge the perfor- mance of a classifying algorithm. It is measured in sec- onds. In this work, the reported time consists of the time taken by each classifier to train a model and to make all the predictions in aforementioned settings.
Time taken is denoted byt.
Receiver operating characteristic (ROC) curve. ROC curve is a good way to judge performance of two competitive classifiers. In an ROC curve, true positive rate (sensitiv- ity) is plotted as a function of false positive rate (speci- ficity). The curve that is closer to the ideal is considered to be the better that the other.
Area under ROC. Area under ROC tells how close a clas- sifier is to perfection. It ranges from 0 to 1. An area of 1 represents perfection and an area of 0.5 or below is con- sidered to be worse than even a random classifier.
Results and discussion
Variants of multinomial NB. Overall and classwise accura- cies of all four variants of MNB are presented in Table 4. It is clear from Table 4 that all these variants performed equally well against news reports in the con- sidered repository. Overall accuracy of all four is almost 90% and equal at 1% level of statistical signifi- cance. Classwise accuracies are almost same at 3% level of statistical significance for each class for all variants.
Also accuracies of all classes within each variant are
not differing more than 5%, which shows that MNBis not biased toward classes having greater number of news reports. Furthermore, the results show that the use of weighted terms and stemming have no significant impact on improving the accuracy of results. Thus, all the variants seem to be equally good. Hence, if any one of the variants needs to be picked from these results, then it should be the simplest variant as it does not involve preprocessing overheads. Thus, simple MNB is considered to be an overall best variant from these experiments.
Variants of SVM. Overall and classwise accuracies of all eight variants of SVM are shown in Table 5. It is clear from overall accuracy that all variants of SVM per- formed equally well. Accuracy of all eight is almost 90% and equal at 2% level of statistical significance.
Classwise accuracies of four variants of RBF kernel are significantly lesser than four variants of linear kernel for classes A and D. For rest of the classes, they are almost similar at 5% level of statistical significance.
Difference between accuracies of all classes within each variant is significant and more than 5%, which shows that SVM is biased toward those classes that have greater number of news reports in the repository.
SVMT RBFð Þ is picked as the best variant as its overall class accuracy is best as compared to the rest of the variants. The TF–IDF representation helps assigning appropriate weights to the terms meaningful for classi- fication purpose and also reduce the impact of stop- words. Therefore, SVMT RBFð Þ performs better than other variants.
Variants of RF. The experimental results for the experi- ments conducted with the variants of RF are shown in Table 6. It is clear from overall accuracies that all var- iants of RF performed equally well. Accuracy of all four is almost 86% and equal at 1% level of statistical significance. Classwise accuracies are almost same at 5% level of statistical significance for each class for all variants except D. Difference between accuracies of all classes within each variant is significant and more than 5% which means RF is biased toward those classes Table 4. Classwise and overall accuracyaof all preprocessing
variants forMNB.
MNB MNBS MNBT MNBTS
A 85.6 84.5 87.1 86.1
C 89.0 88.5 89.9 88.8
D 89.0 91.8 90.4 91.8
O 90.9 90.3 90.9 88.3
T 89.3 89.3 88.4 88.4
Overall 89.2 88.9 89.3 88.3
MNB: multinomial Naı¨ve Bayes.
Bold values show best performing approach for each category.
Table 5. Classwise and overall accuracyaof all variants of SVM.
SVMð ÞR SVMS Rð Þ SVMT Rð Þ SVMTS Rð Þ SVMð ÞL SVMS Lð Þ SVMT Lð Þ SVMTS Lð Þ
A 80.4 79.4 82.5 82.5 87.1 87.6 88.1 88.1
C 92.3 92.5 92.7 93.4 92.1 89.9 92.1 91.2
D 67.1 68.5 72.6 74 71.2 75.3 75.3 79.5
O 91.6 91.4 93 93 92.4 91.4 93.5 93
T 91.7 89.5 93 90.1 91.3 88.4 89 88.6
Overall 89.4 88.6 90.7 90.1 90.4 88.8 90.2 89.9
SVM: support vector machine.
having greater number of news reports.RFT is selected as the best variants as its overall class accuracy is better than the rest of the variants.
Comparison for best variant
Results of all variants of each approach have shown that simple MNB, SVMT RBFð Þ, and RFT are selected as the best variants for each classifier. In order to choose the overall best classifier, a detailed comparative analy- sis for all the selected variants has been conducted. It is pertinent to mention that TF-IDF-based preprocessing performs better than stemming and simple preproces- sing. The reason is that TF–IDF assigns meaningful weights to the terms which are meaningful for classifi- cation purpose. Similarly, it also tends to assign negligi- ble weights to the stop-words.
Figure 9 shows classwise accuracy of all three best variants. It is very clear that these three variants per- form equally well at less than 5% level of statistical sig- nificance against C, O, and T classes. But performance ofRFTis degraded as compared toMNBandSVMT RBFð Þ (also shown asSVMT Rð Þdue to lack of space) against A and D classes. It means that it is performing poorly against classes having fewer number of news reports. A comparison ofMNBandSVMT RBFð Þshows both are per- forming equally well against A class at 5% level of sta- tistical significance but MNB is outperforming SVMT RBFð Þ against D class quite significantly. Hence, keeping all this in view, it can be claimed thatMNBis outperforming other two best competitors in terms of classwise accuracy.
Table 7 shows the comparison of all best variants in terms of accuracy, time, and area under the ROC curve.
It shows that the accuracy of all three selected variants MNB,SVMT RBFð Þ, andRFT is comparable and is almost the same.MNBandSVMT RBFð Þare equal at 1% level of statistical significance, whereas all three are equal at 3% level of statistical significance. It can be inferred that this little difference is by chance. Hence, all three are equally well as for as accuracy is concerned.
Similarly, in terms of time taken to train and predict news reports, the statistics show thatMNBoutperforms
both SVMT RBFð Þ and RFT by a great margin. MNB is almost three times faster thanSVMT RBFð Þand almost 27 times faster thanRFT. This difference is huge and it will grow more as number of news reports increases. The reason is, by nature, MNB involves simple arithmetic and no complex procedure is involved in it that makes it computationally less complex and less expensive than SVMT RBFð ÞandRFT.
Table 7 shows the AUROC comparison for the selected variants. It is evident that all three best variants are performing equally well. The value of 0.9 or above is considered a very good result, and it is clear that all the selected variants have higher values than 0.9.
Table 8 shows classwise area under ROC of all three variants. Whereas Figure 10 shows ROC curves of all three best variants classwise. Performance of all three best is close to perfection as per statistics listed in Table 8. All are above 0.9 which is considered excellent and close to perfection. ROC curves shown in Figure 10 Table 6. Classwise and overall accuracyaof all variants of
random forest.
RF RFS RFT RFTS
A 66 65.5 68.6 64.4
C 90.5 91 90.3 90.3
D 61.6 65.8 68.5 63
O 90.6 89 89 89
T 91.9 90.9 92.4 91.7
Overall 86.7 86.3 87.1 86.1
RF: random forest.
Bold values show best performing approach for each category.
Table 7. Comparison ofa,t, and AUROC for all best variants.
MNB SVMT RBFð Þ RFT
a 89.1% 89.9% 87.1%
t 3 s 10 s 82 s
AUROC 0.979 0.960 0.981
AUROC: area under receiver operating characteristic; MNB:
multinomial Naı¨ve Bayes; SVM: support vector machine; RF: random forest.
Figure 9. Comparison of classwise accuracy of all best variants.
Table 8. Classwise comparison of all best variants.
A C D O T
MNB 0.993 0.973 0.998 0.982 0.973
SVMT RBFð Þ 0.975 0.954 0.989 0.981 0.939
RFT 0.984 0.974 0.998 0.99 0.976
MNB: multinomial Nay¨ve Bayes; SVM: support vector machine; RF:
random forest.
also testify this as all curves are away from diagonal line (random classifier line) and toward and near to top left corner (0, 1) of the chart and that corner belongs to perfection.
After analyzing accuracy, time, areas under ROC, and ROC curves, it can be claimed that MNBis accu- rate, near to perfection and efficient in terms of time than the other two selected variants. Therefore, it can be concluded from this research thatMNBis overall the best from all the considered variants for the classifica- tion of human loss news classification.
Security profiling and data visualization The input data compile a useful repository for the fur- ther analysis. The analyzers can then generate expedient reports from this repository reflecting security situation in a given geographical or political unit (city, district, division, province, and so on) in a given period of time.
News classifier in data ingestion
In the proposed system, the news crawler gets news stories, and among these stories, the human loss news are retained, while the others are discarded. These news
are passed to the news classifier module discussed in this section which assigns it a category automatically.
Figure 11 shows how this classifier is integrated with the data ingestion system. It can also be observed that the data ingestion system also records other meta- information related to the extracted news including date, Islamic month and date, source of data, and loca- tion of data. It also extracts statistics with the help of preprocessor and fills the fields showing number of casualties and injured people during an incident. All this information is shown to a data entry operator, whose job is to just quickly vet the extracted informa- tion, thus resulting into a semi-automatic data acquisi- tion system. The data entry operator may further enrich or correct the extracted information where required.
For instance, he may add some missing information, for instance, responsibility claim, target, and so on. The data entry operator may also correct the outcome of the classifier whenever required.
Security profiling based on gathered statistics
Based on the raw data, security profiles can be created for different geographical regions. The data shown in Table 9 are a comparison report showing the number of Figure 10. Comparison of ROC curves for best variants.
ROC: receiver operating characteristic.
different types of incidents happened in the provinces of Pakistan during 2013 and 2014. The number of different types of incidents is a significant parameter for security profiling. While at the same time, the meta-information extracted from the news, including number of deaths, injuries, type of event, and so on, helps enriching these parameters further. The whole information is subse- quently graded based on the number and types of the incidents and the number of injuries and deaths in those incidents in a given period, thus generating a security pro- file score of each city in a specific period.
Thus, the user can create a report that shows events of a specific type, in a given period of time, having
more than 10 casualties, in a given region by applying different filters. These reports can be materialized and archived in different suitable and useful file formats.
This involves assigning weightage to each category and number of dead and injured persons. This score can be used to draw a layered map showing peaceful areas and the crime and terrorism pockets in the country
tr= Xn
c=0
vc:rcr
" #
+vd:ddr+vi:dir ð1Þ
where vc represents the weightage of incidents of a given type, rcr represents the number of events of Figure 11. Sample heat-map for incidents in Pakistan.
Table 9. Statistics of events took place in different provinces.
Crime Disaster Operation Terrorism Total
Year 13 14 13 14 13 14 13 14 13 14
AJK 1 1 2
Balochistan 26 28 11 4 3 1 2 1 3 79
Capital 2 1 3
Fata 12 2 14
KPK 25 4 12 6 2 7 12 2 5 75
Punjab 16 6 3 1 3 1 30
Sindh 17 5 1 8 1 39 24 1 2 24 122
AJK: Azad Jammu Kashmir; KPK: Khyber Pakhtunkha.
category ‘‘c’’ in region ‘‘r’’ in a given period of time.
Similarly,vdrepresents the weightage for a single casu- alty andvi represents the weightage for a single injury in a given event. Whereas, ddr and dir represent total number of deaths and injuries in a region ‘‘r’’ in a given period of time, respectively. Higher score reflects that there are numerous incidents in a given city and its security profile is weak, whereas, low score reflects peace.
Visualization and reporting
The visualization and reporting module represents the data in tabular and graphical formats. It is supported by some customized reports that are frequently required. Furthermore, a flexible data extractor is also
part of this module that allows the users to apply filters on any type of attributes shown in visualization of number of incidents and the injured and dead people in those incidents in a given period of time in the whole country. Similarly, Figure 12 shows another graphs presenting the percentage of people who got injured in accidents happened in different provinces. It shows that nearly 82% of injuries happened in Sindh province which is second largest province in the country in terms of populations, while nearly 6% people got injured in Punjab, the largest province in terms of population.
Similarly, the visualization component helps generating the graphical form of many useful reports. For instance, Figure 13 shows different types of events occurred during a specified period; while Figure 14 shows the heat-map of incidents took place, where the radius of the circle reflects the human loss in the form of deaths and injuries took place in the incident, it also shows the details of the incident pointed to by the mouse.
Conclusion
This research has proposed an architectural framework for accumulating human loss news from news portals, so as to process them and build security index for differ- ent cities and regions. To this end, all the components have been discussed, while data accumulation, classifi- cation, and data visualization components have been developed. In order to implement the classification component widely used, text classifiers have been used to categorize human loss news. A repository of such Figure 12. Pie-chart showing percentage injured persons per
province for accident category.
Figure 13. Different types of incidents occurred during a specific period.
news has been generated by designing and developing a customized web crawler that crawled five top news web- sites of Pakistan. All the collected news are manually checked and categorized into appropriate classes that include accident, crime, disaster, operation, and terror- ism. These classes have been provided by the domain experts. This repository served as a gold standard to empirically evaluate some widely used text classifiers including MNB, SVM, and RF. Based on certain pre- processing techniques, 16 variants of these classifiers have been tested and evaluated using appropriate eva- luation measures. The study concludes that MNB is best of them as it achieved 89% overall accuracy and more than 85% accuracy against all classes. Similarly, in terms of time taken, it is the most efficient classifica- tion approach among the considered classifiers. Its overall area under ROC was greater than 0.9 that is considered excellent. Classwise area under ROC for all classes was also greater than 0.9.
Although the number of news reports in the reposi- tory was enough for this study, yet in near future, more reports will be collected to extend size of this set to ver- ify the results on bigger set. In this research, news reports were categorized to only one category but there is a need to classify some news reports to more than one category. Multiclass labeling of such reports is a possible extension to this work. In this study, the dupli- cate news reports were detected manually. There is a need to devise some automated mechanism for dupli- cate detection of news reports, which demands an
automatic duplicate detector for human loss news.
Similarly, another possible extension to this work involves the summarization of these news reports and the extraction of useful statistical and contextual infor- mation from the news report.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial sup- port for the research, authorship, and/or publication of this article: This research work was supported by Abu Dhabi Faculty Research grant.
ORCID iDs
Adnan Abid https://orcid.org/0000-0003-2602-2876 Muhammad Shoaib Farooq https://orcid.org/0000-0002- 4095-8868
References
1. Iqbal R, Butt TA, Afzaal M, et al. Trust management in social internet of vehicles: factors, challenges, blockchain, and fog solutions.Int J Distrib Sensor Netw2019; 15(1):
1–22.
2. Barka E, Kerrache CA, Benkraouda H, et al. Towards a trusted unmanned aerial system using blockchain for the Figure 14. Heat-map of different incidents.
protection of critical infrastructure. Trans Emerg Tele- commun Technol. Epub ahead of print 29 July 2019.
DOI: 10.1002/ett.3706.
3. Saleem S, Dilawari A, Khan UG, et al. Stateful human- centered visual captioning system to aid video surveil- lance.Comput Electr Eng2019; 78: 108–119.
4. Ceri S, Abid A, Helou MA, et al. Search computing:
managing complex search queries.IEEE Internet Comput 2010; 14(6): 14–22.
5. Guo Y, Tang H, Song L, et al. ECON: an approach to extract content from web news page. In:Proceedings of the 2010 12th international Asia-Pacific web conference (APWEB’10), Busan, South Korea, 6–8 April 2010, pp.314–320. New York: IEEE.
6. Rathee G, Sharma A, Kumar R, et al. A secure commu- nicating things network framework for industrial iot using blockchain technology. Ad Hoc Netw 2019; 94:
101933.
7. Yang Y and Liu X. A re-examination of text categoriza- tion methods. In:Proceedings of the 22nd annual interna- tional ACM SIGIR conference on research and development in information retrieval (SIGIR’99), Berke- ley, CA, 15–19 August 1999, pp.42–49, New York:
ACM.
8. Dumais S. Using SVMs for text categorization. IEEE Intell Syst1998; 13(4): 21–23.
9. Joachims T. Text categorization with support vector machines: learning with many relevant features. In:Pro- ceedings of the 10th European conference on machine learning (ECML’98), Chemnitz, 21–23 April 1998, pp.137–142. London: Springer.
10. Rish I. An empirical study of the Naı¨ve Bayes classifier (Technical report), https://dominoweb.draco.res.ibm.
com/db24eb109a77428785256aff005d3df2.html
11. Breiman L. Random forests.Mach Learn2001; 45: 5–32.
12. Biau G. Analysis of a random forests model. J Mach Learn Res2012; 13: 1063–1095.
13. Hsu C-W, Chang C-C and Lin C-J. A practical guide to support vector classification (Technical report), 2003, https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf 14. Lewis DD and Ringuette M. A comparison of two learn-
ing algorithms for text categorization. In: Third annual symposium on document analysis and information retrieval,
Las Vegas, Nevada, 11 April 1994, p.33. Information Sci- ence Research Institute, University of Nevada.
15. Moral CD, De Antonio A, Imbert R, et al. A survey of stemming algorithms in information retrieval.Inform Res Int Electron J2014; 19(1): n1.
16. Hull DA and Grefenstette G. A detailed analysis of Eng- lish stemming algorithms (Technical report, Xerox Research and Technology), 1996, http://citeseerx.ist.p- su.edu/viewdoc/summary?doi=10.1.1.68.2870
17. Ting SL, Ip WH and Tsang AHC. Is Naı¨ve Bayes a good classifier for document classification? Int J Softw Eng Appl2011; 5: 37–46.
18. Kibriya AM, Frank E, Pfahringer B, et al. Multinomial Naı¨ve Bayes for text categorization revisited. In:Proceed- ings of the 17th Australian joint conference on advances in artificial intelligence (AI’04), Cairns QLD, Australia, 4–6 December 2004, pp.488–499. Berlin: Springer.
19. Frank E and Bouckaert RR. Naı¨ve Bayes for text classi- fication with unbalanced classes. In: Proceedings of the 10th European conference on principle and practice of knowledge discovery in databases (PKDD’06), Berlin, 18–
22 September 2006, pp.503–510. Berlin: Springer.
20. Amaratunga D, Cabrera J and Lee Y-S. Enriched ran- dom forests.Bioinformatics2008; 24(18): 2010–2014.
21. Xu B, Guo X, Ye Y, et al. An improved random forest clas- sifier for text categorization.JCP2012; 7(12): 2913–2920.
22. Xu B, Li J, Wang Q, et al. A tree selection model for improved random forest. Bull Adv Technol Res 2012;
6(2): 1.
23. Watanabe K. Newsmap: a semi-supervised approach to geographical news classification.Digit Journal2018; 6(3):
294–309.
24. Watanabe K and Zhou Y. Theory-driven analysis of large corpora: semisupervised topic classification of the UN speeches. Soc Sci Comput Rev. Epub ahead of print 21 February 2020. DOI: 10.1177/0894439320907027.
25. Rodosthenous C and Michael L. Using generic ontologies to infer the geographic focus of text. In:International con- ference on agents and artificial intelligence, Funchal, 16–
18 January 2018, pp.223–246. Berlin: Springer.
26. Manning CD, Raghavan P and Schu¨tze H. Introduction to information retrieval. Cambridge: Cambridge Univer- sity Press, 2008.