The high speed and massive volume of news data make it well suited to pattern and trend analysis, which is what we set out to do in this work. In the first section, 1.1, we highlight the background of these tasks: their current status, the factors that are missing, and the challenges that follow. In the second section, 1.2, we present the goals of the work and our proposed solutions to overcome the growing challenges.
In section 1.3 we list the contributions of the work, and the organization of the rest of the thesis is laid out in the last section, 1.4.
Motivation
The following work lies in the Natural Language Processing domain and aims to reason over textual data. The textual data used in this study consists of news articles, which grow rapidly with each unit of time. The imperative for this type of analysis is that the ever-increasing volume of data must be passed through some summarization or compression mechanism that conveys the full insight of the data without requiring the entire dataset to be retained.
In this chapter, we highlight the problem we aim to tackle and the method we propose for addressing it.
Objectives of the Thesis
Thesis Contributions
Organization of the Thesis
In this section, we take a closer look at some of the most notable work of the last decade on natural language processing and knowledge extraction from textual data. In the first part, we discuss semi-supervised text classification, in which labels or knowledge are provided for only a subspace of the dataset and the whole dataset is then classified. Numerous researchers have worked on text classification over the years; in [6], the authors conducted a survey of such work and concluded that the nature of the data plays a more important role than the selection of the classification algorithm.
In [11], the authors proposed a heterogeneous graph attention network that learns the importance of adjacent words by treating them as attention-based neighboring graph nodes, which they used to classify short texts from Twitter and other social media; this is one example of semi-supervised learning. To address the problem of missing data, the authors used TMix, an interpolation-based approach that generates a large amount of interpolated text, which improved the performance of their trained model.
Active Learning Based Approach
Deep Learning Based Approach
In [23], the authors performed clinical text classification using active and deep learning techniques, and in [24], the authors implemented a malicious content detection trigger system to prevent backdoor attacks using an LSTM architecture. In [25], the authors used a graph convolutional network to learn word co-occurrence and document-word relations for text classification, and in [26], the authors adopted an ensemble-based text classification model using SVM, naïve Bayes, random forest, and a deep convolutional network. In [27], the authors split the work into three subtasks: entity recognition, entity classification, and relation extraction from medical texts.
In [31], the authors used four characteristic deep network architectures to extract drug-drug interactions from biomedical texts, categorizing the interactions into five classes.
Unsupervised and HMM Based Approach
Event and Knowledge Extraction from Text
In [47], the OpenIE approach was used to generate a binary relational knowledge graph from texts, since OpenIE is able to handle complex sentences; the proposed approach outperforms previous state-of-the-art methods with an F1-score of 0.8827. In [48], the authors proposed a new method, implemented in Java, to extract rules from scientific texts, specifically gynecology-related cases, using the WordNet dictionary, the GATE platform, the ProtegeOWL API for working with OWL ontologies and SWRL rules, and the Jena API, achieving an excellent F-measure of 95.83%. In [49], the authors analyzed offline-to-offline trade texts through bilingual text mining of tweets from America and China to identify key entities and to further understand the relationships between those entities.
Figure 1 below portrays the workflow of our system, showing all methods and phases of our work in sequence, from data collection, pre-processing, and classifier training to the analysis of time series data using a number of tools and approaches. In this chapter, we present the details of our work to facilitate a clear understanding of our methodology. In the initial phase, we began by collecting textual news article data for the sole purpose of constructing the training set.
As our source, we chose online news platforms [50], the online counterparts of physical newspapers, which contain a massive collection of news articles on daily events from both current and past periods.
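As an illustration of this collection phase, the sketch below fetches and parses pages from a news archive. The URL and CSS selectors are hypothetical placeholders, since the markup of the actual portal cited in [50] is not reproduced here.

```python
# A hypothetical sketch of the article-collection step; the archive URL and
# CSS selectors are placeholders and would need to match the real portal [50].
import requests
from bs4 import BeautifulSoup

ARCHIVE_URL = "https://example-news-portal.com/archive"  # placeholder URL

def fetch_articles(page: int) -> list[dict]:
    """Download one archive page and return its title/body pairs."""
    resp = requests.get(ARCHIVE_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    articles = []
    for item in soup.select("article"):  # assumed page markup
        title, body = item.find("h2"), item.find("p")
        if title and body:
            articles.append({"title": title.get_text(strip=True),
                             "body": body.get_text(strip=True)})
    return articles

# Example: collect the first ten archive pages into one training pool.
pool = [a for p in range(1, 11) for a in fetch_articles(p)]
```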
Pre-Processing Collected Data
Target Data Points Extraction
Predictive Model Training Using Different Features
Large Data Set Construction & Experimenting with Different Features
Test Data Collection & Comparing Models of Different Volumes
Fetching Time Series Data & Mining Meaningful Knowledge
In this section, we aim to shed light on the performance of classifiers trained on datasets of various sizes and to select an optimal model for knowledge extraction from time series data. The model derived in [5], a logistic-regression-based classifier with an accuracy of 83.4%, serves as the primary model derived from a small volume data set. For investigation, we evaluated this primary logistic regression model on a newly constructed test data set (Dt) of 1000 samples, on which the model obtained 61% accuracy.
The most prominent reason for the decline in the model's performance is that in this work we define the problem as a multi-label classification problem, since many news articles may contain several related violent events, such as rape-murder or rape-kidnapping. In [5], the model was trained under a one-violent-event-per-article format, which was a major drawback of that method, so in this work we trained the classifiers taking the existence of multiple classes into account. The evaluation measure used in the experiment is accuracy, i.e., the total number of correctly classified violent events divided by the total size of the data set.
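In symbols, letting Dt denote the test set, this measure reads:

```latex
\mathrm{Accuracy} \;=\; \frac{\#\{\text{correctly classified violent events}\}}{|D_t|}
```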
After producing a small-scale data set (Dcsd) of 1200 data samples covering six different violent events, 200 data samples of non-violence articles were inserted into the set. Five predictive classifiers were then trained on the final compiled data samples, each using one of five feature representations: BERT, TF-IDF, Word2Vec, N-grams, and FastText. Below is a visualization of the different feature-based classifiers on both the training and test sets derived from the small volume data set.
The horizontal axis indicates the accuracy band, the different feature-based classifiers are placed along the vertical axis, and the training and test results are distinguished by two characteristic colors at each classifier's coordinate.
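As a minimal sketch of how one such feature-based classifier can be trained, the snippet below pairs TF-IDF features with a logistic regression model in scikit-learn; the loader function and split parameters are assumptions, and the remaining four classifiers would swap in BERT, Word2Vec, N-gram, or FastText representations.

```python
# A minimal sketch of one of the five feature-based classifiers (TF-IDF paired
# with logistic regression, as in [5]); load_dcsd() is a hypothetical loader
# for the compiled 1200 violent + 200 non-violent samples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts, labels = load_dcsd()  # hypothetical: article texts and event labels

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # TF-IDF feature extraction
    LogisticRegression(max_iter=1000))              # linear classifier head
clf.fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```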
Performance Evaluation of the Large Volume Classifiers
Performance Comparison between the Large and Small Volume Classifiers
Pattern and Trend Analysis
- Event Frequency Histogram Analysis
- Hypothesis Test - Most Frequently Occurring Violent Event
- Yearly Violent and Non-Violent Event Article Comparison
- Hypothesis Test - Most Event-Impactful Year
- Yearly Violent and Non-Violent Event Article's Occurrence Ratio Comparison
- Probability Distribution of Events on Different Day Intervals
- Area Histogram of Different Events Over the Demi-Decade
- Violent Event Dominance Comparison Over Different Months
- Hypothesis Test - Half-Yearly Period Event Ratio
- Active Age Group Analysis in Violent Events
- Hypothesis Test - Event-Associated Most Active Age Group
- Monthly Heat-Map of Violent Events Over the Years
- Weekly Heat-Map of Violent Events Over the Years
- Hypothesis Test - Event Distribution Ratio on Different Weekdays
- Geographical Area Analysis Based on Violence Intensity
- Hypothesis Test - On Divisions' Violence Event Ratio
- Correlation Test Between Divisions' Population Count and Violence Ratio
- Correlation Test Between Divisions' Area (km²) and Violence Ratio
 
To analyze the stated hypothesis, we performed the Wilcoxon signed-rank test on the event frequencies calculated over the same time intervals. In the figure above, we can see the probability distribution of the five different events at increasing day intervals ranging from 1 to 30. Below is a time series area graph for each year, with the months in numerical order from 1 to 12.
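A minimal sketch of this test, assuming paired per-interval frequency counts for two events (the arrays below are illustrative, not the actual counts):

```python
# A sketch of the Wilcoxon signed-rank comparison described above; the paired
# per-interval frequency counts below are illustrative, not the real data.
from scipy.stats import wilcoxon

event_a = [12, 9, 15, 11, 14, 10, 13, 16, 9, 12]  # event A counts per interval
event_b = [7, 8, 10, 6, 9, 11, 8, 7, 10, 6]       # event B counts, same intervals

stat, p_value = wilcoxon(event_a, event_b, alternative="two-sided")
print(f"W = {stat}, p = {p_value:.4f}")
# H0 (no difference between the paired frequencies) is rejected when p < 0.05.
```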
H0: There is no difference in violence registrations between the two half-yearly periods of the year. As the image above shows, across all years of news article data, the most active age group with regard to violent events is 20-40, followed by the 40-60 group, with the teen group (1-20) in third place. This indicates that the distribution of the 20-40 age group differs from that of the other age groups.
In this section, we have drawn a heat map for all the target violent events over the five years of news data. The heat map represents the number of cases found in each month of news article data for each violent event. From this, we can identify the months in which each violent event is most dominant.
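A sketch of how such a heat map can be produced, assuming a hypothetical case table with one row per detected event carrying event, year, and month columns:

```python
# A sketch of the monthly heat-map construction; "violent_events.csv" is a
# hypothetical table with one row per detected case ('event', 'year', 'month').
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("violent_events.csv")

# Count cases per (event, month) cell and pivot into a matrix for plotting.
counts = df.groupby(["event", "month"]).size().unstack(fill_value=0)

sns.heatmap(counts, annot=True, fmt="d", cmap="Reds")
plt.xlabel("Month (1-12)")
plt.ylabel("Violent event")
plt.title("Monthly case counts over five years of news data")
plt.show()
```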
H0: There is no difference between the frequency of occurrences of violence on weekdays and on holidays. In the next experiment, we show the rate of violent events across all geographical areas of Bangladesh. We also examine whether any of the violent events are interconnected.
To do this, we selected a random date from the half-decade data set together with a random number of day intervals, and calculated the frequency of each violent event within that window.
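A minimal sketch of this sampling procedure, assuming a hypothetical date-indexed table of labeled events from the half-decade data set:

```python
# A sketch of the random interval sampling step; "violent_events.csv" is a
# hypothetical table of labeled events with a parseable 'date' column.
import random
import pandas as pd

df = pd.read_csv("violent_events.csv", parse_dates=["date"])

def sample_interval_frequencies(df: pd.DataFrame, max_days: int = 30) -> pd.Series:
    """Pick a random start date and interval length, then count each event."""
    start = random.choice(df["date"].tolist())
    days = random.randint(1, max_days)
    window = df[(df["date"] >= start) &
                (df["date"] < start + pd.Timedelta(days=days))]
    return window["event"].value_counts()

print(sample_interval_frequencies(df))
```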
Future Work
We performed several experiments on the collected data set and subjected the data to various statistical tests for trend analysis; the remaining summary is briefly given in the following section, 5.1. In the last section of this chapter, 5.2, we discuss some potential future work related to this study. We applied the model to a half-decade time-series data set to extract insightful patterns related to these violent events and proposed several hypotheses based on the observed outcomes, which we ultimately evaluated using various statistical tests to establish valid facts about these violent events.
A systematic review of natural language processing and text mining of symptoms from patient-written electronic text data.
Semi-supervised text classification framework: A review of dengue landscape factors and satellite earth observation.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.
GAN-BERT: Generative adversarial learning for robust text classification with a bunch of labeled examples.
Large-scale relation extraction from web documents and knowledge graphs with human-in-the-loop.
Automated text classification of near misses from safety reports: An improved deep learning approach.
A Survey of Deep Learning Approaches to Medication and Unwanted Drug Extraction from Clinical Text. Journal of the American Medical Informatics Association.
Automatic ontology construction from text: a review from shallow to deep learning trend. Artificial Intelligence Review.
Full System Workflow Diagram
Performance Comparison of Different Feature-Based Classifiers on Small Volume Set
Performance Comparison of Small and Large Volume Set Classifiers Based on the Test Set
All Events Frequency Histogram
Yearly Violent & Non-Violent Event Article Comparison
Yearly Violent and Non-Violent Event Article’s Occurrence Ratio Comparison
Violent Event Occurrence Comparison Over the Demi-Decade
Probability Distribution of Events on Different Day Intervals
Area Histogram of Different Events
Violent Event Dominance Comparison Over Different Months
Active Age Group Analysis in Violent Events
Monthly Heat-Map of Violent Events Over the Years
Weekly Heat-Map of Violent Events Over the Years
Geographical Area Analysis Based on Violence Intensity
Pearson Relation Between Violence Events
Performance Comparison of Different Feature Classifiers Using the Large Volume Data Set on the Test Set
P-Value of Wilcoxon Signed-Rank Test of Different Events
Two-Tail Wilcoxon P-Value Table of Different Years' Violence Record Frequency
Two-Tail Wilcoxon Test P-Value Comparison Table of Different Age Groups
Events' Wilcoxon Signed-Rank Test P-Values Comparison Table of Weekdays
Wilcoxon Signed-Rank Two-Tailed Test P-Value Comparison Table of Divisions
Pearson Relation Value Table Between Violence Events