IMPROVEMENT ON EXTRACTION TECHNIQUES AND OPTIMIZATION FEATURE SELECTION : THE USE OF
SENTIMENT ANALYSIS
Fazidah @ Idah Binti Wahit 1,*, Nor Samsiah Sani 1, Salwani Abdullah 1
1 Fakulti Teknologi dan Sains Maklumat, Universiti Kebangsaan Malaysia
*[email protected]
ABSTRACT
Given the enormous amount of textual data produced and exchanged online, such as news stories, posts, tweets and product reviews, effective Text-Feature Selection (TFS) is becoming increasingly important. The task is difficult because text data are high-dimensional. Existing methods often rely on population-based meta-heuristic algorithms, which improve the consistency of the selected feature set and the classification results. Such methods, however, are classifier-dependent and generate a large number of features. To resolve these issues, this research first aims to establish a method based on Symbiotic Organisms Search (SOS) to improve TFS. Secondly, it aims to improve Text Feature Extraction (TFE) by applying a topic modelling algorithm to social media comments in order to extract the most relevant topics from comment datasets. The efficacy of the proposed method has been compared with Particle Swarm Optimization (PSO), Harmony Search (HS) and Genetic Algorithm (GA) methods.
Experimental results and statistical analysis validate that the proposed method outperforms the existing methods. The proposed method has theoretical implications for future research on analysing data generated through social networks and social media. It also has general practical implications for designing systems that can provide conclusive reviews on social issues. The improvement is obtained for both English and Malay datasets, suggesting that the proposed TFS and TFE system generalises across dataset languages.
Keywords: Topic modelling extraction; Optimisation Feature Selection; Social Media; Classifier; Meta Heuristic Algorithm
1. INTRODUCTION
Sentiment analysis is the field of study that analyses people’s opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the active research areas in natural language processing and is also widely studied in data analytics. In fact, this research has spread outside of computer science to the management sciences and social sciences due to its importance to business and society. The growing importance of sentiment analysis coincides with the growth of social media such as reviews, forum discussions, blogs, micro-blogs, Twitter, and social networks.
For the first time in human history, we now have a huge volume of opinionated data recorded in the digital form for analysis.
Social media is a huge virtual space in which individuals express and share opinions, and the comments posted there influence many aspects of life, which makes them an important source for sentiment analysis. One weakness of text/sentiment classifiers lies in the extraction process, where different words can receive the same weight. Topic modelling, which avoids giving the same weight to different words, has not yet been used as the extraction step in text/sentiment classifiers.
Another problem with text/sentiment classifiers is the high dimensionality of the feature space. Conventional feature selection (FS) approaches are problem-dependent and ignore dependencies between features. Consequently, much work has been done to improve FS through optimisation-based feature selection, one example being Harmony Search feature selection. However, the results show that this work still does not solve the main problems, in particular a slow convergence rate and entrapment in local optima. This research therefore proposes and discusses Symbiotic Organisms Search (SOS) as a feature selection method to address these issues.
2. PROBLEM BACKGROUND
The issue is divided into several aspects. The first is the weakness in extracting important words or terms from comments in order to differentiate between topics or sentiments. Terms are commonly extracted from text with a syntactic representation such as Bag-of-Words (BOW) [1]. A few previous attempts have addressed term extraction with the named-entity technique [2], but this has not been found to be efficient in handling a large number of features/terms. Several techniques use a Topic Modelling algorithm to extract the most relevant topics from social media comment datasets, but these techniques still need to be improved. Syntactic words, on the other hand, can mask the central meaning or sense of a text. BOW has a further disadvantage, which leads to the second issue in text classifiers: the large number of features. Some previous work reduces the number of features with filter methods such as chi-square without checking the effect of the reduction on performance. Meta-heuristic optimisation instead chooses the best solution among many candidate feature subsets with respect to the performance target of the study. Optimisation is therefore important for determining which features matter to a text classifier among hundreds or thousands of candidate solutions.
The third issue concerns the impact of optimisation feature selection on the classifier algorithms popularly used in previous work, namely K-Nearest Neighbour, Naïve Bayes and Decision Tree. A further consideration is the effect of using different languages on text classifiers, where the key difference between languages lies in the pre-processing needed to prepare the text for the classifiers.
Therefore, based on the issues described above, this research proposes to address the weaknesses of knowledge extraction through a Topic Modelling process. Moreover, the study proposes to address the large number of features by using meta-heuristic optimisation as feature selection, choosing the best solution among the candidate solutions in order to minimise the number of features. In the final stage, the study proposes to evaluate text classification on two different languages and to evaluate various classifiers under optimisation feature selection. This proposal would help track and identify sentiment-bearing comments by making text classifier output more effective.
2.1. Research Objectives
1. To improve the reliability of feature extraction by using Topic Modelling extraction in the classifier approach.
2. To propose optimisation feature selection that mitigates the curse of dimensionality.
3. To test classifier effectiveness on two separate language datasets: Malay and English.
3. RESEARCH FRAMEWORK
Figure 1 shows the research framework. The main phase is the improvement phase, which contains the research contribution.
Figure 1. Research Framework
3.1.1 Groundwork Phase
The Groundwork phase covers problem identification, objectives, scope, constraints, proposed solutions and research methodologies. The next step is a literature review based on printed materials and materials obtained through electronic media, such as scientific reports, official reports and technical documents. The literature review is needed to give a clear account of the ideas and directions in the processes related to sentiment analysis. It also provides a clear overview of the study topic while ensuring that the study can meet the objectives outlined.
3.1.2 Induction Phase
This phase concentrates on designing the initial process for the text classifier within the proposed sentiment analysis framework, using a suitable programming language. It also involves iterating with the improvement phase and with the phase that examines the effect of using different languages. More precisely, this phase focuses on pre-processing and on the construction of the text classifier.
3.1.3 Improvement Phase
The improvement phase aims to increase the consistency of the text classifier in two parts. Firstly, it extracts the most important terms in the documents, so as to increase relevant terms and avoid non-relevant ones. Secondly, it improves the efficacy of the classifiers by using optimisation feature selection to minimise redundant features and unimportant words while avoiding the loss of useful terms. This research uses the improvement phase to propose text classifier frameworks for the considered problem domains, improving quality through Topic Modelling and optimisation feature selection. Three steps are suggested to improve text classification:
i. Improvement of feature extraction using Topic Modelling
The extraction step extracts terms from the Malay and English texts using Bag-of-Words (BOW). BOW generates a vast number of features, which increases the computational cost of learning. This can be tackled directly, by employing an effective feature selection method to reduce the BOW features, or indirectly, by semantically clustering single words to represent the texts. Accordingly, a framework of AdaBoost.MH learning using topic modelling is proposed in this research. In the proposed framework, dubbed “LDA-AdaBoost.MH”, the Latent Dirichlet Allocation (LDA) topic model is used to map the texts to a small set of latent topics, where each topic is a semantic cluster of words among the texts (Bassam). The extracted latent topics are used to represent the texts as features for AdaBoost.MH learning, as discussed in detail in Chapter IV.
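The topic-based representation described above can be illustrated with a minimal sketch. The Python code below is not the LDA-AdaBoost.MH implementation of this research. It is a small illustration, under stated assumptions, that builds a BOW encoding and then uses a plain collapsed Gibbs sampler for LDA to turn each document into a short vector of topic proportions. The function names (bag_of_words, lda_gibbs), the hyperparameters alpha and beta, the iteration count and the four example comments are all hypothetical, and the AdaBoost.MH learning step is omitted.

```python
import random


def bag_of_words(docs):
    """Tokenise on whitespace and map each distinct term to an integer id."""
    vocab = {}
    encoded = []
    for doc in docs:
        ids = [vocab.setdefault(tok, len(vocab)) for tok in doc.lower().split()]
        encoded.append(ids)
    return encoded, vocab


def lda_gibbs(encoded, vocab_size, n_topics, n_iter=200,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, not optimised).

    Returns one smoothed topic-proportion vector per document.
    alpha and beta are assumed symmetric Dirichlet priors.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in encoded]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(encoded):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(encoded):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed document-topic proportions serve as the reduced feature vectors.
    features = []
    for d, doc in enumerate(encoded):
        denom = len(doc) + n_topics * alpha
        features.append([(doc_topic[d][k] + alpha) / denom
                         for k in range(n_topics)])
    return features


# Hypothetical example comments, for illustration only.
docs = [
    "battery life is great and the battery charges fast",
    "screen is bright but battery drains fast",
    "delivery was late and the courier was rude",
    "late delivery and poor courier service",
]
encoded, vocab = bag_of_words(docs)
topic_features = lda_gibbs(encoded, len(vocab), n_topics=2)
print(len(vocab), "BOW features ->", len(topic_features[0]), "topic features")
```

Each document is now described by two topic proportions rather than one count per vocabulary term, which is the dimensionality reduction the framework relies on. In practice a tested library implementation of LDA would be preferable to this hand-written sampler.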
ii. Improvement of the feature selection process
The main aim of this step is to improve the quality of the classifiers by using optimisation feature selection to maximise accuracy while minimising the number of features as far as possible. In addition to investigating basic feature selection and comparing it with the proposed optimised feature selection, different improvement models are developed, as explained in the next chapter.
iii. Effect of Topic Modelling on improved feature selection
The effect of the Topic Modelling step on the optimisation feature selection step is also considered in this thesis. The aim is to find a sound way to compare the BOW extraction step with Topic Modelling, rather than relying on weak classifier algorithms such as K-NN, Naïve Bayes and Decision Tree.
This improvement phase is expected to increase the performance of the text classifier by using optimisation feature selection to reduce the large number of features and to avoid redundant features.
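As a rough illustration of optimisation feature selection with Symbiotic Organisms Search, the sketch below runs a simplified SOS loop (mutualism, commensalism and parasitism phases) over candidate feature subsets. It is not the method developed in this research. The continuous-to-binary encoding with a 0.5 threshold, the population size, the iteration count and the toy fitness function are all assumptions made for the example. In a real wrapper setting the quality term would be a classifier's validation accuracy rather than the hypothetical INFORMATIVE set used here.

```python
import random


def sos_feature_selection(fitness, n_features, n_organisms=10, n_iter=30, seed=0):
    """Simplified SOS; a feature is selected when its coordinate exceeds 0.5."""
    rng = random.Random(seed)

    def mask(x):
        return [v > 0.5 for v in x]

    def clip(x):
        return [min(1.0, max(0.0, v)) for v in x]

    def other(i):
        # Pick a random organism index different from i.
        j = rng.randrange(n_organisms - 1)
        return j if j < i else j + 1

    pop = [[rng.random() for _ in range(n_features)] for _ in range(n_organisms)]
    fit = [fitness(mask(x)) for x in pop]

    for _ in range(n_iter):
        for i in range(n_organisms):
            best = pop[max(range(n_organisms), key=fit.__getitem__)]

            # Mutualism: organisms i and j both move towards the best organism.
            j = other(i)
            mutual = [(a + b) / 2 for a, b in zip(pop[i], pop[j])]
            bf1, bf2 = rng.choice((1, 2)), rng.choice((1, 2))
            new_i = clip([a + rng.random() * (g - m * bf1)
                          for a, g, m in zip(pop[i], best, mutual)])
            new_j = clip([b + rng.random() * (g - m * bf2)
                          for b, g, m in zip(pop[j], best, mutual)])
            for idx, cand in ((i, new_i), (j, new_j)):
                f = fitness(mask(cand))
                if f > fit[idx]:
                    pop[idx], fit[idx] = cand, f

            # Commensalism: only organism i benefits from organism j.
            j = other(i)
            cand = clip([a + rng.uniform(-1, 1) * (g - b)
                         for a, g, b in zip(pop[i], best, pop[j])])
            f = fitness(mask(cand))
            if f > fit[i]:
                pop[i], fit[i] = cand, f

            # Parasitism: a mutated copy of i replaces j if it is fitter.
            j = other(i)
            parasite = [rng.random() if rng.random() < 0.5 else v for v in pop[i]]
            f = fitness(mask(parasite))
            if f > fit[j]:
                pop[j], fit[j] = parasite, f

    k = max(range(n_organisms), key=fit.__getitem__)
    return mask(pop[k]), fit[k]


# Toy problem: hypothetical indices of the useful features among 12.
INFORMATIVE = {0, 3, 7}
N_FEATURES = 12


def toy_fitness(selected, weight=0.9):
    """Weighted sum of a stand-in quality score and subset compactness."""
    chosen = {k for k, keep in enumerate(selected) if keep}
    quality = len(chosen & INFORMATIVE) / len(INFORMATIVE)
    compactness = 1 - len(chosen) / len(selected)
    return weight * quality + (1 - weight) * compactness


best_mask, best_score = sos_feature_selection(toy_fitness, N_FEATURES)
print([k for k, keep in enumerate(best_mask) if keep], round(best_score, 3))
```

The weighted fitness reflects the two goals stated above: keep classification quality high while selecting as few features as possible. Because every replacement is accepted only when it improves fitness, the best score in the population never decreases from one iteration to the next.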
3.2 Evaluation and Comparison Quality Phase
Two kinds of measures are used to evaluate cluster quality: internal quality measures and external quality measures [3]. An internal quality measure does not use external knowledge, such as class label information, to evaluate the produced clustering solution.
In contrast, an external quality measure depends on a labelled test set from the document corpora. It compares the resulting clusters with the labelled classes and measures the extent to which documents from the same class or category are assigned to the same cluster. In the current study, the F-measure [3], one of the most widely used measures in text mining, is used as the external quality measure.
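The external F-measure can be made concrete with a short sketch. The code below assumes the class-weighted formulation commonly used for document clustering: for each class, take the best F score over all clusters, then average those scores weighted by class size. The function name clustering_f_measure and the example labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Overall F-measure of a clustering against reference class labels."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))  # documents of class c in cluster k

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint[(c, k)]
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        # Each class contributes its best-matching cluster, weighted by size.
        total += (n_c / n) * best
    return total


# Cluster ids need not match class ids; only the grouping matters.
perfect = clustering_f_measure([0, 0, 1, 1], [1, 1, 0, 0])
merged = clustering_f_measure([0, 0, 1, 1], [0, 0, 0, 0])
print(perfect, round(merged, 4))
```

A clustering that reproduces the classes exactly scores 1.0, while placing every document in a single cluster lowers precision for each class and therefore lowers the score.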
4. CONCLUSION
The importance of this research lies in increasing the performance of text classifiers for sentiment analysis and in testing the effect of two different languages on sentiment analysis by tracking and detecting the weaknesses of text classifiers. The proposed improvement is expected to identify sentiment in Malay and English using Text Classification and to show that the implementation performs better. The proposed classifiers aim to detect and identify comments on social media, which is important because current Text Classification methods applied to news and comment topics are still not satisfactory. This work can be used to define patterns of comments or to detect sentiment more effectively by using topic modelling for extraction and optimisation-based feature selection to reduce the number of unimportant features. Finally, the significance of this study is that the analysis of each comment is expected to be faster and more accurate than manual analysis.
This research has selected sentiment analysis as its domain because it facilitates accurate analysis of comments on social media. The sentiment analysis domain includes patterns or types of comments that are more difficult to detect manually [4], because millions of people use social media daily and most of them have relied on manual analysis to gain a better understanding of current issues [4]. This research is therefore expected to be effective in evaluating and proposing Text Classification for sentiment analysis, and in improving the method for classifying each comment under certain topics in the Malay and English languages.
5. ACKNOWLEDGMENT
The authors would like to thank Universiti Kebangsaan Malaysia (UKM) under the Research University Grant (project code: GUP-2019-060) for funding and supporting this research.
REFERENCES
[1] E. Cambria and A. Hussain, Sentic Computing: Techniques, Tools, and Applications. Dordrecht, Netherlands: Springer, 2012.
[2] M. Zhiwei, M. M. Singh, and Z. F. Zaaba, “Email spam detection: A method of metaclassifiers stacking,” in The 6th International Conference on Computing and Informatics, 2017, pp. 750–757.
[3] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clustering Techniques,” 2000.
[4] B. Agarwal and N. Mittal, “Optimal feature selection for sentiment analysis,” in International Conference on Intelligent Text Processing and Computational Linguistics, 2013, pp. 13–24.