FEATURE EXTRACTION BASED ON FUZZY CLUSTERING AND EMOJI EMBEDDINGS FOR EMOTION CLASSIFICATION


Zahra Ahanin¹* and Maizatul Akmar Ismail²

¹ ² Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia

*Corresponding author: [email protected]

Accepted: 18 February 2020 | Published: 23 March 2020

Abstract: Emotion analysis is a subfield of sentiment analysis that aims to identify emotions in text. Given the volume of textual data on social media, it is becoming crucial to analyse users’ emotions and to bridge the gap between social media content and real-world activities, including market trend prediction, emotion analysis, and emotion monitoring. Despite extensive research on textual emotion analysis on social media platforms such as Twitter, the use of emojis in emotion analysis remains little studied. Natural Language Processing (NLP) applications for social media often rely on publicly available pre-trained word embeddings that either exclude emojis or cover only a limited number of them, even though the use of emojis on social media has increased drastically. In this paper, we first use fuzzy clustering to assign emojis to one or more emotion classes, and then develop pre-trained emoji embeddings (named FuzzyMoji2Vec). The proposed emoji embeddings can be used in NLP applications alongside available word embeddings. We demonstrate, for the tasks of sentiment analysis and emotion analysis, that the resulting emoji embeddings outperform existing state-of-the-art models.

Keywords: fuzzy clustering, emotion analysis, emoji embedding

1. Introduction

The rapid growth of social media platforms provides rich multimedia data at large scale for various research opportunities, such as emotion analysis, which focuses on automatically predicting the emotion (joy, sadness, fear, etc.) expressed in given content. Emotion analysis of online user-generated data has been applied in real-world settings such as business and market trend analysis, sociology, and psychology studies. Twitter, with over 500 million tweets per day, is one of the most popular social media platforms for sharing personal thoughts and feelings with friends (Yadollahi et al., 2017), and has been at the centre of emotion analysis studies in the past decade.

Properties of Twitter data such as the length of tweets (limited to 280 characters), informal text, slang, abbreviations, hashtags, and emojis distinguish tweets from ordinary text and increase the complexity of emotion analysis (Alshenqeeti, 2016; Wijeratne et al., 2017).

Emojis are ideograms regarded as the new generation of emoticons (such as “:)” and “:D”). They are widely used as non-verbal cues to represent emotional states and feelings (sadness, happiness, etc.) as well as concepts and ideas (party, weather, buildings) (Alshenqeeti, 2016; Wood and Ruder, 2016). According to recent statistics by Emojipedia¹, there are 3,019 emojis in the Unicode Standard as of March 2019, and over 5 billion emojis are sent daily. Recent studies have stressed the importance of emojis, since this feature can increase the accuracy of emotion analysis (Novak et al., 2015). It is therefore essential to automatically process and interpret text fused with emojis (Alshenqeeti, 2016; Wijeratne et al., 2017).

¹ https://emojipedia.org/stats/

The popularity of emojis on social media has attracted many researchers in the field of Natural Language Processing (NLP), including emotion analysis, which analyses users’ emotions in text. Many recent NLP systems rely on word representations in a finite-dimensional vector space. These systems mainly use pre-trained word embeddings obtained from word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), or fastText (Bojanowski et al., 2017), which contain no emojis or only a small number of them.

2. Literature Review

Emotion classification in sentiment and emotion analysis is an important and challenging task due to the ambiguous nature of emotions. Researchers use explicit indications of emotion, such as emojis, to obtain the sentiment and emotions contained in social media text and to improve classification accuracy. Understanding emojis is particularly useful because emojis are often used to complement text and carry semantic and sentiment information (Chen et al., 2018). Several approaches have already been proposed to interpret and use emojis in emotion classification. This section presents and discusses an overview of the related work.

Earlier NLP research often used emojis as a form of distant supervision to generate large amounts of annotated data, in which emojis that express happiness indicate positive polarity and emojis that express sadness indicate negative polarity. However, emojis were used only as an auxiliary signal, and research on emojis themselves is lacking (Li et al., 2017). In these studies, each emoji is usually assigned to a single emotion class, although emoji categorization can be treated as a multi-label classification problem. These studies often manually specified which emotional category each emoji belongs to (Wood and Ruder, 2016; Asghar et al., 2017).

Such manual categorization requires an understanding of the emotional content of each emoji, which is difficult and time-consuming. In addition, the large number of emojis and the frequent addition of new emojis on social media platforms make it difficult to manually update and categorize them. Moreover, any manual selection and categorization is prone to misinterpretation and may omit important details regarding usage.

Rakhmetullina et al. (2018) mapped fourteen emojis to four emotion classes (anger, joy, sadness, and surprise) based on the percentage of each emoji’s appearances in the subset of tweets with a given emotion. Vora et al. (2017) simply replaced emojis with their textual meaning based on the Unicode Consortium’s emoji definitions. Hauthal et al. (2019) went a step further and used the Unicode emoji characters to assign 86 emojis to one of the six emotional categories of Ekman et al. (1987), with the addition of a neutral category. In that study, emojis were categorized based on the emoji name, extracted from the Unicode emoji characters, and the possible synonyms associated with their description.


One way of automatically interpreting the emotional content of an emoji is to learn emoji embeddings from the words describing the emoji semantics in official emoji tables (Eisner et al., 2016). There have been only a few works on emoji embeddings, all of them recent (Eisner et al., 2016; Guibon et al., 2018).

Some studies do not embed emojis directly, but treat an emoji as a group of meta-information or words, such as its Unicode description, on which the embedding is based. Emoji2Vec (Eisner et al., 2016) uses word embedding tools to represent emojis through their Unicode descriptions. Other studies develop emoji embeddings based on the possible meanings and senses associated with their description (Ai et al., 2017). Barbieri et al. (2016) trained emoji embeddings on a large Twitter dataset of over 100 million English tweets using the skip-gram method (Mikolov et al., 2013), which improved accuracy on sentiment classification tasks. Guibon et al. (2018) proposed an emoji embedding model using CBOW, together with an emoji clustering based on these embeddings to automatically identify groups of emojis. In their unsupervised approach, they used spectral clustering to cluster the emoji vectors, which yielded more fine-grained clusters than Pohl and Rohs (2017). They manually labelled the resulting emoji clusters based on Ekman’s categories of facial emotion expressions; some clusters overlapped or were split by intensity.

The two main emotion models used in NLP tasks are the theory of Ekman (1972) and Plutchik’s Wheel of Emotion (1980). In this research we used Plutchik’s theory of emotion (Figure 1) because of its notion of polar opposite emotions. For example, joy is the opposite of sadness, and anger is the opposite of fear.

Figure 1: Plutchik’s Wheel of Emotion (1980)


3. Problem Statement

Recent research often uses pre-trained word embeddings in NLP tasks. The problem with existing word embedding models is summarized in Figure 2: the emoji features available for classification are limited to the intersection between the emojis in existing embedding models and the emojis in the actual corpus (tweets). Therefore, learning the emotion classes of more emojis and extending the emoji embeddings enlarges this intersection. This provides more features for the emotion classification task and thereby improves its discriminative power.

Figure 2: The existing word embeddings and intersection with emojis in the corpus vocabulary
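The intersection argument can be made concrete with a small sketch. The function name, the emoji names, and the toy vocabularies below are hypothetical and serve only to illustrate how extending the embedding vocabulary enlarges the set of usable emoji features.

```python
def emoji_coverage(corpus_emojis, embedding_vocab):
    """Return the emojis usable as features and the fraction of corpus emojis covered."""
    corpus = set(corpus_emojis)
    usable = corpus & set(embedding_vocab)  # intersection of corpus and embedding vocabulary
    coverage = len(usable) / len(corpus) if corpus else 0.0
    return usable, coverage


# Toy data (assumed): emojis are written as names to keep the example ASCII-only.
corpus_emojis = ["red_heart", "joy_tears", "crying_face", "fire"]
pretrained_vocab = {"the", "happy", "red_heart"}
extended_vocab = pretrained_vocab | {"joy_tears", "crying_face"}

usable_before, cov_before = emoji_coverage(corpus_emojis, pretrained_vocab)
usable_after, cov_after = emoji_coverage(corpus_emojis, extended_vocab)
```

In this toy case, adding two emoji vectors raises the share of corpus emojis that can contribute features from one quarter to three quarters.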

Current emoji embedding models often rely on the Unicode description of an emoji. However, there is no guarantee that the popular usage of an emoji aligns with its description (Wood and Ruder, 2016), so such models do not capture the dynamics of emoji usage or changes in an emoji’s intended meaning over time (Felbo et al., 2017). In other embedding approaches, such as skip-gram (Mikolov et al., 2013), representations of less frequent emojis are estimated rather poorly or are not available at all. Unlike existing emoji embedding approaches, which are constructed from the emoji description or from the correlation between emoji and text, we used the correlation between emojis and emotion labels to classify emojis into one or more emotion classes, and then created an emoji embedding based on the assigned emotion classes. In this way, less data is required, and the labelling is automatic and needs less human involvement.


4. Method

To cluster the emojis into emotion classes, we used machine learning algorithms. Given an emoji set X = {x1, x2, ..., xk} and emotion classes Y = {y1, y2, ..., yn}, each emoji belongs to one or more emotion classes. To perform this task, we used fuzzy clustering (Bezdek, 1973), a form of clustering in which each data point can belong to more than one cluster.
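As an illustration of soft cluster membership, the following is a minimal sketch of the standard fuzzy c-means procedure in plain Python. It is not the implementation used in this paper: the fuzzifier m = 2, the random initialisation, the fixed iteration count, and the toy two-dimensional points are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=50, seed=0):
    """Standard fuzzy c-means; returns (centers, memberships)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    memberships = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        memberships.append([v / total for v in row])

    exponent = 2.0 / (m - 1.0)
    centers = []
    for _ in range(n_iter):
        # Update each center as the membership-weighted mean of the points.
        centers = []
        for c in range(n_clusters):
            weights = [row[c] ** m for row in memberships]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        # Update memberships from the relative distances to all centers.
        for j, p in enumerate(points):
            dists = [max(euclidean(p, center), 1e-12) for center in centers]
            memberships[j] = [
                1.0 / sum((dists[c] / dists[k]) ** exponent for k in range(n_clusters))
                for c in range(n_clusters)
            ]
    return centers, memberships


# Toy data (assumed): two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, memberships = fuzzy_c_means(points, n_clusters=2)
```

Each row of `memberships` sums to one, so a point lying between two clusters receives a graded membership in both instead of a single hard label.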

4.1 Data Collection

The data were collected by crawling tweets through the Twitter REST API²; over one million tweets were crawled over a period of two weeks. Retweets were excluded, and only tweets tagged by Twitter as English and published in the USA were included. We kept only the tweets that contain at least one emoji associated with one or more emotions. To compile the list of emojis, 1) we scanned the dataset to identify frequently used emojis, and 2) we selected commonly used emojis based on emojitracker³, which shows the use of emojis on Twitter in real time. As a result, we selected 247 emojis, and the 454,975 tweets that include emojis are used in this research.

In addition, we used a manually labelled dataset that is publicly available as part of the SemEval-2018 competition⁴. This dataset includes tweets labelled with 11 classes based on Plutchik’s theory of emotion (the eight basic emotions with the addition of love, optimism, and pessimism).

4.2 Projecting Emojis Onto Emotions

Several studies on the sentiment of emoticons, such as Boia et al. (2013) and Wood and Ruder (2016), found that the sentiment of an emoticon is in substantial agreement with the sentiment of the entire tweet. Inspired by these results, we treat the emojis in a tweet as a representation of the emotion of the whole tweet. We have a set of tweets with 11 emotion class labels, in which many tweets contain multiple repetitions of the same emoji or multiple different emojis. We separated the emojis from the tweets, so that each emoji carries the 11 emotion class labels of its tweet with values of 0 or 1. This pre-processing captures the association of emojis with multiple emotion classes, which enables multi-label classification.
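The pre-processing described above can be sketched as follows. The tweet representation (a token list paired with a set of emotion labels), the function name, and the choice to collapse repeated emojis within a tweet are assumptions made for this illustration; the label order follows the 11 classes of the SemEval-2018 E-C task.

```python
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust"]


def project_emojis(tweets, emoji_vocab):
    """Give every distinct emoji in a tweet the tweet's 0/1 emotion label vector."""
    rows = []
    for tokens, labels in tweets:
        label_vector = [1 if emotion in labels else 0 for emotion in EMOTIONS]
        # dict.fromkeys removes repeated emojis while keeping first-seen order.
        for emoji in dict.fromkeys(t for t in tokens if t in emoji_vocab):
            rows.append((emoji, label_vector))
    return rows


# Toy data (assumed): emojis are written as names to keep the example ASCII-only.
tweets = [
    (["so", "happy", "red_heart", "red_heart"], {"joy", "love"}),
    (["ugh", "angry_face", "red_heart"], {"anger"}),
]
rows = project_emojis(tweets, emoji_vocab={"red_heart", "angry_face"})
```

Here `red_heart` ends up with two rows, one labelled joy and love and one labelled anger, which is the kind of multi-label evidence the later steps aggregate.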

2 https://developer.twitter.com/en/docs

3 http://emojitracker.com

4 https://competitions.codalab.org/competitions/17751


Figure 3: Proposed fuzzy clustering-based model to project emojis onto emotion classes

To categorise emojis into emotion classes, we first used skip-gram and then calculated the pointwise mutual information (PMI) to obtain the correlation between emojis and emotion labels. The output is a matrix of similarities (computed with cosine similarity) between emojis, which is taken as the weight of each emoji. We then scaled the data.
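For reference, the PMI between an emoji e and an emotion label y is log(p(e, y) / (p(e) p(y))). The sketch below computes it from raw co-occurrence counts; it covers only the PMI step, and the function name, the event representation, and the toy counts are assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Compute PMI for every observed (emoji, emotion) co-occurrence event."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    emoji_counts = Counter(emoji for emoji, _ in pairs)
    emotion_counts = Counter(emotion for _, emotion in pairs)
    return {
        (emoji, emotion): math.log(
            (count / n) / ((emoji_counts[emoji] / n) * (emotion_counts[emotion] / n))
        )
        for (emoji, emotion), count in joint.items()
    }


# Toy co-occurrence events (assumed), one per (emoji, active label) pair.
events = ([("red_heart", "love")] * 3
          + [("red_heart", "joy")]
          + [("angry_face", "anger")] * 2)
pmi = pmi_table(events)
```

A positive value means the emoji and the label co-occur more often than expected under independence; pairs that never co-occur are simply absent from the table.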

Table 1: Example of Emoji Weights

| Emoji | anger | disgust | joy | anticipation | love | pessimism | optimism | sadness | fear | trust |
|---|---|---|---|---|---|---|---|---|---|---|
| Red heart | -0.12 | -0.10 | 0.5 | 0 | 1.81 | 0 | 0 | 0 | 0 | 0 |

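Cosine similarity formula, for two weight vectors A and B with n components:

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \tag{1}$$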
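A minimal sketch of cosine similarity between two emoji weight vectors is given below. Returning 0 for an all-zero vector is an assumption made here to avoid division by zero; the paper does not state how that case is handled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score close to 1 regardless of their length, which is why the measure suits weight vectors of differing magnitude.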

However, the emotion-labelled dataset contains a limited number of emojis. To extend the number of emojis, we propose a fuzzy clustering method in Section 4.3.


4.3 Emoji Fuzzy Clustering

To categorise the emojis that did not appear in the labelled dataset, we used the 454,975 crawled tweets. We separated the emojis to find the similarities between them. To this end, the data points (emojis) within epsilon distance are first determined (E: epsilon distance, k: number of data points). These k data points are then given as input to the fuzzy clustering (Table 2), which gives the membership of each data point in the classes in the form of probabilities.

Table 2: FuzzyMoji Clustering (Algorithm I)

Input parameters: data set, k, threshold
Output: classified test tuples

Step 1: Store all the training tuples.
Step 2: Compute the similarity of all the training tuples using equation (1).
Step 3: For each unseen tuple to be classified:
    A. Find the training tuples within E distance of the unseen tuple, using equation (1).
    B. Use fuzzy clustering to get the probabilities for each cluster, including the unseen tuple.
    C. Compute the average weights of each cluster (S) and multiply them by the probability of the cluster (P): S * P
    D. Sum the weights for each label.
End for
Step 4: Apply a gate to assign the labels whose weight exceeds the threshold.
Step 5: Apply emotion polar opposites based on Plutchik's emotion model.

Example of output:

Red heart: Love, Joy
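Steps 3C, 3D, and 4 of the algorithm can be sketched as below. The function names, the three-label subset, the cluster averages, the membership probabilities, and the threshold value are all hypothetical and chosen only to reproduce an output of the same shape as the example.

```python
def aggregate_label_scores(cluster_avg_weights, memberships):
    """Steps 3C-3D: weight each cluster's average label weights S by P, then sum per label."""
    n_labels = len(cluster_avg_weights[0])
    return [
        sum(p * s[i] for p, s in zip(memberships, cluster_avg_weights))
        for i in range(n_labels)
    ]


def gate(scores, labels, threshold):
    """Step 4: keep the labels whose aggregated score exceeds the threshold."""
    return [label for label, score in zip(labels, scores) if score > threshold]


# Toy values (assumed) for one unseen emoji and two clusters.
labels = ["joy", "love", "sadness"]
cluster_avg_weights = [[0.5, 1.8, 0.0],   # S for cluster 0
                       [0.0, 0.1, 1.2]]   # S for cluster 1
memberships = [0.9, 0.1]                  # P from fuzzy clustering

scores = aggregate_label_scores(cluster_avg_weights, memberships)
assigned = gate(scores, labels, threshold=0.4)
```

With these numbers the unseen emoji is assigned joy and love, because sadness receives only the small contribution from the weakly matching second cluster.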

In the last step, we used Plutchik’s notion of polar opposite emotions. For example, in Plutchik’s theory joy is the opposite of sadness, and therefore in our approach an emoji can be in either the joy class or the sadness class, but not both. Initially we obtained the emotion classes of 85 emojis (Section 4.2); after performing fuzzy clustering, 202 emojis were categorized into the 11 emotion classes.
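One possible way to enforce polar opposites is sketched below. The four opposite pairs are those of Plutchik's wheel; keeping the higher-scoring label of a conflicting pair (and the first-listed label on a tie) is an assumption, since the text does not specify how a conflict is resolved.

```python
# Opposite pairs of Plutchik's eight basic emotions.
OPPOSITES = [("joy", "sadness"), ("anger", "fear"),
             ("trust", "disgust"), ("anticipation", "surprise")]


def resolve_opposites(scores):
    """Drop the weaker label of each opposite pair; scores maps label -> score."""
    kept = set(scores)
    for first, second in OPPOSITES:
        if first in kept and second in kept:
            # Assumed rule: keep the higher score; a tie keeps the first-listed label.
            kept.discard(first if scores[first] < scores[second] else second)
    return kept


# Toy scores (assumed) for labels that passed the threshold gate.
resolved = resolve_opposites({"joy": 0.8, "sadness": 0.5, "love": 1.2})
```

Labels outside the four pairs, such as love, are left untouched by this step.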

4.4 Emoji Embeddings (FuzzyMoji2Vec)

Inspired by Eisner et al. (2016), we trained our embeddings with skip-gram, a variant of word2vec that has been used in many NLP tasks. The model is trained on each emoji with its corresponding emotion classes as labels. The output is an embedding layer of 300 dimensions that projects each emoji into a vector space.

5. Results and Discussion

We combined the output of FuzzyMoji2Vec with an existing word embedding (GloVe) for both the sentiment analysis and the emotion analysis tasks. For sentiment analysis (negative, positive, neutral), we followed Eisner et al. (2016) and used the same dataset (Novak et al., 2015) and the same two classification algorithms, Random Forests (Ho, 1995) and Linear SVM (Fan et al., 2008). For multi-label emotion classification (SemEval-2018 Task 1: E-C), we used the deep learning method of Kim and Lee (2018) with a slight modification in the last layer to enable multi-labelling of tweets.


We chose this method because its use of transfer learning and Bi-LSTM improves multi-label emotion classification. For this task, multi-label accuracy (the Jaccard index) is used as the accuracy measure, defined as:

$$\mathrm{Jaccard} = \frac{1}{|T|} \sum_{t \in T} \frac{|G_t \cap P_t|}{|G_t \cup P_t|}$$

where T is the set of tweets, G_t is the set of gold labels, and P_t is the set of predicted labels for tweet t. Furthermore, micro-averaged F1-score and macro-averaged F1-score are used.
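The multi-label accuracy defined above can be sketched in a few lines. Scoring a tweet as 1 when both its gold and predicted label sets are empty is an assumption made here to avoid division by zero; the function name and the toy labels are illustrative.

```python
def jaccard_accuracy(gold, predicted):
    """Mean over tweets of |G_t & P_t| / |G_t | P_t|."""
    total = 0.0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        union = g | p
        # Assumed convention: two empty label sets count as a perfect match.
        total += len(g & p) / len(union) if union else 1.0
    return total / len(gold)


# Toy labels (assumed) for three tweets.
gold = [{"joy", "love"}, {"anger"}, set()]
predicted = [{"joy"}, {"anger", "disgust"}, set()]
score = jaccard_accuracy(gold, predicted)
```

In this example the three tweets score 0.5, 0.5, and 1.0, so both a missing label and a spurious label are penalised.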

We compared our results with existing emoji embedding research for both sentiment analysis and emotion analysis, as shown in Table 3 and Table 4. According to the results, the proposed method achieved the top accuracy among the existing models in the task of sentiment analysis, with a competitive but slightly lower score for the classification of the 90% most frequent emojis.

Table 3: Experiment results of the proposed model and existing models on the task of sentiment analysis

Classification accuracy on entire dataset, N = 12920

| Word Embeddings | Random Forest | Linear SVM |
|---|---|---|
| Google News | 57.5 | 58.5 |
| Google News + (Barbieri et al., 2016) | 58.2 | 60.0 |
| Google News + emoji2vec (Eisner et al., 2016) | 59.5 | 60.5 |
| Google News + Sense_des (Ai et al., 2017) | 59.1 | 62.2 |
| Face Emojis (Guibon, 2018) | 58.8 | 63.3 |
| Google News + FuzzyMoji2Vec | 59.8 | 61.8 |

Classification accuracy on tweets containing emoji, N = 2295

| Word Embeddings | Random Forest | Linear SVM |
|---|---|---|
| Face Emojis (Guibon, 2018) | 58.6 | 62.9 |
| Google News + FuzzyMoji2Vec | 59.2 | 62.6 |

Classification accuracy on 90% most frequent emoji, N = 2186

| Word Embeddings | Random Forest | Linear SVM |
|---|---|---|
| Google News + (Barbieri et al., 2016) | 52.8 | 56.9 |
| Google News + emoji2vec (Eisner et al., 2016) | 55.0 | 59.5 |
| Google News + Sense_des (Ai et al., 2017) | 50.2 | 55.3 |
| Face Emojis (Guibon, 2018) | 59.3 | 62.1 |
| Face Emojis (Guibon, 2018) | 53.5 | 54.8 |
| Google News + FuzzyMoji2Vec | 54.8 | 58.1 |

The comparison for multi-label emotion classification is shown in Table 4. Based on the reported results, the proposed emoji embedding can potentially improve the accuracy of emotion classification. This suggests that assigning correlated emotion classes to emojis and then developing the emoji embedding increases accuracy in the task of multi-label emotion classification.

Table 4: Experiment results of the proposed model and existing models for the task of multi-label emotion classification

| Word Embeddings | Jaccard | Micro F1 | Macro F1 |
|---|---|---|---|
| Google News (Glove) | 56.5 | 68.7 | 51.1 |
| Google News + emoji2vec (Eisner et al., 2016) | 55.4 | 68.01 | 52.06 |
| Face Emojis (Guibon, 2018) | 56.6 | 68.8 | 52.5 |
| Google News + FuzzyMoji2Vec | 57.1 | 68.9 | 50.7 |

6. Conclusion

This paper developed a fuzzy clustering approach to categorize emojis and add discriminating information to existing models. Each emoji may belong to one or more emotion classes, whereas prior studies often assigned each emoji to a single emotion class. The large number of emojis and the frequent addition of new emojis on social media platforms make manual categorization difficult and time-consuming. The proposed model classifies emojis by finding similarities between emojis and emotion classes, and develops an emoji embedding that can be used alongside existing word embedding models. The results show that using the correlation between emojis and emotion classes and emotional words, instead of the Unicode description of emojis or similarities between emojis, helps to improve emoji embeddings, and that combining word embeddings with emoji embeddings improves classification accuracy in both sentiment and emotion analysis tasks. This suggests that emoji embeddings can potentially enhance the performance of other social NLP tasks as well. Furthermore, we find that FuzzyMoji2Vec generally outperforms existing emoji embedding approaches.


References

Ai, W., Lu, X., Liu, X., Wang, N., Huang, G., & Mei, Q. (2017, April). Untangling emoji popularity through semantic embeddings. In Eleventh International AAAI Conference on Web and Social Media.

Alshenqeeti, H. (2016). Are emojis creating a new or old visual language for new generations? A socio-semiotic study. Advances in Language and Literary Studies, 7(6), 56-69.

Asghar, M. Z., Khan, A., Bibi, A., Kundi, F. M., & Ahmad, H. (2017). Sentence-level emotion detection framework using rule-based classification. Cognitive Computation, 9(6), 868-894.

Barbieri, F., Ronzano, F., & Saggion, H. (2016). What does this emoji mean? a vector space skip-gram model for twitter emojis. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016); p. 3967-72.

Bezdek, J. C. (1973). Cluster validity with fuzzy sets. Journal of Cybernetics, 3(3).

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

Chen, Y., Yuan, J., You, Q., & Luo, J. (2018, October). Twitter sentiment analysis via bi-sense emoji embedding and attention-based LSTM. In 2018 ACM Multimedia Conference on Multimedia Conference (pp. 117-125). ACM.

Eisner, B., Rocktäschel, T., Augenstein, I., Bošnjak, M., & Riedel, S. (2016). emoji2vec: Learning emoji representations from their description. arXiv preprint arXiv:1609.08359.

Ekman, P., Friesen, W. V., O'Sullivan, M., Chan, A., Diacoyanni-Tarlatzis, I., Heider, K., ... & Scherer, K. (1987). Universals and cultural differences in the judgments of facial expressions of emotion. Journal of Personality and Social Psychology, 53(4), 712.

Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., & Lehmann, S. (2017). Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. arXiv preprint arXiv:1708.00524.

Guibon, G., Ochs, M., & Bellot, P. (2018, March). From Emoji Usage to Categorical Emoji Prediction. 19th International Conference on Computational Linguistics and Intelligent Text Processing (CICLING 2018)

Hauthal, E., Burghardt, D., & Dunkel, A. (2019). Analyzing and Visualizing Emotional Reactions Expressed by Emojis in Location-Based Social Media. ISPRS International Journal of Geo-Information, 8(3), 113.

Ho, T. K. (1995, August). Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition (Vol. 1, pp. 278-282). IEEE.

Kim, Y., & Lee, H. (2018, June). DMCB at SemEval-2018 Task 1: Transfer Learning of Sentiment Classification Using Group LSTM for Emotion Intensity prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation (pp. 300-304).

Li, X., Yan, R., & Zhang, M. (2017, July). Joint emoji classification and embedding learning. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data (pp. 48-63). Springer, Cham.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural


Pennington, J., Socher, R., & Manning, C. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).

Rakhmetullina, A., Trautmann, D., & Groh, G. (2018). Distant Supervision for Emotion Classification Task using emoji2emotion.

Vora, P., Khara, M., & Kelkar, K. (2017). Classification of Tweets based on Emotions using Word Embedding and Random Forest Classifiers. International Journal of Computer Applications, 178(3), 1-7.

Wijeratne, S., Balasuriya, L., Sheth, A., & Doran, D. (2017, August). A semantics-based measure of emoji similarity. In Proceedings of the International Conference on Web Intelligence (pp. 646-653). ACM.

Wood, I., & Ruder, S. (2016, May). Emoji as emotion tags for tweets. In Proceedings of the Emotion and Sentiment Analysis Workshop LREC2016, Portorož, Slovenia (pp. 76-79).

Yadollahi, A., Shahraki, A. G., & Zaiane, O. R. (2017). Current state of text sentiment analysis from opinion to emotion mining. ACM Computing Surveys (CSUR), 50(2), 25.