Event Information Extraction from Indonesian Tweets using Conditional Random Field

Fawwaz Muhammad

School of Electrical Engineering and Informatics Institut Teknologi Bandung

Masayu Leylia Khodra

School of Electrical Engineering and Informatics Institut Teknologi Bandung

Abstract—Information extraction is the process of finding structured information in unstructured or semi-structured text. The objective of this research is to build an information extraction system specialized for events in Indonesian tweets. The system consists of two main parts. The first part filters relevant tweets from irrelevant ones. It uses only a rule-based approach with an additional bag-of-words feature and achieves a best accuracy of 86%. The second part performs the extraction. In our experiments, the best configuration for the extractor module combines the multi-token tokenization method, the full feature set, and a first-order Conditional Random Field. This combination results in an average accuracy of 74% per token.

Keywords—Conditional Random Field; Information Extraction; Events; Indonesian Tweets;

I. INTRODUCTION

Several media outlets collect event information. Websites such as bandungtourism.com and infobandung.co.id collect event information around Bandung. Event information is collected not only in digital media but also in well-known daily newspapers such as Media Indonesia and Kompas. Both newspapers have a one-page weekly digest of event information.

Indonesia ranks fifth among countries by number of Twitter users. Unlike in other countries, the number of Twitter users in Indonesia is still growing. That is one reason why Twitter opened an office in Jakarta in March 2015 [1]. Twitter recognizes the opportunity that Indonesia's population offers.

One use of Twitter in Indonesia is to promote events, which makes Twitter a potential source of event information. As evidence, several Indonesian Twitter accounts collect event information, such as @event_indonesia, @evenesia, and @info_event. However, those accounts collect event information manually, and there is currently no mechanism to collect it automatically. Therefore, a system is needed that can extract event information from Indonesian tweets automatically.

Several studies are related to this work. Previous research suggests that Conditional Random Field is the best algorithm for named entity recognition in tweets. However, the features used in [2] are not directly applicable to Indonesian tweets because of the special characteristics that Indonesian tweets have.

Other research concerns event detection. Studies such as [3] focus on detecting the existence of events in tweets, and [4] focuses on detecting unplanned events. This research is slightly different. Following the definition of an event in [5], it focuses on extracting, with the Conditional Random Field algorithm, the main components that form a planned event rather than on detecting the existence of events.

Another study conducted information extraction from Indonesian tweets in a slightly different domain [6]. It used several features and obtained a very good result. Combining the algorithm used in [7] and the features used in [6] with additional domain-specific features might give a better result.

The rest of this paper is organized as follows. Section II discusses related work on information extraction. Section III explains the corpus used in this research. Section IV presents our methods. Section V reports the results of our experiments. Finally, the last section gives conclusions and further work.

II. INFORMATION EXTRACTION

Information extraction is the task of finding structured information in unstructured or semi-structured text. Unstructured text is text that requires language analysis to understand its whole message. Semi-structured text is text whose layout helps the understanding of the text. Structured information is a ready-to-fill relational database, meaning that the semantics of the data are defined by the user [8]. Information extraction can be treated as a sequence labelling task [9].

According to Jurafsky & Martin, there are five steps to extract information from text [9]. However, not all steps are relevant to the Indonesian Twitter domain. The only relevant steps are named entity recognition and template filling. Named entity recognition can be done with a rule-based approach or a statistical approach. Several learning algorithms are used in the statistical approach, such as SVM, k-NN, HMM, and Conditional Random Field (CRF) [6].

CRF is a discriminative algorithm that extends logistic regression to structured outputs [10]. CRF has many varieties. For sequence labelling, the linear-chain CRF is the most suitable form. A linear-chain CRF is quite similar to a Hidden Markov Model, but instead of maximizing the joint probability it maximizes the conditional probability. A linear-chain CRF also has a parameter called the CRF order, which represents how many states before the current state affect the current state [11].
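For reference, the conditional probability that a first-order linear-chain CRF maximizes is commonly written as below (see, for example, [11]). The symbols $x$, $y$, $f_k$, $\lambda_k$, and $Z(x)$ are notation introduced here for illustration; they do not appear elsewhere in this paper.

```latex
p(y \mid x) = \frac{1}{Z(x)}
  \exp\!\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'}
  \exp\!\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Here $x$ is the token sequence of a tweet, $y$ is its label sequence, each $f_k$ is a feature function, and $\lambda_k$ is the weight of that feature function. With a higher CRF order, $f_k$ may also depend on earlier labels such as $y_{t-2}$.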

III. CORPUS

We aim to extract four components that form an event, as shown in Table 1: event name (i-name), event location (i-place), event time (i-time), and additional event information (i-info). The event name distinguishes an event from others. We assume that an event is associated with its name, so two events held at the same place and time but with different names are treated as different events. Event location and event time are where and when the event takes place. The last component is additional information about the event, which augments the description of the mentioned event. It can be ticket information, reservation details, artists performing at the event, the event website, a picture of the event, or a digital poster.

TABLE 1. LABELS USED IN THIS RESEARCH

Label | Description | Example
i-name | The event name | Konser Raisa [Raisa Concert]
i-place | Event location | Jakarta Convention Center
i-time | Event time | 20 May 2015
i-info | Additional information about event | Ticketing : 08579493302
other | Unrelated token | Jangan sampai ketinggalan ! [Don’t miss it !]

We built a corpus of event information from tweets. The corpus consists of 1,120 tweets, both relevant and irrelevant. The tweets were collected from March 7th 2015 20:21:56 (GMT+7) to April 27th 2015 23:30:38 (GMT+7). Each tweet was then labelled as relevant or irrelevant by one human annotator, giving 700 relevant tweets and 420 irrelevant tweets. After that, we tokenized the relevant tweets and obtained 13,242 tokens, of which 3,194 are unique. Of these, 571 tokens are root words. We labelled each token and obtained the label distribution shown in Table 2.

TABLE 2. DISTRIBUTION OF CORPUS OVER LABEL

Label | Total Token | Example
i-name | 1,523 | ‘#ngamplaglive’, ‘@psmitb’
i-place | 3,503 | ‘hotel’, ‘graha’
i-time | 1,037 | ‘maret’, ‘2015’
i-info | 1,605 | ‘info’, ‘w/’
Other | 5,574 | ‘di’, ‘&’

Grouping the tokens by frequency gives the distribution shown in Table 3.

TABLE 3. DISTRIBUTION OF TOKEN FREQUENCY

Token Frequency | Total Unique Token | Example
1 | 1,813 | ‘Tweeps’, ‘musikal’
2 | 460 | ‘seniman’, ‘sehat’
3-10 | 698 | ‘banget’, ‘dari’
11-20 | 154 | ‘start’, ‘musik’
> 20 | 69 | ‘pvj’, ‘info’

Grouping the tokens by affix gives the distribution shown in Table 4.

TABLE 4. DISTRIBUTION OF CORPUS OVER POSTAG

Affix | POSTag | Total Unique Token | Example
Ke-an | Noun | 6 | ‘Kesenian’ (art)
Pe-an | Noun | 16 | ‘Pendidikan’ (education)
Pe- | Noun | 9 | ‘Personil’ (member)
-an | Noun | 5 | ‘Pameran’ (exhibition)
Me- | Verb | 10 | ‘Menjual’ (selling)
Me-kan | Verb | 9 | ‘Memeriahkan’ (jazz up)
Ber- | Verb | 4 | ‘Berbagi’ (shared)
-kan | Verb | 10 | ‘Saksikan’ (watch)
di- | Verb | 6 | ‘Diadakan’ (held)
-i | Verb | 6 | ‘Hubungi’ (contact)
Ter- | Adverb | 2 | ‘Terbatas’ (limited)
Ber- | Adjective | 3 | ‘Bersama’ (together)

IV. METHODS

Our system receives tweets by querying Twitter search with several keywords or hashtags. We built a system that gathers tweets around the two biggest event accounts in Bandung, both of which collect event information on Twitter. Our system retrieves tweets in three ways:

1. Personal tweets that contain a mention of @infobdgevent or @bdgevent, excluding retweets. Because many retweets repeat the same information every day, we ignore retweets from personal accounts. We assume that mentioning @infobdgevent or @bdgevent is how Twitter users submit information about an event to those accounts.

2. Tweets from @infobdgevent or @bdgevent, including retweets. We include retweets from these accounts because we assume that, after personal accounts submit information about an event to @infobdgevent or @bdgevent, both accounts retweet the submitted information.

3. Search results for the #eventbdg or #eventbandung hashtag. Both hashtags are used to group tweets containing event information around Bandung.
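The three retrieval rules can be sketched as a single predicate over an already downloaded tweet. This is only an illustration: the dictionary fields `author`, `text`, and `is_retweet` are hypothetical and do not correspond to any particular Twitter API, and the sketch assumes that retweets are not excluded from the hashtag search.

```python
import re

EVENT_ACCOUNTS = {"infobdgevent", "bdgevent"}
EVENT_HASHTAGS = {"eventbdg", "eventbandung"}


def should_collect(tweet: dict) -> bool:
    """Return True if the tweet matches one of the three retrieval rules."""
    author = tweet["author"].lower()
    text = tweet["text"].lower()
    mentions = set(re.findall(r"@(\w+)", text))
    hashtags = set(re.findall(r"#(\w+)", text))

    # Rule 2: tweets posted by the event accounts, retweets included.
    if author in EVENT_ACCOUNTS:
        return True
    # Rule 1: personal tweets mentioning an event account, retweets excluded.
    if mentions & EVENT_ACCOUNTS and not tweet["is_retweet"]:
        return True
    # Rule 3: tweets carrying one of the event hashtags.
    return bool(hashtags & EVENT_HASHTAGS)
```

Checking the author first keeps retweets posted by the two event accounts, while retweets of those accounts posted by personal accounts are dropped by the second rule.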

After observing the data collected for one month, we identified several characteristics of tweets that contain event information:

1. Tweets containing event information are mixed with ordinary tweets. Sometimes a user tweets content unrelated to any event, usually to build interaction with followers. Figure 1 (10 reasons why you shouldn’t date a movie freak) is an example of an irrelevant tweet: it does not contain any information about an event.

Figure 1 Example of Irrelevant Tweet

2. Sometimes the targeted named entities appear only partially. Although we target four components that form an event (name, location, time, additional information), usually not all of them appear in an observed tweet. Sometimes the collected tweet contains only the event name and event time without the event location. Figure 2 (Dear Telkom University alumnus, prepare yourself to join the reunion) is an example of a tweet in which the event components are incomplete. The tweet does not state where and when the event will take place; it only states what the event is (Gathering). The event location and time can only be inferred from the digital poster attached to the tweet.

Figure 2 Example of Tweet containing only some part of Targeted Named Entity

Considering these characteristics, we propose a system that consists of two main parts, following Sakaki [12]. The first part is the filter module, which separates relevant tweets from irrelevant ones. The second part is the extractor module, which extracts the named entities inside the tweet.

Our architecture consists of five components, as shown in Figure 3. The Filter component represents the first part, while the remaining components form the extractor module. The extractor module is built from the Tokenizer and the POS Tagger, which extract the features used in the next process; the Named Entity Recognition component, which extracts the named entities inside the tweet; and the Template Filler, which transforms the extracted named entities into uniform data before they are stored in the database.

Figure 3 System Architecture

A. Filter

This component is needed because not all tweets received from Twitter are relevant to this research. It isolates relevant tweets before they enter the extractor module. We define a relevant tweet as any tweet that contains at least one piece of event information other than additional information. We assume that an event can be identified by its name, place, or time, but not by its additional information.

Since filtering is not the main focus of this research, we simplify the filter component by using a rule-based approach combined with a bag-of-words feature. The pseudocode of the rule is given below:

1. Eliminate mentions, hashtags, and URLs from the tweet
2. Count the characters remaining after elimination
3. If the remaining characters ≤ 82
   a. Examine each word in the tweet
   b. If any word is in the bag of words of relevant tweets → classify as relevant tweet
   c. If not → classify as irrelevant tweet
4. If the remaining characters > 82
   a. Examine each word in the tweet
   b. If any word is in the bag of words of irrelevant tweets → classify as irrelevant tweet
   c. If not → classify as relevant tweet

We selected these rules based on analysis of the data observed in the first month. We assume that every user wants to make maximum use of Twitter's character limit while still conveying the event. By averaging the lengths of irrelevant and relevant tweets, we obtained 82 characters as the dividing value.

However, this rule might not be completely accurate. To increase the filter performance, we use the bag of words as an additional rule to complement the length-based rule.
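The filter rule can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the two word sets contain only the example words mentioned in this paper, the regular expression for mentions, hashtags, and URLs is an assumption, and step 4c is assumed to classify the tweet as relevant.

```python
import re

LENGTH_THRESHOLD = 82  # characters, as stated in the text

# Hypothetical bags of words; only the examples given in the paper are listed.
RELEVANT_WORDS = {"tiket", "penampilan"}
IRRELEVANT_WORDS = {"jual", "sewa"}

# Assumed patterns for mentions, hashtags, and URLs.
NOISE = re.compile(r"(?:@\w+|#\w+|https?://\S+)")


def is_relevant(tweet: str) -> bool:
    """Apply the length rule combined with the bag-of-words rule."""
    stripped = NOISE.sub("", tweet).strip()
    words = set(re.findall(r"\w+", stripped.lower()))
    if len(stripped) <= LENGTH_THRESHOLD:
        # Short tweets are kept only if they contain a relevant word.
        return bool(words & RELEVANT_WORDS)
    # Long tweets are kept unless they contain an irrelevant word.
    return not (words & IRRELEVANT_WORDS)
```

For example, `is_relevant("Info tiket konser @bdgevent #eventbdg")` keeps the tweet because the short remaining text contains "tiket".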

The bags of words used in this component are also based on analysis of the data observed in the first month. We construct two bags of words: the first contains words from relevant tweets and the second contains words from irrelevant tweets.

Most irrelevant tweets are ads, user opinions, or expressions about random topics on Twitter. By observing the data collected in the first month, we could list words that appear only in relevant or only in irrelevant tweets, but not both. For example, the words “jual” (sell) and “sewa” (rent) mostly appear in irrelevant tweets, while “tiket” (ticket) and “penampilan” (show) mostly appear in relevant tweets. Using a minimum frequency threshold as the boundary for each bag of words, we list the words that mostly appear in relevant or in irrelevant tweets. In this research, we use a constant value of 3 as the minimum word frequency.
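One way to build the two bags of words from labelled tweets is sketched below. It assumes that frequency means the total number of occurrences within one class and that a word is kept only if it never appears in the other class; the paper does not specify either detail, and the function name `build_bags` is hypothetical.

```python
import re
from collections import Counter

MIN_FREQUENCY = 3  # minimum word frequency stated in the text


def build_bags(relevant_tweets, irrelevant_tweets, min_freq=MIN_FREQUENCY):
    """Return (relevant_bag, irrelevant_bag) as sets of lowercase words."""

    def count_words(tweets):
        counter = Counter()
        for tweet in tweets:
            counter.update(re.findall(r"\w+", tweet.lower()))
        return counter

    relevant_counts = count_words(relevant_tweets)
    irrelevant_counts = count_words(irrelevant_tweets)

    # Keep frequent words that occur in one class only.
    relevant_bag = {
        word for word, n in relevant_counts.items()
        if n >= min_freq and word not in irrelevant_counts
    }
    irrelevant_bag = {
        word for word, n in irrelevant_counts.items()
        if n >= min_freq and word not in relevant_counts
    }
    return relevant_bag, irrelevant_bag
```

A softer criterion, such as a ratio between the two class frequencies, would also fit the description "mostly appears"; the strict exclusivity used here is the simplest reading.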

Every tweet is filtered by this component, so an irrelevant tweet such as the one in Figure 1 is discarded and not processed by the subsequent components.

B. Tokenization

Tokenization is the process of transforming text into its smallest units, called tokens. A token is not always a word, but in this case tokens are at the word level. We use a modified Twokenizer [13] in this component to tokenize the raw tweet, adding the regular expressions shown in Table 5.

By the end of this process, a tweet like ‘Acara JAKARTA : Gadget Festival 2015 | 9-11 Januari | Marketing Kantor Golf Island Pantai Indah Kapuk.’ will be tokenized into a list of tokens like: {Acara, JAKARTA, :, Gadget, Festival, 2015, |, 9, -, 11, Januari, |, Marketing, Kantor, Golf, Island, Pantai, Indah, Kapuk}.

TABLE 5. ADDITIONAL REGEX

Pattern name: Day number
Regular expression pattern: ((?:3[01]|[12][0-9]|[0]?+[1-9])(?>th|st|nd|rd)?)
Matching words example: 31
Explanation: A day number is always between 1 and 31.

Pattern name: Day name
Regular expression pattern: (?>senin|selasa|rabu|kamis|jumat|mon|tues|wed|thu|fri|sat|snn|sls|rb|kms|jmt|sab|ming)
Matching words example: Senin, selasa
Explanation: Several tweets mix Indonesian and English to express day names.

Pattern name: Month number
Regular expression pattern: (?:1[012]|0?+[1-9])
Explanation: A valid month number is always between 1 and 12.

Pattern name: Month name
Regular expression pattern: ...mber|desember|jan|feb|mar|apr|mei|jun|jul|agt|sep|okt|nov|des))
Matching words example: Januari, februari, maret
Explanation: Indonesian tweets may mix Indonesian and English. Sometimes they also use abbreviations.
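The patterns in Table 5 use Java-style atomic groups and possessive quantifiers. As a rough illustration, the sketch below adapts three of them to Python's `re` module with ordinary non-capturing groups, which is a simplification rather than an exact translation. The month-name pattern is left out because it is incomplete in Table 5, and the helper `matching_patterns` is a hypothetical name.

```python
import re

# Simplified adaptations of the Table 5 patterns.
PATTERNS = {
    "day_number": r"(?:3[01]|[12][0-9]|0?[1-9])(?:th|st|nd|rd)?",
    "day_name": (
        r"(?:senin|selasa|rabu|kamis|jumat|mon|tues|wed|thu|fri|sat"
        r"|snn|sls|rb|kms|jmt|sab|ming)"
    ),
    "month_number": r"(?:1[012]|0?[1-9])",
}


def matching_patterns(token: str) -> list:
    """Return the names of all patterns that match the whole token."""
    lowered = token.lower()
    return [
        name for name, pattern in PATTERNS.items()
        if re.fullmatch(pattern, lowered)
    ]
```

A token such as "9" matches both the day-number and the month-number pattern, so a tokenizer built on these patterns would still need surrounding context to decide which one applies.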

C. POS Tagger

To simplify the problem, we use an open-source Bahasa Indonesia dictionary to determine the POS tag of tokens [14]. However, not all tokens from the previous process are root words, so we add the morphological analysis shown in Table 6 to help determine the POS tag of a token [15]. Finally, if no morphological rule matches the token, its POS tag is left empty.

By the end of this process, the previous example will be tagged like: {Acara/N, JAKARTA/N, :, Gadget/N, Festival/N, 2015, |, 9, -, 11, Januari, |, Marketing, Kantor/N, Golf/N, Island, Pantai/N, Indah/Adj, Kapuk/N}

TABLE 6. MORPHOLOGICAL ANALYSIS USED TO DETECT POSTAG

POS Tag | Affixes | Example
Noun | Ke-an, Pe-an, Pe-, -an, -in, -wan, -wati, -isme, -isasi, -logi, -tas | Peluncuran [launching], Pertemuan [meeting]
Verb | me-, ber-, -kan, di-, -i, Ter- | Mengikuti [attending], daftarkan [register]
Adjective or Adverb | ber-, ter- | Terbaik [the best of], bersama [with]
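The lookup order described above (dictionary first, then affix rules, otherwise empty) can be sketched as follows. The miniature dictionary is hypothetical and merely stands in for the dictionary from [14]. Only a subset of the Table 6 affixes is included: ber- and ter- are omitted because Table 6 lists them under more than one tag, and -i and -in are omitted because they would match too many unrelated words in this simplified form. The rule order is also an assumption.

```python
# Hypothetical miniature dictionary standing in for the one from [14].
DICTIONARY = {"acara": "N", "kantor": "N", "indah": "Adj"}

# (prefix, suffix, tag) rules adapted from Table 6, checked in order.
AFFIX_RULES = [
    ("ke", "an", "N"),
    ("pe", "an", "N"),
    ("me", "", "V"),
    ("di", "", "V"),
    ("", "kan", "V"),
    ("", "wan", "N"),
    ("", "wati", "N"),
    ("", "isme", "N"),
    ("", "isasi", "N"),
    ("", "logi", "N"),
    ("", "tas", "N"),
    ("pe", "", "N"),
    ("", "an", "N"),
]


def pos_tag(token: str) -> str:
    """Return a POS tag for the token, or an empty string if none applies."""
    word = token.lower()
    if word in DICTIONARY:
        return DICTIONARY[word]
    for prefix, suffix, tag in AFFIX_RULES:
        has_stem = len(word) > len(prefix) + len(suffix)
        if has_stem and word.startswith(prefix) and word.endswith(suffix):
            return tag
    return ""
```

With these rules, "Peluncuran" is tagged as a noun through Pe-an and "daftarkan" as a verb through -kan, while "Island" and "2015" stay untagged, as in the example above.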

D. Named Entity recognition(NER)

The NER component uses the Conditional Random Field algorithm to build the model, as suggested in [7]. We use MALLET [16] with a custom pipe to extract the token features listed in Table 7. To apply the CRF algorithm, we use several feature functions, each of which takes several inputs: the full tweet, the position of the word in the tweet, the label of the current word, and the labels of several words before the current word. A feature function outputs 1 or 0 depending on the existence of the related feature. Finally, we estimate the weight of every feature function using label likelihood.

By the end of this process, the previous example will be annotated as {Acara/O, JAKARTA/O, :, Gadget/I-Name, Festival/I-Name, 2015/I-Name, |/O, 9/I-Time, -/I-Time, 11/I-Time, Januari/I-Time, |/O, Marketing/I-Place, Kantor/I-Place, Golf/I-Place, Island/I-Place, Pantai/I-Place, Indah/I-Place, Kapuk/I-Place}.
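The per-token features named in Table 7 could be computed as in the sketch below. It is written in Python for illustration only; the actual system uses a custom MALLET pipe, which is not reproduced here. The gazetteer contents and the exact tests behind each boolean feature are assumptions.

```python
import string

# Hypothetical gazetteer of place names; the real list is not given.
GAZETTEER = {"jakarta", "bandung"}


def token_features(tokens, pos_tags, i):
    """Build the Table 7 features for the token at position i."""
    token = tokens[i]
    lowered = token.lower()
    is_date_separator = (
        token in {"-", "/"}
        and 0 < i < len(tokens) - 1
        and tokens[i - 1].isdigit()
        and tokens[i + 1].isdigit()
    )
    return {
        "currentWord": lowered,
        "currentTag": pos_tags[i],
        "BefPOSTag": pos_tags[i - 1] if i >= 1 else "",
        "Bef2POSTag": pos_tags[i - 2] if i >= 2 else "",
        "isGazetteer": lowered in GAZETTEER,
        "isLink": lowered.startswith(("http://", "https://")),
        "isMention": token.startswith("@"),
        "isHashtag": token.startswith("#"),
        "isNumber": token.isdigit(),
        "isPunctuation": bool(token)
        and all(ch in string.punctuation for ch in token),
        "isDateSeparator": is_date_separator,
    }
```

For the token "-" in "9 - 11 Januari", this returns `isPunctuation` and `isDateSeparator` as true, which is the kind of binary evidence a CRF feature function can then combine with the current and previous labels.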

TABLE 7. FEATURES USED IN NER

Feature Name | Description
currentWord | Lexicon of current word
currentTag | POSTag of current word
BefPOSTag | The POSTag of the token before the current token
Bef2POSTag | The POSTag of the token two positions before the current token
isGazetteer | Whether the current word is in the gazetteer or not
isLink | Whether the current token is a link or not
isMention | Whether the current token is a mention or not
isHashtag | Whether the current token is a hashtag or not
isNumber | Whether the current token is a number or not
isPunctuation | Whether the current token is punctuation or not
isDateSeparator | Whether the current token separates numbers (date) or not

E. Template filler

Finally, before the extracted data is inserted into the database, we normalize its form. We resolve relative temporal expressions [9] into a standardized form. For example, if the extracted event time contains the token ‘besok’ (tomorrow), this component resolves the expression into a valid date value such as 21/06/2015.
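Resolving a relative expression requires a reference date, which this sketch assumes to be the date the tweet was posted. The small lookup table and the function name `resolve_relative_date` are hypothetical; only ‘besok’ is mentioned in the paper, and the other two entries are added for illustration.

```python
from datetime import date, timedelta

# Offsets in days: today, tomorrow, the day after tomorrow.
RELATIVE_DAYS = {"hari ini": 0, "besok": 1, "lusa": 2}


def resolve_relative_date(expression, tweet_date):
    """Resolve a relative expression to DD/MM/YYYY, or None if unknown."""
    offset = RELATIVE_DAYS.get(expression.strip().lower())
    if offset is None:
        return None
    resolved = tweet_date + timedelta(days=offset)
    return resolved.strftime("%d/%m/%Y")
```

For a tweet posted on 20 June 2015, `resolve_relative_date("besok", date(2015, 6, 20))` yields the 21/06/2015 value used in the example above.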

V. EXPERIMENT

The purpose of our experiments is to find a suitable configuration for the architecture designed in the previous section. The experiments are done on two modules: first the filter module and then the extractor module.

The first experiment evaluates the filter module. We compare the result of applying only the rule-based filter with the result of applying the rule-based filter together with the additional bag-of-words rule. The confusion matrices of the filter module are shown in Table 8. They show that even with the rule-based approach alone, the accuracy on relevant tweets is quite good (540 of 700, 77.1%). Applying the additional bag-of-words rule increases this accuracy to 86% (602 of 700).

TABLE 8. CONFUSION MATRIX AFTER APPLYING RULE-BASED-ONLY FILTER (LEFT) AND WITH ADDITIONAL BAG-OF-WORDS RULE (RIGHT)

 | Relevant | Irrelevant | | Relevant | Irrelevant
Relevant | 540 | 160 | Relevant | 602 | 98
Irrelevant | 169 | 251 | Irrelevant | 118 | 302
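The reported 77.1% and 86% can be reproduced from Table 8 as the share of relevant tweets that are classified correctly. The sketch below assumes that rows are the annotated classes and columns are the predicted classes, which is consistent with the row totals of 700 and 420. It also computes the accuracy over all 1,120 tweets for comparison.

```python
# Rows: annotated class (relevant, irrelevant); columns: predicted class.
RULE_ONLY = [[540, 160], [169, 251]]
RULE_PLUS_BOW = [[602, 98], [118, 302]]


def relevant_rate(matrix):
    """Share of relevant tweets that the filter keeps."""
    return matrix[0][0] / sum(matrix[0])


def overall_accuracy(matrix):
    """Share of all tweets that are classified correctly."""
    correct = matrix[0][0] + matrix[1][1]
    return correct / sum(sum(row) for row in matrix)


for name, matrix in [("rule only", RULE_ONLY), ("rule + BoW", RULE_PLUS_BOW)]:
    print(f"{name}: relevant {relevant_rate(matrix):.1%}, "
          f"overall {overall_accuracy(matrix):.1%}")
```

Under this reading, the accuracy over all tweets is about 70.6% for the rule-based filter alone and about 80.7% with the bag-of-words rule, so the improvement holds under either measure.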

The second experiment evaluates the extractor module. We divide the evaluation into three steps. The first step tests which tokenization method is better for event time. The second step finds the feature set that gives the best accuracy. The third step finds the CRF order that gives the best accuracy.

The first step tests the tokenization method. The tokenization method particularly affects how the event time component is tokenized. We treat the event time component as either a single token or multiple tokens. If it is treated as multiple tokens, an event time like ‘9 june 2015’ is treated as three different tokens (‘9’, ‘june’, and ‘2015’). After varying the tokenization method, we find that the multi-token method gives better accuracy than the single-token method, as shown in Table 9. The multi-token method produces less token variation than the single-token method, which is why it is better for tokenizing the event time component.

TABLE 9. ACCURACY OF TOKENIZATION METHODS

 | Single Token | Multi Token
Accuracy | 69% | 75%

The next step is finding the feature set that gives the model the best result. Using all features gives better accuracy than ignoring some of them. Table 10, Table 11, and Table 12 show the confusion matrices for the different feature sets. The accuracy of each feature set is summarized in Figure 4. Figure 4 shows that the gazetteer feature plays a large role in increasing event place accuracy. Compared with feature sets that do not use the gazetteer, using the gazetteer as a feature increases the accuracy of event place detection up to 59%.

Figure 4 The accuracy after applying various feature set

Examining the confusion matrices for all feature sets reveals additional knowledge. For every feature set, the biggest mistake is the confusion between additional event information (i-info) and other. This is caused by the difficulty of annotating tweets consistently: the human annotator is sometimes unsure which tokens are i-info and which are other. Consider the tweet “Ayo merapat ke XL Martadinata, ada foodtruck disana dgn menu2 kuliner yg maknyos & booth fashion jg loh dsana” (Let's come to XL Martadinata, there are many food trucks with delicious food and fashion booths). Sometimes the annotator labels the phrase “dgn menu2 kuliner yg maknyos” (with delicious food) as i-info because it adds context to the event, while at other times the annotator labels it as other because it gives no further information about the event.

TABLE 10. CONFUSION MATRIX WITH ALL FEATURES

 | i-name | i-place | i-time | i-info | other
i-name | 811 | 42 | 43 | 137 | 490
i-place | 37 | 615 | 18 | 153 | 214
i-time | 46 | 9 | 1107 | 88 | 355
i-info | 126 | 48 | 43 | 2593 | 693
other | 195 | 43 | 77 | 514 | 4745

TABLE 11. CONFUSION MATRIX WITHOUT USING POSTAG AND GAZETTEER FEATURE

 | i-name | i-place | i-time | i-info | other
i-name | 1158 | 14 | 30 | 68 | 253
i-place | 18 | 820 | 9 | 25 | 165
i-time | 31 | 20 | 1385 | 42 | 127
i-info | 47 | 25 | 16 | 3136 | 279
other | 92 | 30 | 47 | 202 | 5203

TABLE 12. CONFUSION MATRIX USING ONLY POSTAG FEATURE

 | i-name | i-place | i-time | i-info | other
i-name | 804 | 39 | 42 | 140 | 498
i-place | 68 | 316 | 29 | 203 | 421
i-time | 50 | 20 | 1101 | 103 | 331
i-info | 120 | 9 | 45 | 2669 | 660
other | 192 | 33 | 87 | 565 | 4697

As can be seen, for every label the largest number of false negatives falls in the other class. This is because the data handled in this research is imbalanced.
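These observations can be checked directly against Table 10. The sketch below assumes that rows are the annotated labels and columns are the predicted labels. It computes the per-token accuracy and, for each label, the class that receives most of its misclassified tokens.

```python
LABELS = ["i-name", "i-place", "i-time", "i-info", "other"]

# Table 10: confusion matrix with all features.
ALL_FEATURES = [
    [811, 42, 43, 137, 490],
    [37, 615, 18, 153, 214],
    [46, 9, 1107, 88, 355],
    [126, 48, 43, 2593, 693],
    [195, 43, 77, 514, 4745],
]


def token_accuracy(matrix):
    """Share of tokens on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def largest_confusion(matrix, labels):
    """Map each label to the class that absorbs most of its errors."""
    result = {}
    for i, row in enumerate(matrix):
        others = [k for k in range(len(row)) if k != i]
        worst = max(others, key=lambda k: row[k])
        result[labels[i]] = labels[worst]
    return result
```

On Table 10, `token_accuracy` gives about 0.745, in line with the per-token accuracy reported for the full feature set, and `largest_confusion` maps every entity label to other, while other is most often confused with i-info.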

The last step is finding a suitable CRF order. The CRF order indicates how many words before or after the current word play a significant role in deciding the current word's label. We tried several CRF orders and obtained the results shown in Table 13. They indicate that the most important word for determining the label of the current word is the one directly adjacent to it.

TABLE 13. CRF ORDER ACCURACY

CRF Order | Accuracy
1 | 75%
2 | 71%
3 | 65%

VI. CONCLUSION AND FURTHER WORK

The extraction system is built from two main parts, the filter module and the extractor module. The filter module separates relevant tweets from irrelevant tweets, and the extractor module recognizes the named entities inside the tweet.

The filter module achieves a best accuracy of 86% using a rule-based approach that utilizes tweet length and a bag-of-words feature. Although this approach is very simple, it gives good accuracy in this case. However, in more complex cases, such as implicit messages, this rule-based filter cannot correctly distinguish relevant from irrelevant tweets. Finally, we have built a system for extracting event information from Indonesian tweets using the best combination obtained from our experiments. For the extractor module, this combination consists of the multi-token tokenization method, the full feature set, and a CRF of order 1.

There are several suggestions to improve this work. First, use a group annotation method to decrease the info-other bias. Second, handle the imbalanced data set, which the current work does not yet do. Third, add features to the extractor module. One possible feature is the word's location inside the tweet, that is, whether the current word is located in the first part or the last part of the tweet. Fourth, compare this method with other methods. Fifth, extend the observation period to get more data. Although the period is quite long, querying Twitter returns only a few tweets a day. Special events are not held daily, so tweets about events are quite rare.

ACKNOWLEDGEMENT

This paper was reviewed by Dyah Rahmawati. Thanks for the suggestions.

References

[1] N. Freischlad, "Tech in Asia," 2015. [Online]. Available: https://www.techinasia.com/6-reasons-why-Twitter-strengthens-presence-in-indonesia/. [Accessed 19 06 2015].

[2] A. Ritter, S. Clark, Mausam and O. Etzioni, "Named Entity Recognition in Tweets: An Experimental Study," Stroudsburg, 2011.

[3] T. Sugitani, M. Shirakawa, T. Hara and S. Nishio, "Detecting Local Events by Analyzing Spatiotemporal Locality of Tweets," in 27th International Conference on Advanced Information Networking and Applications Workshops, Barcelona, 2013.

[4] R. Li, K. H. Lei, R. Khadiwala and K. C.-C. Chang, "TEDAS: a Twitter-based Event Detection and Analysis System," in 28th International Conference on Data Engineering, Washington, DC, 2012.

[5] D. Getz, Event Studies: Theory, Research and Policy for Planned Events, Burlington: Elsevier, 2007.

[6] M. Hasby and M. L. Khodra, "Optimal Path Finding Based on Traffic Information Extraction from Twitter," in International Conference on ICT for Smart Society (ICISS), Jakarta, 2013.

[7] A. Ritter, M. O. Etzioni and S. Clark, "Open Domain Event Extraction from Twitter," in 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 2012.

[8] R. Herbich and T. Graepel, Handbook of Natural Language Processing, Cambridge: CRC Press, 2010.

[9] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, 2nd ed., New Jersey: Pearson Prentice Hall, 2008.

[10] J. Lafferty, A. McCallum and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labelling Sequence Data," San Francisco, 2001.

[11] C. Sutton and A. McCallum, "An Introduction to Conditional Random Fields for Relational Learning," in Introduction to Statistical and Relational Learning, L. Getoor and B. Taskar, Eds., Massachusetts, MIT Press, 2007, pp. 93-118.

[12] T. Sakaki, M. Okazaki and Y. Matsuo, "Earthquake Shakes Twitter Users: Real Time Event Detection by Social Sensors," in 19th International Conference on World Wide Web, New York, 2010.

[13] O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, N. Schneider and N. A. Smith, "Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters," in Proceedings of NAACL, 2013.

[14] I. Lanin, "Kateglo," 11 06 2009. [Online]. Available: https://ivanlanin.wordpress.com/2009/06/11/kateglo/. [Accessed 25 05 2015].

[15] G. Keraf, Tatabahasa Indonesia, Nusa Indah, 1984.

[16] A. K. McCallum, "MALLET: A Machine Learning for Language Toolkit," 2002. [Online]. Available: http://mallet.cs.umass.edu. [Accessed 17 04 2015].