
(1)

AUTOMATIC MULTI-DOCUMENT SUMMARIZATION USING HYBRID ABSTRACTIVE-EXTRACTIVE SUMMARIZATION TECHNIQUE FOR

DOCUMENTS IN BAHASA INDONESIA

By Glorian Yapinus

12110008

A thesis submitted to the Faculty of

ENGINEERING AND INFORMATION TECHNOLOGY

in partial fulfillment of the requirements for the

BACHELOR’S DEGREE in

INFORMATION TECHNOLOGY

SWISS GERMAN UNIVERSITY EduTown BSD City

Tangerang 15339 Indonesia

July 2014

Revision after the Thesis Defense on 15th July 2014

(2)

STATEMENT BY THE AUTHOR

I hereby declare that this submission is my own work and to the best of my knowledge, it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at any educational institution, except where due acknowledgement is made in the thesis.

Glorian Yapinus

_____________________________________________

Student Date

Approved by:

Alva Erwin, ST, M.Sc., MTI

_____________________________________________

Thesis Advisor Date

Dr. Maulahikmah Galinium, S.Kom., M.Sc.

_____________________________________________

Thesis Co-Advisor Date

Dr. Ir. Gembong Baskoro, M.Sc

_____________________________________________

Dean Date

(3)

ABSTRACT

AUTOMATIC MULTI-DOCUMENT SUMMARIZATION USING

HYBRID ABSTRACTIVE-EXTRACTIVE SUMMARIZATION TECHNIQUE FOR DOCUMENTS

IN BAHASA INDONESIA

By Glorian Yapinus

Alva Erwin, ST, M.Sc., MTI, Advisor

Dr. Maulahikmah Galinium, S.Kom., M.Sc., Co-Advisor

SWISS GERMAN UNIVERSITY

This paper discusses the development of multi-document summarization for Indonesian documents using a hybrid abstractive-extractive summarization approach. Multi-document summarization is a technology that summarizes multiple documents and presents them as a single summary. The method used in this research, the hybrid abstractive-extractive summarization technique, combines WordNet-based text summarization (an abstractive technique) with title-word-based text summarization (an extractive technique). In an experiment with LSA as the comparison method, the proposed method successfully generated a well-compressed and readable summary with a fast processing time.

Keywords: Multi-Document Summarization; Abstractive Technique; Extractive Technique; Indonesian Documents

(4)

(5)

DEDICATION

I dedicate this thesis to God, to Indonesia, and to my beloved grandmother who passed away during the work of this thesis.

“Whatever you do, work heartily, as for the Lord and not for men” (Colossians 3:23)

(6)

ACKNOWLEDGEMENTS

First of all, I want to thank Jesus Christ who gave me strength, wisdom, and grace to overcome all of the obstacles during my study and this thesis work.

I would like to express my gratitude to my beloved parents for their generous love and support during my study. Without them, I would not have become as strong and mature as I am now.

Many thanks to the SPARKLING family for their relentless support and prayers. They have been a great morale booster in difficult times.

I would like to thank my thesis advisor, Mr. Alva Erwin, and co-advisor, Mr. Maulahikmah Galinium, for their help and guidance during the work of this thesis.

A big thanks to Wahyu Muliady from Akon Teknologi for his help in technical matters in this thesis work.

Special thanks to my ex-teaching assistants, Michael and Teresa Vania, who patiently taught me the basic yet foundational programming skills during my study.

Warmest thanks to all of the lecturers from the IT department for their valuable knowledge, which eventually helped me to solve many technical problems during this thesis work.

I would like to thank James Hunt from the English department for correcting any errors in this thesis.

I would also like to thank all of my friends from the IT 2010 and from other faculties for their help, competition, and togetherness during my study.

Last but not least, I want to thank all the people who are not mentioned in this section but contributed greatly to my study and this thesis work.

(7)

TABLE OF CONTENTS

STATEMENT BY THE AUTHOR ... 2

ABSTRACT ... 3

DEDICATION ... 5

ACKNOWLEDGEMENTS ... 6

LIST OF FIGURES ... 10

LIST OF TABLES ... 12

CHAPTER 1 – INTRODUCTION ... 13

1.1 General Statement of Problem Area ... 13

1.2 Research Purpose ... 14

1.3 Research Scope and Limitations ... 14

1.4 Research Problem ... 15

1.5 Significance of Study ... 16

1.6 Theoretical Perspective ... 16

1.7 Research Question and Hypothesis ... 17

1.7.1 Questions ... 17

1.7.2 Hypothesis ... 18

1.8 Methodology ... 18

1.9 Data Analysis ... 20

CHAPTER 2 – LITERATURE REVIEW ... 21

2.1 Related Works ... 21

2.1.1 Indonesian Automated Text Summarization by Budhi, et al. (2007) ... 21

2.1.2 Automatic Text Summarization Based on Semantic Analysis Approach for Documents in Bahasa Indonesia by Tardan (2013) ... 22

2.1.3 Multi-Document Summarization for Documents in Bahasa Indonesia Using Centroid-Based Summarization and K-Means-Based Summarization by Linggakusuma (2008) ... 23

(8)

2.1.4 Multi-Document Summarization Using Sentence Extraction by Goldstein, et al. (2000) ... 23

2.1.5 Multi-Document Summarization Using Latent Semantic Analysis by Steinberger, et al. (2007) ... 24

2.2 Main Theories ... 25

2.2.1 Title Word Feature ... 25

2.2.2 WordNet Concept ... 26

2.2.3 General Techniques for Pre-Processing in Text Summarization ... 27

2.2.4 Cosine Similarity ... 28

2.3 Tools ... 28

2.3.1 Selenium ... 28

2.3.2 MongoDB ... 29

2.3.3 Apache Solr ... 31

CHAPTER 3 – PROPOSED SYSTEM ... 33

3.1 The System Architecture ... 33

3.1.1 Concatenation ... 34

3.1.2 Pre-processing ... 34

3.1.3 Feature Scoring ... 37

3.1.4 Feature Ranking & Extraction ... 38

3.2 WordNet Development ... 39

3.2.1 WordNet Database Structure ... 39

3.2.2 Indonesian Dictionary Word Scraper ... 40

3.2.3 Word Synonyms ... 42

3.2.4 Category Weighting ... 43

3.3 Latent Semantic Analysis (LSA) Summarization with Solr ... 46

3.4 The Summary Evaluation Methods ... 48

3.4.1 Human-Based Evaluation ... 48

3.4.2 Compression Ratio ... 48

3.4.3 Processing Time Calculation ... 49

CHAPTER 4 – RESULT AND DISCUSSION ... 50

(9)

4.1 WordNet Development Result ... 50

4.1.1 Dictionary Collection ... 50

4.1.2 Words Collection ... 51

4.1.3 Category Collection ... 51

4.2 Hybrid Abstractive-Extractive Summarization Technique Result ... 52

4.2.1 Summarization Result ... 52

4.2.2 Experiment Result ... 59

4.3 LSA Summarization with Solr Result ... 60

4.3.1 Summarization Result ... 60

4.3.2 Experiment Result ... 62

4.4 Human-Based Evaluation Result ... 64

4.4.1 Hybrid Abstractive-Extractive Summarization Technique Evaluation Result ... 64

4.4.2 LSA Summarization with Solr Evaluation Result ... 65

4.5 Overall Result of Experiments ... 66

CHAPTER 5 – CONCLUSIONS AND RECOMMENDATION ... 67

5.1 Conclusions ... 67

5.2 Future Works ... 68

5.3 Contributions ... 68

GLOSSARY ... 69

REFERENCES ... 70

APPENDIX A – PUBLISHED WORKS ... 72

CURRICULUM VITAE ... 77

(10)

LIST OF FIGURES

Figure 1.1 The Research Scope Overview ... 15

Figure 2.1 Selenium IDE Interface ... 29

Figure 2.2 MongoDB System Architecture (Suter 2012) ... 30

Figure 2.3 MongoDB Benchmark Result (Bengtsson 2010) ... 31

Figure 3.1 The Proposed System Architecture ... 33

Figure 3.2 The Stop Words Removal Flowchart ... 35

Figure 3.3 The N-Gram Detection and Extraction Flowchart ... 36

Figure 3.4 Feature Scoring Flowchart ... 38

Figure 3.5 Feature Ranking and Extraction Flowchart ... 39

Figure 3.6 WordNet Document Structure ... 39

Figure 3.7 KBBI Website Interface ... 41

Figure 3.8 The Example of Solr Document ... 47

Figure 4.1 Document Example of Dictionary Collection ... 50

Figure 4.2 Document Example of Words Collection ... 51

Figure 4.3 Document Example of Category Collection ... 51

Figure 4.4 Summary Example Generated by Hybrid Abstractive-Extractive Summarization Technique ... 52

Figure 4.5 The First Original Article Example ... 53

Figure 4.6 The Second Original Article Example ... 54

Figure 4.7 The Third Original Article Example ... 55

Figure 4.8 The Translated First Original Article Example ... 56

Figure 4.9 The Translated Second Original Article Example ... 57

Figure 4.10 The Translated Third Original Article Example ... 58

Figure 4.11 The Translated Summarization Result ... 58

(11)

Figure 4.12 The LSA Summarization with Solr Result ... 61

Figure 4.13 The Translated LSA Summarization with Solr Result ... 62

(12)

LIST OF TABLES

Table 4.1 The Experiment Datasets ... 59

Table 4.2 The Experiment Result ... 60

Table 4.3 The LSA Summarization with Solr Experiment Result ... 63

Table 4.4 Hybrid Abstractive Extractive Summarization Technique Evaluation Result ... 65

Table 4.5 LSA Summarization with Solr Evaluation Result ... 66

Table 4.6 Overall Experiments Result ... 66