Focused Web Crawler with Distributed System


ABSTRACT

One technique for collecting information in the form of articles from the Internet is to use a crawler. To collect articles on a particular topic only, a crawler can use the focused crawling algorithm together with a classification method such as naive Bayes. Article collection consists of an extraction stage and a classification stage. Extraction determines the content of an article so that the article can be classified as belonging to the target topic or not. To reduce the time needed to collect information, the crawler can be designed as a distributed system and combined with multithreading and with the larger site first algorithm, which orders the sites so that the largest are crawled first. The experiments were carried out with different numbers of threads and different bandwidths. In addition to measuring the crawling results, the researcher also measured heap memory and CPU usage during the crawling process. The results show that crawling with the larger site first algorithm yields more than crawling without it. The same holds for threads and bandwidth: the larger they are, the greater the results. However, several factors can reduce performance even when many threads are used. For this reason, the effective number of threads in this study is 500.

Keywords: focused crawler, distributed system, naive bayes, multithreading, larger site first
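
The classification stage named in the abstract can be illustrated with a small multinomial naive Bayes filter. This is only a minimal sketch, not the implementation used in the study: the class name `NaiveBayesTopicFilter`, the labels `on_topic` and `off_topic`, the tokenizer, the add-one smoothing, and the toy training sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTopicFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing (assumed variant)."""

    def __init__(self):
        self.doc_counts = Counter()  # label -> number of training documents
        self.word_counts = {}        # label -> Counter of word frequencies
        self.vocabulary = set()      # every word seen during training

    def train(self, text, label):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocabulary.add(word)

    def log_score(self, text, label):
        # log P(label) + sum of log P(word | label) over known words
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        denominator = sum(counts.values()) + len(self.vocabulary)
        for word in tokenize(text):
            if word in self.vocabulary:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / denominator)
        return score

    def classify(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Toy training data, invented for illustration only.
topic_filter = NaiveBayesTopicFilter()
topic_filter.train("the team won the football match with a late goal", "on_topic")
topic_filter.train("the striker scored a goal in the league match", "on_topic")
topic_filter.train("the central bank raised interest rates again", "off_topic")
topic_filter.train("stock market shares fell as inflation rose", "off_topic")

label = topic_filter.classify("a late goal decided the match")
```

In a focused crawler, the text passed to `classify` would be the article body produced by the extraction stage, and only pages labelled as on topic would be stored.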


FOCUSED WEB CRAWLER WITH DISTRIBUTED SYSTEM

ABSTRACT

One technique for collecting information in the form of articles from the Internet is to use a web crawler. To collect articles on a particular topic only, a web crawler can use the focused crawling algorithm together with a classification method such as naive Bayes. Article collection consists of two stages: content extraction and classification. Extraction determines the content of an article so that the article can be classified as belonging to the target topic or not. To reduce the time needed to collect information, the crawler can be designed as a distributed system and combined with multithreading and with the larger site first algorithm, which determines which sites are crawled first. The research was conducted with different numbers of threads and different Internet bandwidths. In addition to measuring the crawling results, the researchers also measured heap memory and CPU usage during crawling. The results show that crawling with the larger site first algorithm yields more than crawling without it. Likewise, the more threads and the higher the bandwidth, the greater the results. However, a number of factors can reduce performance even when many threads are used. For this reason, the effective number of threads in this research is 500.

Keywords: focused crawler, distributed system, naive bayes, multithreading, larger site first
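
The combination of larger site first ordering and multithreading described in the abstract can be sketched as follows. This is a hedged illustration under stated assumptions, not the system built in this research: the abstract does not say how site size is measured, so the sketch assumes it is the number of pending URLs per host; `fetch_stub` is a hypothetical placeholder for downloading, extracting, and classifying a page; the `example` hosts are invented; and `max_workers=4` is chosen only to keep the example small.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def larger_site_first(urls):
    """Order URLs so that hosts with more pending pages come first."""
    pending = defaultdict(list)
    for url in urls:
        pending[urlparse(url).netloc].append(url)
    # Sort hosts by pending count (descending); ties are broken by host name.
    hosts = sorted(pending, key=lambda host: (-len(pending[host]), host))
    return [url for host in hosts for url in pending[host]]


def fetch_stub(url):
    """Hypothetical placeholder for download, extraction, and classification."""
    return url, len(url)


def crawl(urls, fetch=fetch_stub, max_workers=4):
    """Fetch the frontier with a thread pool, largest sites first."""
    ordered = larger_site_first(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() submits tasks in frontier order and returns results in that order.
        return list(pool.map(fetch, ordered))


frontier = [
    "http://small.example/a",
    "http://big.example/1",
    "http://medium.example/x",
    "http://big.example/2",
    "http://medium.example/y",
    "http://big.example/3",
]
ordered = larger_site_first(frontier)
results = crawl(frontier)
```

Here the three `big.example` pages are scheduled before the two `medium.example` pages and the single `small.example` page. In a distributed setting, each node could apply the same ordering to its own share of the frontier, which is again an assumption rather than a description of the studied system.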