Porter Stemming for Indonesian
Stemming is the process of stripping the affixes from a word, namely prefixes, infixes and
suffixes. The result of stemming is the root word (stem), that is, the part of the word that
remains once its affixes (prefixes and suffixes) have been removed.
One of the algorithms used for stemming is the Porter Stemmer. The Porter Stemmer itself is an
algorithm for removing the common morphological and inflexional endings from English words.
It consists of a set of conditions and action rules.
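To make the idea of condition and action rules concrete, the sketch below encodes step 1a of the English Porter algorithm (SSES to SS, IES to I, SS to SS, S to nothing) as an ordered PHP table. The function name `porterStep1a` and the table layout are illustrative assumptions, not taken from any published implementation; only the first matching rule is applied.

```php
<?php
// Step 1a of the English Porter stemmer as (suffix => replacement) rules.
// Rules are tried in order and only the first match is applied.
function porterStep1a(string $word): string
{
    $rules = [
        'sses' => 'ss', // caresses -> caress
        'ies'  => 'i',  // ponies   -> poni
        'ss'   => 'ss', // caress   -> caress (shields the word from the "s" rule)
        's'    => '',   // cats     -> cat
    ];
    foreach ($rules as $suffix => $replacement) {
        $len = strlen($suffix);
        // Condition: the word ends with this suffix.
        if (strlen($word) >= $len && substr($word, -$len) === $suffix) {
            // Action: swap the suffix for its replacement.
            return substr($word, 0, strlen($word) - $len) . $replacement;
        }
    }
    return $word; // no rule applies
}

echo porterStep1a('caresses'), "\n"; // caress
echo porterStep1a('ponies'), "\n";   // poni
echo porterStep1a('cats'), "\n";     // cat
```

Keeping the rules in an ordered table makes the precedence explicit: the longest suffixes come first, so "caress" is never reduced to "cares".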
The following is sample PHP source code for this Porter Stemmer algorithm.
<html>
<head><title>Porter Indonesia</title> </head>
<body>
<form action="stem_indonesia.php" method="POST"> Enter a word: <input type="text" name="word"> <input type="submit" value="Check">
</form>
</body> </html>
The code above displays a form for entering the term (word) to be stemmed.
The stemming itself is done by the following code.
<?php
/**
 * File stem_indonesia.php
 *
 * @filesource stem_indonesia.php
 * @author Kabul Kurniawan
 * @package porter_indonesia
 * @license GPL
 * @version 0.0.1
 */
$dbServer = "localhost";
$dbUser = "root";
$dbPass = "";
$dbName = "kamus_porter";
// mysqli is used because the mysql_* functions were removed in PHP 7.
$dbKoneksi = mysqli_connect($dbServer, $dbUser, $dbPass, $dbName);
if (!$dbKoneksi) {
    exit("Database connection failed");
}
$word = isset($_POST['word']) ? trim($_POST['word']) : "";

// True when $kata has exactly one row in the dictionary table $tabel.
// The prepared statement keeps the submitted word out of the SQL text.
function inDictionary($db, $tabel, $kata) {
    $stmt = mysqli_prepare($db, "SELECT name FROM $tabel WHERE name = ?");
    mysqli_stmt_bind_param($stmt, "s", $kata);
    mysqli_stmt_execute($stmt);
    mysqli_stmt_store_result($stmt);
    $jumlah = mysqli_stmt_num_rows($stmt);
    mysqli_stmt_close($stmt);
    return $jumlah == 1;
}

// Strips a first-order prefix (meng-, meny-, mem-, men-, di-, ...) and
// restores the initial consonant that the prefix assimilated.
function stripPrefix1($kata) {
    $p4 = substr($kata, 0, 4);
    $p3 = substr($kata, 0, 3);
    if ($p4 == "meng" or $p4 == "peng") {
        return substr($kata, 4);
    } else if ($p4 == "meny" or $p4 == "peny") {
        return "s" . substr($kata, 4);
    } else if (in_array($p3, array("mel", "mer", "mew", "mey"), true)) {
        return substr($kata, 2);
    } else if (substr($kata, 0, 2) == "di") {
        return substr($kata, 2);
    } else if ($p3 == "mem" or $p3 == "pem") {
        // a vowel after mem-/pem- means an initial "p" was dropped
        $dasar = substr($kata, 3);
        if (in_array(substr($kata, 3, 1), array("a", "i", "u", "e", "o"), true)) {
            return "p" . $dasar;
        }
        return $dasar;
    } else if ($p3 == "men" or $p3 == "pen") {
        return "t" . substr($kata, 3);
    }
    return $kata;
}

// Strips the suffix -kan, -i or -an; returns the word unchanged otherwise.
function stripSuffix($kata) {
    if (substr($kata, -3) == 'kan') {
        return substr($kata, 0, -3);
    } elseif (substr($kata, -1) == 'i') {
        return substr($kata, 0, -1);
    } elseif (substr($kata, -2) == 'an') {
        return substr($kata, 0, -2);
    }
    return $kata;
}

// STEP 1 (check the particle and prefixed-particle dictionaries)
if (inDictionary($dbKoneksi, "dsr_partikel", $word)) {
    // the word is already a root: print it as is
    echo $word;
    exit;
} else if (inDictionary($dbKoneksi, "dsr_partikel_prefiks", $word) && strlen($word) > 4) {
    // strip the prefix
    echo stripPrefix1($word);
    exit;
} else {
    // strip the particle
    if (in_array(substr($word, -3), array('kah', 'lah', 'pun', 'tah'), true)) {
        $word2 = substr($word, 0, -3);
    } else {
        $word2 = $word;
    }
    echo "Particle removal = " . $word2 . "<br>";
}

// STEP 2 (check the possessive and prefixed-possessive dictionaries)
if (inDictionary($dbKoneksi, "dsr_milik", $word2)) {
    echo $word2;
    exit;
} else if (inDictionary($dbKoneksi, "dsr_milik_prefiks", $word2) && strlen($word2) > 4) {
    // strip the prefix
    echo stripPrefix1($word2);
    exit;
} else {
    // strip the possessive pronoun
    if (substr($word2, -2) == 'ku' || substr($word2, -2) == 'mu') {
        $word3 = substr($word2, 0, -2);
    } else if (substr($word2, -3) == 'nya') {
        $word3 = substr($word2, 0, -3);
    } else {
        $word3 = $word2;
    }
    echo "Possessive removal = " . $word3 . "<br>";
}

// STEP 3 (check the prefix-1 and suffixed prefix-1 dictionaries)
// dsr_prefiks1 is assumed by analogy with dsr_prefiks2 in step 4.
if (inDictionary($dbKoneksi, "dsr_prefiks1", $word3)) {
    echo $word3;
    exit;
} else if (inDictionary($dbKoneksi, "dsr_prefiks1_sufiks", $word3) && strlen($word3) > 4) {
    // strip the suffix
    echo stripSuffix($word3);
    exit;
} else {
    // strip the first-order prefix
    if (substr($word3, 0, 3) == "pel") {
        // offset 4 follows the source listing; step 4 strips "pel" with offset 3
        $word4 = substr($word3, 4);
    } else {
        $word4 = stripPrefix1($word3);
    }
    echo "Prefix-1 removal = " . $word4 . "<br>";
}

// STEP 4 (check the prefix-2 and suffixed prefix-2 dictionaries)
if (inDictionary($dbKoneksi, "dsr_prefiks2", $word4)) {
    echo $word4;
    exit;
} else if (inDictionary($dbKoneksi, "dsr_prefiks2_sufiks", $word4) && strlen($word4) > 4) {
    // strip the suffix
    echo stripSuffix($word4);
    exit;
} else {
    // strip the second-order prefix
    if (substr($word4, 0, 3) == "ber" or substr($word4, 0, 3) == "per") {
        $word5 = substr($word4, 3);
    } else if (substr($word4, 0, 2) == "be") {
        if (substr($word4, 3) == "ajar") {
            $word5 = substr($word4, 3); // "belajar" becomes "ajar"
        } else {
            $word5 = substr($word4, 2);
        }
    } else if (substr($word4, 0, 2) == "se" or substr($word4, 0, 2) == "ke") {
        $word5 = substr($word4, 2);
    } else if (substr($word4, 0, 3) == "pel" or substr($word4, 0, 3) == "ter") {
        $word5 = substr($word4, 3);
    } else {
        $word5 = $word4;
    }
    echo "Prefix-2 removal = " . $word5 . "<br>";
}

// STEP 5 (check the suffix dictionary, then strip the suffix)
if (inDictionary($dbKoneksi, "dsr_sufiks", $word5)) {
    echo $word5;
    exit;
} else {
    // strip the suffix
    if (substr($word5, -3) == 'kan' && strlen($word5) > 4) {
        $dasar1 = substr($word5, 0, -3);
    } elseif (substr($word5, -1) == 'i') {
        $dasar1 = substr($word5, 0, -1);
    } elseif (substr($word5, -2) == 'an') {
        $dasar1 = substr($word5, 0, -2);
    } else {
        $dasar1 = $word5; // no suffix to strip
    }
    echo "Suffix removal = " . $dasar1 . "<br>";
    exit;
}
?>
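The script above needs the MySQL dictionaries before it can run. To experiment with the suffix-stripping order on its own (particle, then possessive pronoun, then derivational suffix), the same endings can be sketched as pure functions with no database. The names `stripOne` and `stripSuffixes` are hypothetical, and the minimum stem length of 2 is an arbitrary assumption. Without the dictionary checks, this sketch will over-stem roots that merely end in one of these letter sequences.

```php
<?php
// Dictionary-free sketch of the suffix side of the script:
// particle (-kah, -lah, -pun, -tah), then possessive (-ku, -mu, -nya),
// then derivational suffix (-kan, -i, -an).

// Removes the first suffix in $suffixes that $word ends with, provided
// at least $minStem characters remain. Returns $word unchanged otherwise.
function stripOne(string $word, array $suffixes, int $minStem = 2): string
{
    foreach ($suffixes as $suffix) {
        $len = strlen($suffix);
        if (strlen($word) - $len >= $minStem && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

// Applies the three suffix classes in the same order as the script.
function stripSuffixes(string $word): string
{
    $word = stripOne($word, ['kah', 'lah', 'pun', 'tah']); // particle
    $word = stripOne($word, ['ku', 'mu', 'nya']);          // possessive
    return stripOne($word, ['kan', 'i', 'an']);            // suffix
}

echo stripSuffixes('bukunyalah'), "\n"; // buku
echo stripSuffixes('makanan'), "\n";    // makan
```

Because each class is applied at most once and in a fixed order, "bukunyalah" loses "-lah" first and "-nya" second, which mirrors steps 1, 2 and 5 of the script.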
The Porter Stemming Algorithm
This page was completely revised in Jan 2006. (An earlier edition is also available.)
(It has also been translated into Serbo-Croat by Jovana Milutinovich from Web Geeks Resources. Thank
you, Jovana!)
This is the ‘official’ home page for distribution of the Porter Stemming Algorithm, written and
maintained by its author, Martin Porter.
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner
morphological and inflexional endings from words in English. Its main use is as part of a term
normalisation process that is usually done when setting up Information Retrieval systems.
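As a rough illustration of that normalisation step, the sketch below stems terms both when documents are indexed and when a query is looked up, so that different inflected forms land on the same index entry. The `toyStem` function is a deliberately tiny stand-in (it only strips a trailing "ing" or "s"), not the Porter algorithm, and `buildIndex` and `search` are hypothetical names.

```php
<?php
// Stand-in stemmer: NOT the Porter algorithm, just enough for the demo.
function toyStem(string $term): string
{
    $term = strtolower($term);
    foreach (['ing', 's'] as $suffix) {
        $len = strlen($suffix);
        // Only strip when more than two characters would remain.
        if (strlen($term) > $len + 2 && substr($term, -$len) === $suffix) {
            return substr($term, 0, -$len);
        }
    }
    return $term;
}

// Builds an inverted index: stemmed term => list of document ids.
function buildIndex(array $docs): array
{
    $index = [];
    foreach ($docs as $id => $text) {
        foreach (str_word_count($text, 1) as $token) {
            $index[toyStem($token)][$id] = true; // ids used as keys to avoid duplicates
        }
    }
    return array_map('array_keys', $index);
}

// Normalises the query term the same way before the lookup.
function search(array $index, string $term): array
{
    return $index[toyStem($term)] ?? [];
}

$index = buildIndex([
    1 => 'Indexing documents',
    2 => 'The index of a document',
]);
print_r(search($index, 'documents'));
```

The important design point is that the same normalisation function is applied on both sides; a stemmer used only at index time would make the stored terms unreachable from unstemmed queries.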
History
The original stemming algorithm paper was written in 1979 in the Computer Laboratory, Cambridge
(England), as part of a larger IR project, and appeared as Chapter 6 of the final project report,
C.J. van Rijsbergen, S.E. Robertson and M.F. Porter, 1980. New models in probabilistic information
retrieval. London: British Library. (British Library Research and Development Report, no. 5587).
With van Rijsbergen’s encouragement, it was also published in,
M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3) pp 130−137.
And since then it has been reprinted in
Karen Sparck Jones and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan
Kaufmann, ISBN 1-55860-454-5.
The original stemmer was written in BCPL, a language once popular, but now defunct. For the first few
years after 1980 it was distributed in its BCPL form, via the medium of punched paper tape. Versions in
other languages soon began to appear, and by 1999 it was being widely used, quoted and adapted.
Unfortunately there were numerous variations in functionality among these versions, and this web page
was set up primarily to ‘put the record straight’ and establish a definitive version for distribution.
Encodings
The ANSI C version that heads the table below is exactly equivalent to the original BCPL version. The
BCPL version did, however, differ in three minor points from the published algorithm and these are
clearly marked in the downloadable ANSI C version. They are discussed further below.
This ANSI C version may be regarded as definitive, in that it now acts as a better definition of the
algorithm than the original published paper.
Over the years, I have received many encodings from other workers, and they are also presented below. I
have reasonable confidence that all these versions are correctly encoded.
language                author              affiliation                       received  notes
ANSI C                  me
ANSI C thread safe      me
java                    me
Perl                    me
Perl                    Daniel van Balen                                      Oct 1999  slightly faster?
python                  Vivake Gupta                                          Jan 2001
                        Hazelwood                                             2001
Csharp .NET compliant   Leif Azzopardi      University of Paisley, Scotland   Nov 2002
Common Lisp             Steven M. Haflich   Franz Inc                         Mar 2002
Ruby                    Ray Pereda          www.raypereda.com                 Jan 2003  github link
Visual Basic VB6        Navonil Mustafee    Brunel University                 Apr 2003
Delphi                  Jo Rabin                                              Apr 2004
Javascript              ‘Andargor’          www.andargor.com                  Jul 2004  substantial revisions by Christopher McKenzie
(See ‘Links’ below.)