Porter Stemming

(1)

Porter Stemming untuk Bahasa Indonesia

Proses stemming merupakan suatu proses memecahkan setiap imbuhan dari suatu kata seperti

awalan(prefiks, sisipan(infiks) ataupun akhiran(sufiks), hasil dari proses stemming ini akan

menghasilkan akar kata (stem) yaitu bagian dari kata yang tersisa setelah dihilangkan

imbuhannya (awalan dan akhiran).

Salah satu algoritma yang digunakan dalam proses stemming adalah Algoritma Porter Stemmer.

Porter Stemmer sendiri merupakan algoritma penghilangan akhiran morphological dan

infleksional yang umum dari bahasa Inggris. Algoritma ini terdiri dari himpunan kondisi atau

action rules.

Berikut merupakan contoh source code dengan menggunakan PHP dari Algoritma Porter

Stemmer ini.

<html>

<head><title>Porter Indonesia</title> </head>

<body>

<form action="stem_indonesia.php" method="POST"> Masukkan Kata : <input type="text" name="word"> <input type="submit" value="Cek">

</form>

</body> </html>

Coding di atas akan menampilkan form untuk inputan term (kata) yang akan di stemming.

Sedangkan proses stemmingnya ada dalam coding berikut.

(2)

/**

* File stem_indonesia.php *

* @filesource stem_indonesia.php * @author Kabul Kurniawan

* @package porter_indonesia * @license GPL * @version 0.0.1 */ $dbServer = "localhost"; $dbUser = "root"; $dbPass = "";

$dbKoneksi = mysql_connect($dbServer, $dbUser, $dbPass); $dbName = "kamus_porter";

mysql_select_db($dbName);

$word = $_POST['word'];

//STEP 1 (Cek Kamus partikel & partikel berprefiks)

$partikel = mysql_query("SELECT * FROM dsr_partikel WHERE name ='$word'"); if(mysql_num_rows($partikel)==1){

//langsung tulis $dasar =$word;

(3)

echo $dasar; exit;

}else {

$partikel_berprefiks=mysql_query("SELECT * FROM dsr_partikel_prefiks WHERE name='$word'");

if(mysql_num_rows($partikel_berprefiks)==1 && strlen($word) > 4){ //hapus prefiks

if(substr($word,0,4)=="meng" or substr($word,0,4)=="peng"){ echo substr($word,4);

}else if(substr($word,0,4)=="meny" or substr($word,0,4)=="peny"){ $dasar =substr($word,4);

echo "s".$dasar;

}else if(substr($word,0,3)=="mel" or substr($word,0,3)=="mer" or substr($word,0,3)=="mew" or substr($word,0,3)=="mey"){

echo substr($word,2);

}else if(substr($word,0,2)=="di"){ echo substr($word,2);

}else if(substr($word,0,3)=="mem" or substr($word,0,3)=="pem"){ if(substr($word,3,1)=="a" or substr($word,3,1)=="i" or substr($word,3,1)=="u" or substr($word,3,1)=="e" or substr($word,3,1)=="o"){ $dasar =substr($word,3);

echo "p".$dasar; }else{

$dasar =substr($word,3); echo $dasar;}

(4)

$dasar =substr($word,4); echo $dasar;

}

else if(substr($word,0,3)=="men" or substr($word,0,3)=="pen" ){ $dasar =substr($word,3); echo "t".$dasar; } exit; } else{ //hapus partikel

if((substr($word, -3) == 'kah' )||( substr($word, -3) == 'lah' )||( substr($word, -3) == 'pun' )||( substr($word, -3) == 'tah' )){

$word2 = substr($word, 0, -3);} else{

$word2 = $word;

}

echo "Penghapusan partikel = ".$word2."<br>"; }

}

//STEP 2 (Cek Kamus milik & milik berprefiks)

$milik = mysql_query("SELECT * FROM dsr_milik WHERE name='$word2'"); if(mysql_num_rows($milik)==1){

//langsung tulis $dasar =$word2;

(5)

echo $dasar; exit;

}else {

$milik_berprefiks = mysql_query("SELECT * FROM dsr_milik_prefiks WHERE name='$word2'");

if(mysql_num_rows($milik_berprefiks)==1 && strlen($word2) > 4){ //hapus prefiks

if(substr($word2,0,4)=="meng" or substr($word2,0,4)=="peng"){ echo substr($word2,4);

}else if(substr($word2,0,4)=="meny" or substr($word2,0,4)=="peny"){ $dasar =substr($word2,4);

echo "s".$dasar;

}else if(substr($word2,0,3)=="mel" or substr($word2,0,3)=="mew" or substr($word2,0,3)=="mer" or substr($word2,0,3)=="mey"){

echo substr($word2,2);

}else if(substr($word2,0,2)=="di"){ echo substr($word2,2);

}else if(substr($word2,0,3)=="mem" or substr($word2,0,3)=="pem" ){ if(substr($word2,3,1)=="a" or substr($word2,3,1)=="i" or substr($word2,3,1)=="u" or substr($word2,3,1)=="e" or substr($word2,3,1)=="o"){ $dasar =substr($word2,3);

echo "p".$dasar; }else{

$dasar =substr($word2,3); echo $dasar;}

(6)

$dasar =substr($word2,4); echo $dasar;

}else if(substr($word2,0,3)=="men" or substr($word2,0,3)=="pen" ){ $dasar =substr($word2,3); echo "t".$dasar; } exit; } else{ //hapus milik

if((substr($word2, -2)== 'ku')||(substr($word2, -2)== 'mu')){ $word3 = substr($word2, 0, -2);

}else if((substr($word2, -3)== 'nya')){ $word3 = substr($word2,0, -3); }

else{

$word3 = $word2; }

echo "Penghapusan milik = ".$word3."<br>"; }

}

//STEP 3 (Cek Kamus prefiks1 & perfiks1 bersufiks)

(7)

if(mysql_num_rows($prefiks1)==1){ //langsung tulis $dasar =$word3; echo $dasar; exit; }else {

$prefiks1_sufiks = mysql_query("SELECT * FROM dsr_prefiks1_sufiks WHERE name='$word3'");

if(mysql_num_rows($prefiks1_sufiks)==1 && strlen($word3) > 4){ //hapus sufiks

if (substr($word4, -3)== 'kan'){ $dasar = substr($word4, 0, -3); echo $dasar;}

elseif (substr($word4, -1)== 'i'){ $dasar = substr($word4, 0, -1); echo $dasar;}

elseif (substr($word4, -2)== 'an'){ $dasar = substr($word4, 0, -2); echo $dasar;} exit; } else{ //hapus prefiks1 if(substr($word3,0,4)=="meng" or substr($word3,0,4)=="peng"){ $word4 = substr($word3,4);

(8)

$dasar = substr($word3,4); $word4 = "s".$dasar;

}else if(substr($word3,0,3)=="mel" or substr($word3,0,3)=="mew" or substr($word3,0,3)=="mer" or substr($word3,0,3)=="mey"){

$word4 = substr($word3,2); }else if(substr($word3,0,2)=="di"){ $word4 = substr($word3,2);

}else if(substr($word3,0,3)=="mem" or substr($word2,0,3)=="pem"){ if(substr($word3,3,1)=="a" or substr($word3,3,1)=="i" or substr($word3,3,1)=="u" or substr($word3,3,1)=="e" or substr($word3,3,1)=="o"){ $dasar =substr($word3,3);

$word4 = "p".$dasar; }else{

$dasar =substr($word3,3); $word4 = $dasar;}

}else if(substr($word3,0,3 == "pel")){ $dasar =substr($word3,4);

echo $dasar;

}else if(substr($word3,0,3)=="men" or substr($word3,0,3)=="pen"){ $dasar =substr($word3,3);

$word4 = "t".$dasar; }else{

$word4 = $word3;

(9)

}

echo "Penghapusan prefiks1= ".$word4."<br>";

} }

$prefiks2 = mysql_query("SELECT * FROM dsr_prefiks2 WHERE name='$word4'");

if(mysql_num_rows($prefiks2)==1){ //langsung tulis $dasar =$word4; echo $dasar; exit; }else {

$prefiks2_sufiks = mysql_query("SELECT * FROM dsr_prefiks2_sufiks WHERE name='$word4'");

if(mysql_num_rows($prefiks2_sufiks)==1 && strlen($word4) > 4){ //hapus sufiks

if (substr($word4, -3)== 'kan'){ $dasar = substr($word4, 0, -3); echo $dasar;}

elseif (substr($word4, -1)== 'i'){ $dasar = substr($word4, 0, -1); echo $dasar;}

(10)

$dasar = substr($word4, 0, -2); echo $dasar;} exit; } else{ //hapus prefiks2 if(substr($word4,0,3)=="ber" or substr($word4,0,3)=="per"){ $word5 = substr($word4,3); }else if(substr($word4,0,2)=="be"){ if(substr($word4,3)=="ajar"){ $dasar =substr($word4,3); $word5 = $dasar; }else{ $dasar =substr($word4,2); $word5 = $dasar;}

}else if(substr($word4,0,2)=="se" or substr($word4,0,2)=="ke"){ $word5 = substr($word4,2);

}else if(substr($word4,0,3) == "pel" or substr($word4,0,3) == "ter"){ $word5 =substr($word4,3);

} else{

$word5 = $word4; }

echo "Penghapusan prefiks2= ".$word5."<br>";

(11)

} }

$sufiks = mysql_query("SELECT * FROM dsr_sufiks WHERE name='$word5'"); $prefiks2_sufiks = mysql_query("SELECT * FROM dsr_prefiks2_sufiks WHERE name='$word5'"); if(mysql_num_rows($sufiks)==1){ //langsung tulis $dasar =$word5; echo $dasar; exit; }else{ //hapus sufiks

if (substr($word5, -3)== 'kan' && strlen($word5) > 4){ $dasar1 = substr($word5, 0, -3);

}

elseif (substr($word5, -1)== 'i'){ $dasar1 = substr($word5, 0, -1); }

elseif (substr($word5, -2)== 'an'){ $dasar1 = substr($word5, 0, -2); }

echo "Penghapusan sufiks= ".$dasar1."<br>"; exit;

(12)

?>

The Porter Stemming Algorithm

This page was completely revised Jan 2006. The earlier edition is

here

.

(It has also been translated into Serbo-Croat by Jovana Milutinovich from Web Geeks Resources. Thank

you, Jovana!)

This is the ‘official’ home page for distribution of the Porter Stemming Algorithm, written and

maintained by its author, Martin Porter.

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner

morphological and inflexional endings from words in English. Its main use is as part of a term

normalisation process that is usually done when setting up Information Retrieval systems.

History

The original stemming algorithm paper was written in 1979 in the Computer Laboratory, Cambridge

(England), as part of a larger IR project, and appeared as Chapter 6 of the final project report,

C.J. van Rijsbergen, S.E. Robertson and M.F. Porter, 1980. New models in probabilistic information

retrieval. London: British Library. (British Library Research and Development Report, no. 5587).

With van Rijsbergen’s encouragement, it was also published in,

M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3) pp 130−137.

And since then it has been reprinted in

Karen Sparck Jones and Peter Willet, 1997, Readings in Information Retrieval, San Francisco: Morgan

Kaufmann, ISBN 1-55860-454-4.

The original stemmer was written in BCPL, a language once popular, but now defunct. For the first few

years after 1980 it was distributed in its BCPL form, via the medium of punched paper tape. Versions in

other languages soon began to appear, and by 1999 it was being widely used, quoted and adapted.

Unfortunately there were numerous variations in functionality among these versions, and this web page

was set up primarily to ‘put the record straight’ and establish a definitive version for distribution.

(13)

Encodings

The ANSI C version that heads the table below is exactly equivalent to the original BCPL version. The

BCPL version did, however, differ in three minor points from the published algorithm and these are

clearly marked in the downloadable ANSI C version. They are discussed further below.

This ANSI C version may be regarded as definitive, in that it now acts as a better definition of the

algorithm than the original published paper.

Over the years, I have received many encoding from other workers, and they are also presented below. I

have a reasonable confidence that all these versions are correctly encoded.

language

author

affiliation

received

notes

ANSI C

me

ANSI C thread

safe

me

java

me

Perl

me

Perl

Daniel van

Balen

Oct

1999

slightly

faster?

python

Vivake Gupta

Jan

2001

(14)

Hazelwood

2001

Csharp .NET

compliant

Leif Azzopardi

Univerity of Paisley, Scotland

Nov

2002

Common

Lisp

Steven M.

Haflich

Franz Inc

Mar

2002

Ruby

Ray Pereda

www.raypereda.com

Jan

2003

github link

Visual

Basic VB6

Navonil

Mustafee

Brunel University

Apr

2003

Delphi

Jo Rabin

Apr

2004

Javascript

‘Andargor’

www.andargor.com

Jul 2004

substantial

revisions by

Christopher

McKenzie

(See ‘Links’ below.)

Visual Basic

VB7; .NET

compliant

Christos

Attikos

University of Piraeus, Greece

Jan

2005

php

Richard Heyes

www.phpguru.org

Feb

2005

(15)

2005

Haskell

Dmitry

Antonyuk

Nov

2005

T-SQL

Keith Lubell

www.atelierdevitraux.com

May

2006

matlab

Juan Carlos

Lopez

California Pacific Medical Center

Research Institute

Sep

2006

Tcl

Aris

Theodorakos

NCSR Demokritos

Nov

2006

D

Daniel

Truemper

Humboldt-Universitaet zu

Berlin

May

2007

erlang

(1) erlang

(2)

Alden Dima

National Institute of Standards

and Technology,

Gaithersburg, MD USA

Sep

2007

REBOL

Dale K

Brearcliffe

Apr

2009

Scala

Ken Faulkner

May

2009

sas

Antoine

St-Pierre

Business Researchers, Inc

Apr

2010

(16)

plugin vim

script

Mitchell

Bowden

May

2010

github link

node.js

Jed Parsons

jedparsons.com

May

2011

github link

Google Go

Alex

Gonopolskiy

Oct

2011

github link

awk

Gregory

Grefenstette

3ds.com/exalead

Jul 2012

clojure

Yushi Wang

Mar

2013

bitbucket

link

All these encodings of the algorithm can be used free of charge for any purpose. Questions about the

algorithms should be directed to their authors, and not to Martin Porter (except when he is the author).

To test the programs out, here is a

sample vocabulary

(0.19 megabytes), and the corresponding

output

.

Email

any comments, suggestions, queries

Points of difference from the published algorithm

There is an extra rule in Step 2,

(m>0) logi → log

So archaeology is equated with archaeological etc.

The Step 2 rule

(17)

is replaced by

(m>0) bli → ble

So possibly is equated with possible etc.

The algorithm leaves alone strings of length 1 or 2. In any case a string of length 1 will be unchanged if

passed through the algorithm, but strings of length 2 might lose a final s, so as goes to a and is to i.

These differences may have been present in the program from which the published algorithm derived.

But at such a great distance from the original publication it is now difficult to say.

It must be emphasised that these differences are very small indeed compared to the variations that

have been observed in other encodings of the algorithm.

Status

The Porter stemmer should be regarded as ‘frozen’, that is, strictly defined, and not amenable to further

modification. As a stemmer, it is slightly inferior to the Snowball English or Porter2 stemmer, which

derives from it, and which is subjected to occasional improvements. For practical work, therefore, the

new Snowball stemmer is recommended. The Porter stemmer is appropriate to IR research work

involving stemming where the experiments need to be exactly repeatable.

Common errors

Historically, the following shortcomings have been found in other encodings of the stemming algorithm.

The algorithm clearly explains that when a set of rules of the type

(condition)S1 → S2

are presented together, only one rule is applied, the one with the longest matching suffix S1 for the

given word. This is true whether the rule succeeds or fails (i.e. whether or not S2 replaces S1). Despite

this, the rules are sometimes simply applied in turn until either one of them succeeds or the list runs

out.

This leads to small errors in various places, for example in the Step 4 rules

(m>1)ement →

(m>1)ment →

(m>1)ent →

(18)

to remove final ement, ment and ent.

Properly, argument stems to argument. The longest matching suffix is -ment. Then stem argu- has

measure m equal to 1 and so -ment will not be removed. End of Step 4. But if the three rules are applied

in turn, then for suffix -entthe stem argum- has measure m equal to 2, and -ent gets removed.

The more delicate rules are liable to misinterpretation. (Perhaps greater care was required in explaining

them.) So

((m>1) and (s or t))ion

is taken to mean

(m>1)(s or t)ion

The former means that taking off -ion leaves a stem with measure greater than 1 ending -s or -t; the

latter means that taking off -sion or -tion leaves a stem of measure greater than 1. A similar confusion

tends to arise in interpreting rule 5b, to reduce final double L to single L.

Occasionally cruder errors have been seen. For example the test for Y being consonant or vowel set up

the wrong way round.

It is interesting that although the published paper explains how to do the tests on the strings S1 by a

program switch on the last or last but one letter, many encodings fail to use this technique, making

them much slower than they need be.

FAQs (frequently asked questions)

#1. What is the licensing arrangement for this software?

This question has become very popular recently (the period 2008−2009), despite the clear statment

above that ‘‘all these encodings of the algorithm can be used free of charge for any purpose.’’ The

problem I think is that intellectual property has become such a major issue that some more formal

statement is expected. So to restate it:

The software is completely free for any purpose, unless notes at the head of the program text indicates

otherwise (which is rare). In any case, the notes about licensing are never more restrictive than the BSD

License.

In every case where the software is not written by me (Martin Porter), this licensing arrangement has

been endorsed by the contributor, and it is therefore unnecessary to ask the contributor again to

confirm it.

(19)