Session 09
Lecturer: Danang Junaedi
Given a collection of records (the training set):
◦ Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
◦ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
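A minimal sketch of this learn-and-apply workflow, assuming scikit-learn is available; the records mirror the Attrib1–3 toy data shown below, and the numeric encoding is an illustrative choice, not part of the slides.

```python
# Learn a model on a training split and measure accuracy on a test split.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# (Attrib1, Attrib2, Attrib3 in thousands, Class) -- the toy set from the slides
records = [
    ("Yes", "Large", 125, "No"),  ("No", "Medium", 100, "No"),
    ("No",  "Small",  70, "No"),  ("Yes", "Medium", 120, "No"),
    ("No",  "Large",  95, "Yes"), ("No",  "Medium",  60, "No"),
    ("Yes", "Large", 220, "No"),  ("No",  "Small",   85, "Yes"),
    ("No",  "Medium", 75, "No"),  ("No",  "Small",   90, "Yes"),
]
X_cat = [[r[0], r[1]] for r in records]   # categorical attributes
X_num = [[r[2]] for r in records]         # numeric attribute
y = [r[3] for r in records]               # class attribute

# Encode the categorical attributes as integers so the tree can split on them.
enc = OrdinalEncoder()
X = [list(c) + n for c, n in zip(enc.fit_transform(X_cat).tolist(), X_num)]

# Hold out part of the data as a test set to estimate accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier()                       # learn the model
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))  # apply the model
```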
(Figure: a model is learned from the training set and then applied to the test set.)

Training set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of classification tasks:
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification techniques:
Decision Tree based Methods
Rule-based Methods
Memory-based Reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Converting the data into a decision tree.
A decision tree is a chronological representation of the decision problem.
Each decision tree has two types of nodes: round nodes correspond to the states of nature, while square nodes correspond to the decision alternatives.
The branches leaving each round node represent the different states of nature, while the branches leaving each square node represent the different decision alternatives.
At the end of each limb of a tree are the payoffs attained from the series of branches making up that limb.
Examples of decision tree applications:
Diagnosing particular diseases, such as hypertension, cancer, stroke, and so on
Selecting products such as houses, vehicles, computers, and so on
Selecting exemplary employees according to particular criteria
Detecting problems in a computer or computer network, such as intrusion detection and virus detection (trojans and their variants)
Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES

An alternative model that fits the same training data, with MarSt at the root:
MarSt?
  Married          → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES
There could be more than one tree that fits the same data!
(Figure: the Learn Model / Apply Model workflow again, now with a Decision Tree as the learned model; training set Tid 1–10 and test set Tid 11–15 as shown earlier.)
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch matching each attribute value:
Refund = No → go to the MarSt node.
Marital Status = Married → go to the leaf node NO.
Assign Cheat to “No”
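A small sketch of this tree written as nested conditionals; the function name and record format are illustrative, while the thresholds and labels come from the tree above.

```python
# The decision tree above as nested conditionals (income given in thousands).
def classify_cheat(refund, marital_status, taxable_income):
    """Return the predicted Cheat label for one record."""
    if refund == "Yes":                  # root node: Refund
        return "No"
    if marital_status == "Married":      # internal node: MarSt
        return "No"
    # Single or Divorced: test Taxable Income
    return "Yes" if taxable_income > 80 else "No"

# The test record traced on the slides: Refund = No, Married, 80K.
print(classify_cheat("No", "Married", 80))   # -> "No"
```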
The data is expressed as a table with attributes and records.
An attribute is a parameter used as a criterion for building the tree. For example, to decide whether to play tennis, the criteria considered are the weather, the wind, and the temperature.
One of the attributes states the decision (solution) for each data item; it is called the target attribute.
An attribute has values called instances. For example, the weather attribute has the instances sunny, cloudy, and rainy.
1. Convert the data (table) into a tree model
◦ ID3 Algorithm
◦ C4.5 Algorithm
◦ etc.
2. Convert the tree model into rules (see the sketch after this list)
◦ Disjunction (∨, OR)
◦ Conjunction (∧, AND)
3. Simplify the rules (pruning)
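For step 2, a minimal sketch of rule extraction from the Cheat tree shown earlier: each root-to-leaf path yields one conjunctive (AND) rule, and the full rule set is their disjunction (OR). The nested-dict encoding and function name are illustrative.

```python
# Step 2 (tree -> rules) on the Cheat tree from the earlier slides.
tree = {
    "Refund": {
        "Yes": "No",
        "No": {
            "MarSt": {
                "Married": "No",
                "Single, Divorced": {
                    "TaxInc": {"< 80K": "No", "> 80K": "Yes"}
                },
            }
        },
    }
}

def extract_rules(node, conditions=()):
    """Recursively collect IF <conjunction> THEN <class> rules."""
    if isinstance(node, str):                 # leaf: emit one rule
        body = " AND ".join(conditions) or "TRUE"
        return [f"IF {body} THEN Cheat = {node}"]
    (attr, branches), = node.items()          # one attribute tested per node
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + (f"{attr} = {value}",))
    return rules

for rule in extract_rules(tree):
    print(rule)
```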
Given a set of examples S, categorised into categories c_i:
1. Choose the root node to be the attribute A which scores the highest for information gain relative to S.
2. For each value v that A can possibly take, draw a branch from the node.
3. For each branch from A corresponding to value v, calculate S_v. Then:
◦ If S_v is empty, choose the category c_default which contains the most examples from S, and put this as the leaf node category which ends that branch.
◦ If S_v contains only examples from a category c, then put c as the leaf node category which ends that branch.
◦ Otherwise, remove A from the set of attributes which can be put into nodes. Then put a new node in the decision tree, where the new attribute being tested in the node is the one which scores highest for information gain relative to S_v (note: not relative to S). This new node starts the cycle again (from 2), with S replaced by S_v in the calculations, and the tree is built recursively in this way.
The algorithm terminates either when all the attributes have been exhausted, or when the decision tree perfectly classifies the examples.
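A compact sketch of this procedure, under the assumption that all attributes are categorical and records are Python dicts; the function and variable names are chosen here for illustration and are not taken from the slides.

```python
# A minimal ID3 sketch for categorical attributes, following steps 1-3 above.
import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def info_gain(examples, attr, target):
    total, remainder = len(examples), 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:              # all examples share one category -> leaf
        return classes.pop()
    if not attributes:                 # attributes exhausted -> majority class
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Branch on the values observed in the examples; with a full value domain,
    # an empty S_v branch would instead get the majority class of S (c_default).
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree
```

Called on a play-tennis style table of dicts, e.g. `id3(records, ["Cuaca", "Angin", "Temperatur"], "Main")`, it returns a nested dictionary of the same shape used in the rule-extraction sketch above.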
Specify the problem: determine the attributes and the target attribute based on the available data.
Compute the entropy of each criterion from the given sample data.
Compute the information gain of each criterion.
The selected node is the criterion with the highest information gain.
Repeat until the final nodes, which contain the target attribute, are obtained.
Entropy(S) is the estimated number of bits needed to extract a class (+ or −) from a number of random data points in the sample space S.
Entropy can be seen as the number of bits needed to represent a class. The smaller the entropy value, the better the attribute is for extracting a class.
The optimal code length for a message with probability p is −log2 p bits.
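Written out, the two quantities used in the worked example below are the entropy of a sample S (with positive/negative proportions p_+ and p_−) and the information gain of splitting S on an attribute A:

```latex
\[
  \mathrm{Entropy}(S) = -\,p_{+}\log_{2}p_{+} \;-\; p_{-}\log_{2}p_{-}
\]
\[
  \mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
    \;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
\]
```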
Worked example: what is the resulting decision tree?
Sample attributes: Usia (age), Berat (weight), Jenis Kelamin (gender); target attribute: Hipertensi (hypertension).
Number of instances = 8
Number of positive instances = 3
Number of negative instances = 5
Entropy(S) = −P_positive · log2(P_positive) − P_negative · log2(P_negative)
           = −(3/8) · log2(3/8) − (5/8) · log2(5/8)
           = 0.531 + 0.424
           = 0.955

Number of instances = 8
Instances of Usia (age):
◦ Muda (young): positive instances = 1, negative instances = 3
◦ Tua (old): positive instances = 2, negative instances = 2
Entropy of Usia:
◦ Entropy(Muda) = −(1/4)·log2(1/4) − (3/4)·log2(3/4) = 0.811
◦ Entropy(Tua) = −(2/4)·log2(2/4) − (2/4)·log2(2/4) = 1
Gain(S, Usia) = Entropy(S) − (|S_Muda|/|S|)·Entropy(S_Muda) − (|S_Tua|/|S|)·Entropy(S_Tua)
             = 0.955 − (4/8)·0.811 − (4/8)·1
             = 0.955 − 0.406 − 0.5
             = 0.049

Number of instances = 8
Instances of Berat (weight):
◦ Overweight: positive instances = 3, negative instances = 1
◦ Average: positive instances = 0, negative instances = 2
◦ Underweight: positive instances = 0, negative instances = 2
Entropy of Berat:
◦ Entropy(Overweight) = −(3/4)·log2(3/4) − (1/4)·log2(1/4) = 0.811
◦ Entropy(Average) = 0
◦ Entropy(Underweight) = 0
Gain(S, Berat) = Entropy(S) − (4/8)·Entropy(Overweight) − (2/8)·Entropy(Average) − (2/8)·Entropy(Underweight)
              = 0.955 − (4/8)·0.811 − (2/8)·0 − (2/8)·0
              = 0.955 − 0.406
              = 0.549

Number of instances = 8
Instances of Jenis Kelamin (gender):
◦ Pria (male): positive instances = 2, negative instances = 4
◦ Wanita (female): positive instances = 1, negative instances = 1
Entropy of Jenis Kelamin:
◦ Entropy(Pria) = −(2/6)·log2(2/6) − (4/6)·log2(4/6) = 0.918
◦ Entropy(Wanita) = −(1/2)·log2(1/2) − (1/2)·log2(1/2) = 1
Gain(S, JenisKelamin) = Entropy(S) − (|S_Pria|/|S|)·Entropy(S_Pria) − (|S_Wanita|/|S|)·Entropy(S_Wanita)
                      = 0.955 − (6/8)·0.918 − (2/8)·1
                      = 0.955 − 0.689 − 0.25
                      = 0.016

The selected attribute is Berat (weight), because its information gain is the highest.
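As a check on the arithmetic, a short sketch that recomputes the root-level entropy and gains from the positive/negative counts listed above; the helper names are illustrative.

```python
# Recompute the root-level entropies and gains of the worked example.
import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for n in (pos, neg):
        if n:                                    # 0 * log2(0) is taken as 0
            result -= (n / total) * math.log2(n / total)
    return result

def gain(parent_counts, splits):
    total = sum(parent_counts)
    children = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(*parent_counts) - children

S = (3, 5)                                         # 3 positive, 5 negative
print(round(entropy(*S), 3))                       # 0.954 (~0.955)
print(round(gain(S, [(1, 3), (2, 2)]), 3))         # Usia: Muda, Tua            -> 0.049
print(round(gain(S, [(3, 1), (0, 2), (0, 2)]), 3)) # Berat: Over/Avg/Under      -> 0.549
print(round(gain(S, [(2, 4), (1, 1)]), 3))         # Jenis Kelamin: Pria/Wanita -> 0.016
```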
Number of instances for Overweight = 4
Number of instances for Average = 2
Number of instances for Underweight = 2
Compute the highest gain within each branch to decide the next split.
(Partial tree so far: root node Berat, with branches Overweight, Average, Underweight.)
Number of instances = 4 (the Overweight branch)
Instances of (Berat = Overweight) & Usia:
◦ Muda: positive instances = 1, negative instances = 0
◦ Tua: positive instances = 2, negative instances = 1
Instances of (Berat = Overweight) & Jenis Kelamin:
◦ Pria: positive instances = 2, negative instances = 1
◦ Wanita: positive instances = 1, negative instances = 0
Classification rules ????
Underfitting and Overfitting
◦ Underfitting: when the model is too simple, both training and test errors are large
◦ Overfitting: results in decision trees that are more complex than necessary
Missing Values
Costs of Classification
Pre-Pruning (Early Stopping Rule)
◦ Stop the algorithm before it becomes a fully-grown
tree
◦ Typical stopping conditions for a node:
Stop if all instances belong to the same class
Stop if all the attribute values are the same
◦ More restrictive conditions:
Stop if number of instances is less than some user-specified threshold
Stop if the class distribution of the instances is independent of the available features (e.g., using the χ² test)
Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain); a scikit-learn sketch of these stopping conditions follows.
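If scikit-learn is used, several of these stopping rules map onto constructor parameters of DecisionTreeClassifier; a minimal sketch, with the dataset and parameter values chosen purely for illustration:

```python
# Pre-pruning via scikit-learn's stopping parameters (values are illustrative).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(
    max_depth=3,                 # stop growing below a fixed depth
    min_samples_split=10,        # stop if a node has fewer than 10 instances
    min_impurity_decrease=0.01,  # stop if the split barely improves impurity
)
model.fit(X, y)
print(model.get_depth(), model.get_n_leaves())
```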
Post-pruning
◦ Grow decision tree to its entirety
◦ Trim the nodes of the decision tree in a bottom-up
fashion
◦ If generalization error improves after trimming, replace
sub-tree by a leaf node.
◦ Class label of leaf node is determined from majority
class of instances in the sub-tree
◦ Can use MDL for post-pruning
Example of post-pruning: a node split on attribute A into children A1, A2, A3, A4.

Before splitting (node kept as a leaf):
Class = Yes: 20, Class = No: 10 → training error = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30

After splitting into A1–A4:
A1: Class = Yes 8, Class = No 4
A2: Class = Yes 3, Class = No 4
A3: Class = Yes 4, Class = No 1
A4: Class = Yes 5, Class = No 1
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
Since splitting does not reduce the pessimistic error, PRUNE the sub-tree!
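The same comparison in a few lines of Python; the 0.5 penalty per leaf node is the assumption used in the example above.

```python
# Pessimistic-error comparison from the post-pruning example (0.5 per leaf).
def pessimistic_error(training_errors, num_leaves, num_records, penalty=0.5):
    return (training_errors + penalty * num_leaves) / num_records

before = pessimistic_error(10, 1, 30)  # node kept as a single leaf -> 10.5/30
after = pessimistic_error(9, 4, 30)    # node split into A1..A4     -> 11/30
print("prune" if after >= before else "keep split")   # -> "prune"
```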
Missing values affect decision tree construction in three different ways:
◦ They affect how impurity measures are computed
◦ They affect how instances with missing values are distributed to child nodes
◦ They affect how a test instance with a missing value is classified
Computing the impurity measure with a missing value (Tid 10 has Refund = ?):

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   ?       Single          90K             Yes

              Class = Yes   Class = No
Refund = Yes      0             3
Refund = No       2             4
Refund = ?        1             0

Before splitting: Entropy(Parent) = −0.3·log2(0.3) − 0.7·log2(0.7) = 0.8813
Split on Refund:
Entropy(Refund = Yes) = 0
Entropy(Refund = No) = −(2/6)·log2(2/6) − (4/6)·log2(4/6) = 0.9183
Entropy(Children) = 0.3·(0) + 0.6·(0.9183) = 0.551
Gain = 0.9 × (0.8813 − 0.551) = 0.297
Distributing an instance with a missing value:
The 9 records with a known Refund value are split first:
Refund = Yes child: Class = Yes 0, Class = No 3
Refund = No child:  Class = Yes 2, Class = No 4
Record Tid 10 (Refund = ?, Single, 90K, Class = Yes) is then distributed to both children.
Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9.
Assign the record to the Refund = Yes child with weight 3/9 and to the Refund = No child with weight 6/9, giving:
Refund = Yes child: Class = Yes 0 + 3/9, Class = No 3
Refund = No child:  Class = Yes 2 + 6/9, Class = No 4
Classifying a test instance with a missing value, using the decision tree built earlier (Refund → MarSt → TaxInc):

New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?

Weighted class distribution at the MarSt node:
            Married  Single  Divorced  Total
Class=No      3        1       0        4
Class=Yes     6/9      1       1        2.67
Total         3.67     2       1        6.67

Probability that Marital Status = Married is 3.67/6.67.
Probability that Marital Status = {Single, Divorced} is 3/6.67.
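A small sketch of how that test record ends up being classified under this scheme, combining the branch weights above with the leaf each branch reaches; the variable names and the final prediction line are illustrative.

```python
# Classifying Tid 11 (Refund=No, MarSt=?, TaxInc=85K) when MarSt is missing:
# send it down both MarSt branches with the fractional weights from the table.
weights = {"Married": 3.67 / 6.67, "Single,Divorced": 3.0 / 6.67}

# Leaf reached by each branch for this record (TaxInc = 85K > 80K -> YES).
leaf_class = {"Married": "No", "Single,Divorced": "Yes"}

prob = {"Yes": 0.0, "No": 0.0}
for branch, w in weights.items():
    prob[leaf_class[branch]] += w

print(prob)   # roughly {'Yes': 0.45, 'No': 0.55} -> predict Cheat = No
```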