Session 09
Lecturer: Danang Junaedi
Given a collection of records (the training set):
◦ Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
◦ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
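A minimal sketch of this learn-and-apply workflow, assuming scikit-learn is available; the records mirror the Attrib1–3 toy data shown below, and the numeric encoding is an illustrative choice, not part of the slides.

```python
# Learn a model on a training split and measure accuracy on a test split.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# (Attrib1, Attrib2, Attrib3 in thousands, Class) -- the toy set from the slides
records = [
    ("Yes", "Large", 125, "No"),  ("No", "Medium", 100, "No"),
    ("No",  "Small",  70, "No"),  ("Yes", "Medium", 120, "No"),
    ("No",  "Large",  95, "Yes"), ("No",  "Medium",  60, "No"),
    ("Yes", "Large", 220, "No"),  ("No",  "Small",   85, "Yes"),
    ("No",  "Medium", 75, "No"),  ("No",  "Small",   90, "Yes"),
]
X_cat = [[r[0], r[1]] for r in records]   # categorical attributes
X_num = [[r[2]] for r in records]         # numeric attribute
y = [r[3] for r in records]               # class attribute

# Encode the categorical attributes as integers so the tree can split on them.
enc = OrdinalEncoder()
X = [list(c) + n for c, n in zip(enc.fit_transform(X_cat).tolist(), X_num)]

# Hold out part of the data as a test set to estimate accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier()                       # learn the model
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))  # apply the model
```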
(Figure: a model is learned from the training set and then applied to the test set.)

Training set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of classification tasks:
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification techniques:
Decision Tree based Methods
Rule-based Methods
Memory-based Reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Converting the data into a decision tree.
A decision tree is a chronological representation of the decision problem.
Each decision tree has two types of nodes: round nodes correspond to the states of nature, while square nodes correspond to the decision alternatives.
The branches leaving each round node represent the different states of nature, while the branches leaving each square node represent the different decision alternatives.
At the end of each limb of a tree are the payoffs attained from the series of branches making up that limb.
Examples of decision tree applications:
Diagnosing particular diseases, such as hypertension, cancer, stroke, and so on
Selecting products such as houses, vehicles, computers, and so on
Selecting exemplary employees according to particular criteria
Detecting problems in a computer or computer network, such as intrusion detection and virus detection (trojans and their variants)
Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES

An alternative model that fits the same training data, with MarSt at the root:
MarSt?
  Married          → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES
There could be more than one tree that fits the same data!
(Figure: the Learn Model / Apply Model workflow again, now with a Decision Tree as the learned model; training set Tid 1–10 and test set Tid 11–15 as shown earlier.)
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch matching each attribute value:
Refund = No → go to the MarSt node.
Marital Status = Married → go to the leaf node NO.
Assign Cheat to “No”
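A small sketch of this tree written as nested conditionals; the function name and record format are illustrative, while the thresholds and labels come from the tree above.

```python
# The decision tree above as nested conditionals (income given in thousands).
def classify_cheat(refund, marital_status, taxable_income):
    """Return the predicted Cheat label for one record."""
    if refund == "Yes":                  # root node: Refund
        return "No"
    if marital_status == "Married":      # internal node: MarSt
        return "No"
    # Single or Divorced: test Taxable Income
    return "Yes" if taxable_income > 80 else "No"

# The test record traced on the slides: Refund = No, Married, 80K.
print(classify_cheat("No", "Married", 80))   # -> "No"
```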
The data is expressed as a table with attributes and records.
An attribute is a parameter used as a criterion for building the tree. For example, to decide whether to play tennis, the criteria considered are the weather, the wind, and the temperature.
One of the attributes states the decision (solution) for each data item; it is called the target attribute.
An attribute has values called instances. For example, the weather attribute has the instances sunny, cloudy, and rainy.
1. Convert the data (table) into a tree model
◦ ID3 Algorithm
◦ C4.5 Algorithm
◦ etc.
2. Convert the tree model into rules (see the sketch after this list)
◦ Disjunction (∨, OR)
◦ Conjunction (∧, AND)
3. Simplify the rules (pruning)
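For step 2, a minimal sketch of rule extraction from the Cheat tree shown earlier: each root-to-leaf path yields one conjunctive (AND) rule, and the full rule set is their disjunction (OR). The nested-dict encoding and function name are illustrative.

```python
# Step 2 (tree -> rules) on the Cheat tree from the earlier slides.
tree = {
    "Refund": {
        "Yes": "No",
        "No": {
            "MarSt": {
                "Married": "No",
                "Single, Divorced": {
                    "TaxInc": {"< 80K": "No", "> 80K": "Yes"}
                },
            }
        },
    }
}

def extract_rules(node, conditions=()):
    """Recursively collect IF <conjunction> THEN <class> rules."""
    if isinstance(node, str):                 # leaf: emit one rule
        body = " AND ".join(conditions) or "TRUE"
        return [f"IF {body} THEN Cheat = {node}"]
    (attr, branches), = node.items()          # one attribute tested per node
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + (f"{attr} = {value}",))
    return rules

for rule in extract_rules(tree):
    print(rule)
```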
Given a set of examples S, categorised into categories c_i:
1. Choose the root node to be the attribute A which scores the highest for information gain relative to S.
2. For each value v that A can possibly take, draw a branch from the node.
3. For each branch from A corresponding to value v, calculate S_v. Then:
◦ If S_v is empty, choose the category c_default which contains the most examples from S, and put this as the leaf node category which ends that branch.
◦ If S_v contains only examples from a category c, then put c as the leaf node category which ends that branch.
◦ Otherwise, remove A from the set of attributes which can be put into nodes. Then put a new node in the decision tree, where the new attribute being tested in the node is the one which scores highest for information gain relative to S_v (note: not relative to S). This new node starts the cycle again (from 2), with S replaced by S_v in the calculations, and the tree is built recursively in this way.
The algorithm terminates either when all the attributes have been exhausted, or when the decision tree perfectly classifies the examples.
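A compact sketch of this procedure, under the assumption that all attributes are categorical and records are Python dicts; the function and variable names are chosen here for illustration and are not taken from the slides.

```python
# A minimal ID3 sketch for categorical attributes, following steps 1-3 above.
import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def info_gain(examples, attr, target):
    total, remainder = len(examples), 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:              # all examples share one category -> leaf
        return classes.pop()
    if not attributes:                 # attributes exhausted -> majority class
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Branch on the values observed in the examples; with a full value domain,
    # an empty S_v branch would instead get the majority class of S (c_default).
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree
```

Called on a play-tennis style table of dicts, e.g. `id3(records, ["Cuaca", "Angin", "Temperatur"], "Main")`, it returns a nested dictionary of the same shape used in the rule-extraction sketch above.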
Specify the problem: determine the attributes and the target attribute based on the available data.
Compute the entropy of each criterion from the given sample data.
Compute the information gain of each criterion.
The selected node is the criterion with the highest information gain.
Repeat until the final nodes, which contain the target attribute, are obtained.
Entropy(S) is the estimated number of bits needed to extract a class (+ or −) from a number of random data points in the sample space S.
Entropy can be seen as the number of bits needed to represent a class. The smaller the entropy value, the better the attribute is for extracting a class.
The optimal code length for a message with probability p is −log2 p bits.
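Written out, the two quantities used in the worked example below are the entropy of a sample S (with positive/negative proportions p_+ and p_−) and the information gain of splitting S on an attribute A:

```latex
\[
  \mathrm{Entropy}(S) = -\,p_{+}\log_{2}p_{+} \;-\; p_{-}\log_{2}p_{-}
\]
\[
  \mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
    \;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
\]
```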
Worked example: what is the resulting decision tree?
Sample attributes: Usia (age), Berat (weight), Jenis Kelamin (gender); target attribute: Hipertensi (hypertension).
Number of instances = 8
Number of positive instances = 3
Number of negative instances = 5
Entropy(S) = −P_positive · log2(P_positive) − P_negative · log2(P_negative)
           = −(3/8) · log2(3/8) − (5/8) · log2(5/8)
           = 0.531 + 0.424
           = 0.955

Number of instances = 8
Instances of Usia (age):
◦ Muda (young): positive instances = 1, negative instances = 3
◦ Tua (old): positive instances = 2, negative instances = 2
Entropy of Usia:
◦ Entropy(Muda) = −(1/4)·log2(1/4) − (3/4)·log2(3/4) = 0.811
◦ Entropy(Tua) = −(2/4)·log2(2/4) − (2/4)·log2(2/4) = 1
Gain(S, Usia) = Entropy(S) − (|S_Muda|/|S|)·Entropy(S_Muda) − (|S_Tua|/|S|)·Entropy(S_Tua)
             = 0.955 − (4/8)·0.811 − (4/8)·1
             = 0.955 − 0.406 − 0.5
             = 0.049

Number of instances = 8
Instances of Berat (weight):
◦ Overweight: positive instances = 3, negative instances = 1
◦ Average: positive instances = 0, negative instances = 2
◦ Underweight: positive instances = 0, negative instances = 2
Entropy of Berat:
◦ Entropy(Overweight) = −(3/4)·log2(3/4) − (1/4)·log2(1/4) = 0.811
◦ Entropy(Average) = 0
◦ Entropy(Underweight) = 0
Gain(S, Berat) = Entropy(S) − (4/8)·Entropy(Overweight) − (2/8)·Entropy(Average) − (2/8)·Entropy(Underweight)
              = 0.955 − (4/8)·0.811 − (2/8)·0 − (2/8)·0
              = 0.955 − 0.406
              = 0.549

Number of instances = 8
Instances of Jenis Kelamin (gender):
◦ Pria (male): positive instances = 2, negative instances = 4
◦ Wanita (female): positive instances = 1, negative instances = 1
Entropy of Jenis Kelamin:
◦ Entropy(Pria) = −(2/6)·log2(2/6) − (4/6)·log2(4/6) = 0.918
◦ Entropy(Wanita) = −(1/2)·log2(1/2) − (1/2)·log2(1/2) = 1
Gain(S, JenisKelamin) = Entropy(S) − (|S_Pria|/|S|)·Entropy(S_Pria) − (|S_Wanita|/|S|)·Entropy(S_Wanita)
                      = 0.955 − (6/8)·0.918 − (2/8)·1
                      = 0.955 − 0.689 − 0.25
                      = 0.016

The selected attribute is Berat (weight), because its information gain is the highest.
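As a check on the arithmetic, a short sketch that recomputes the root-level entropy and gains from the positive/negative counts listed above; the helper names are illustrative.

```python
# Recompute the root-level entropies and gains of the worked example.
import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for n in (pos, neg):
        if n:                                    # 0 * log2(0) is taken as 0
            result -= (n / total) * math.log2(n / total)
    return result

def gain(parent_counts, splits):
    total = sum(parent_counts)
    children = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(*parent_counts) - children

S = (3, 5)                                         # 3 positive, 5 negative
print(round(entropy(*S), 3))                       # 0.954 (~0.955)
print(round(gain(S, [(1, 3), (2, 2)]), 3))         # Usia: Muda, Tua            -> 0.049
print(round(gain(S, [(3, 1), (0, 2), (0, 2)]), 3)) # Berat: Over/Avg/Under      -> 0.549
print(round(gain(S, [(2, 4), (1, 1)]), 3))         # Jenis Kelamin: Pria/Wanita -> 0.016
```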
Number of instances for Overweight = 4
Number of instances for Average = 2
Number of instances for Underweight = 2
Compute the highest gain within each branch to decide the next split.
(Partial tree so far: root node Berat, with branches Overweight, Average, Underweight.)
Number of instances = 4 (the Overweight branch)
Instances of (Berat = Overweight) & Usia:
◦ Muda: positive instances = 1, negative instances = 0
◦ Tua: positive instances = 2, negative instances = 1
Instances of (Berat = Overweight) & Jenis Kelamin:
◦ Pria: positive instances = 2, negative instances = 1
◦ Wanita: positive instances = 1, negative instances = 0
Classification rules ????
Underfitting and Overfitting
◦ Underfitting: when the model is too simple, both training and test errors are large
◦ Overfitting: results in decision trees that are more complex than necessary
Missing Values
Costs of Classification
Pre-Pruning (Early Stopping Rule)
◦ Stop the algorithm before it becomes a fully-grown
tree
◦ Typical stopping conditions for a node:
Stop if all instances belong to the same class
Stop if all the attribute values are the same
◦ More restrictive conditions:
Stop if number of instances is less than some user-specified threshold
Stop if the class distribution of the instances is independent of the available features (e.g., using the χ² test)
Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain); a scikit-learn sketch of these stopping conditions follows.
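If scikit-learn is used, several of these stopping rules map onto constructor parameters of DecisionTreeClassifier; a minimal sketch, with the dataset and parameter values chosen purely for illustration:

```python
# Pre-pruning via scikit-learn's stopping parameters (values are illustrative).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(
    max_depth=3,                 # stop growing below a fixed depth
    min_samples_split=10,        # stop if a node has fewer than 10 instances
    min_impurity_decrease=0.01,  # stop if the split barely improves impurity
)
model.fit(X, y)
print(model.get_depth(), model.get_n_leaves())
```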
Post-pruning
◦ Grow decision tree to its entirety
◦ Trim the nodes of the decision tree in a bottom-up
fashion
◦ If generalization error improves after trimming, replace
sub-tree by a leaf node.
◦ Class label of leaf node is determined from majority
class of instances in the sub-tree
◦ Can use MDL for post-pruning
Example of post-pruning: a node split on attribute A into children A1, A2, A3, A4.

Before splitting (node kept as a leaf):
Class = Yes: 20, Class = No: 10 → training error = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30

After splitting into A1–A4:
A1: Class = Yes 8, Class = No 4
A2: Class = Yes 3, Class = No 4
A3: Class = Yes 4, Class = No 1
A4: Class = Yes 5, Class = No 1
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
Since splitting does not reduce the pessimistic error, PRUNE the sub-tree!
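The same comparison in a few lines of Python; the 0.5 penalty per leaf node is the assumption used in the example above.

```python
# Pessimistic-error comparison from the post-pruning example (0.5 per leaf).
def pessimistic_error(training_errors, num_leaves, num_records, penalty=0.5):
    return (training_errors + penalty * num_leaves) / num_records

before = pessimistic_error(10, 1, 30)  # node kept as a single leaf -> 10.5/30
after = pessimistic_error(9, 4, 30)    # node split into A1..A4     -> 11/30
print("prune" if after >= before else "keep split")   # -> "prune"
```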
Missing values affect decision tree construction in three different ways:
◦ They affect how impurity measures are computed
◦ They affect how instances with missing values are distributed to child nodes
◦ They affect how a test instance with a missing value is classified
Computing the impurity measure with a missing value (Tid 10 has Refund = ?):

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   ?       Single          90K             Yes

              Class = Yes   Class = No
Refund = Yes      0             3
Refund = No       2             4
Refund = ?        1             0

Before splitting: Entropy(Parent) = −0.3·log2(0.3) − 0.7·log2(0.7) = 0.8813
Split on Refund:
Entropy(Refund = Yes) = 0
Entropy(Refund = No) = −(2/6)·log2(2/6) − (4/6)·log2(4/6) = 0.9183
Entropy(Children) = 0.3·(0) + 0.6·(0.9183) = 0.551
Gain = 0.9 × (0.8813 − 0.551) = 0.297
Distributing an instance with a missing value:
The 9 records with a known Refund value are split first:
Refund = Yes child: Class = Yes 0, Class = No 3
Refund = No child:  Class = Yes 2, Class = No 4
Record Tid 10 (Refund = ?, Single, 90K, Class = Yes) is then distributed to both children.
Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9.
Assign the record to the Refund = Yes child with weight 3/9 and to the Refund = No child with weight 6/9, giving:
Refund = Yes child: Class = Yes 0 + 3/9, Class = No 3
Refund = No child:  Class = Yes 2 + 6/9, Class = No 4
Classifying a test instance with a missing value, using the decision tree built earlier (Refund → MarSt → TaxInc):

New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?

Weighted class distribution at the MarSt node:
            Married  Single  Divorced  Total
Class=No      3        1       0        4
Class=Yes     6/9      1       1        2.67
Total         3.67     2       1        6.67

Probability that Marital Status = Married is 3.67/6.67.
Probability that Marital Status = {Single, Divorced} is 3/6.67.
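A small sketch of how that test record ends up being classified under this scheme, combining the branch weights above with the leaf each branch reaches; the variable names and the final prediction line are illustrative.

```python
# Classifying Tid 11 (Refund=No, MarSt=?, TaxInc=85K) when MarSt is missing:
# send it down both MarSt branches with the fractional weights from the table.
weights = {"Married": 3.67 / 6.67, "Single,Divorced": 3.0 / 6.67}

# Leaf reached by each branch for this record (TaxInc = 85K > 80K -> YES).
leaf_class = {"Married": "No", "Single,Divorced": "Yes"}

prob = {"Yes": 0.0, "No": 0.0}
for branch, w in weights.items():
    prob[leaf_class[branch]] += w

print(prob)   # roughly {'Yes': 0.45, 'No': 0.55} -> predict Cheat = No
```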