ID3: Decision Tree Induction


Acronym: Iterative Dichotomiser 3, also read as Induction of Decision "3" (read: Tree).

Author: Ross Quinlan, since the late 1970s.

Further development: the precursor of the C4.5 algorithm, which followed in 1993.

Features: a fast learning phase; low time complexity; high classification accuracy.

Learning category: Concept Learning, whose goal is to describe "What general concept is being used?"

Goal of the algorithm: to obtain the best decision tree (one form of "Classification Model").

Problem: finding the best (minimal) decision tree that is consistent with a set of data falls into the NP-Hard / NP-Complete category of problems.

Construction mechanism:

The tree is built top-down, starting from the question: "Which attribute should be tested at the root of the decision tree?"

It is formed by partitioning the training examples.

Main strength of the algorithm: the information gain heuristic function used to select the best attribute.

Overview of the algorithm: it implements a greedy heuristic search, namely hill-climbing WITHOUT backtracking.

[Figure: inputs x1, x2, x3, ..., xn enter an UNKNOWN FUNCTION that produces y = f(x1, x2, x3, ..., xn)]


The ID3 Algorithm

PROCEDURE ID3 (Examples, Target_attribute, Attributes)

Examples are the training examples. Target_attribute is the attribute whose value is to be predicted by the tree. Attributes is a list of other attributes that may be tested by the learned decision tree.

Returns a decision tree that correctly classifies the given Examples.

Create a Root node for the tree
IF all Examples are positive, Return the single-node tree Root, with label = +
IF all Examples are negative, Return the single-node tree Root, with label = -
IF Attributes is empty, Return the single-node tree Root, with label = most common value of Target_attribute in Examples
Otherwise Begin
    A <--- the attribute from Attributes that best* classifies Examples
    The decision attribute for Root <--- A
    For each possible value, v_i, of A,
        - Add a new tree branch below Root, corresponding to the test A = v_i
        - Let Examples_vi be the subset of Examples that have value v_i for A
        - IF Examples_vi is empty
            * THEN below this new branch add a leaf node with label = most common value of Target_attribute in Examples
            * ELSE below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A})
End
Return Root

* The best attribute is the one with the highest information gain, as defined by the equation:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
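The procedure above can be sketched in Python as follows. This is only an illustrative sketch, not the original implementation: the function names, the representation of examples as dicts, the tuple-based tree, and the `values` argument (a mapping from each attribute to its list of possible values) are all assumptions made here. It also accepts any set of class labels rather than only + and -, and each recursive call receives only the examples of its own branch.

```python
import math
from collections import Counter


def entropy(examples, target):
    """Entropy of the target-attribute labels in a list of example dicts."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def gain(examples, attribute, target):
    """Information gain obtained by splitting the examples on one attribute."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder


def most_common(examples, target):
    """Most frequent target value among the examples."""
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]


def id3(examples, target, attributes, values):
    """Return a leaf label, or a tuple (attribute, {value: subtree})."""
    labels = {ex[target] for ex in examples}
    if len(labels) == 1:           # all examples share one class
        return labels.pop()
    if not attributes:             # no attribute left to test
        return most_common(examples, target)
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in values[best]:     # every possible value, observed or not
        subset = [ex for ex in examples if ex[best] == value]
        if not subset:             # empty branch: fall back to the majority class
            branches[value] = most_common(examples, target)
        else:
            branches[value] = id3(subset, target, remaining, values)
    return (best, branches)
```

Because `max` keeps the first attribute among equal gains, ties are broken by the order of `attributes`; ID3 itself does not prescribe a tie-breaking rule.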


Some Terms and an Example

14 Weeks of Tennis Games on Saturday Mornings

Examples (S) are the training examples shown in the table below:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

The target attribute is PlayTennis, which takes the value Yes or No.

The attributes are Outlook, Temperature, Humidity, and Wind.

Using the ID3 algorithm, derive the decision tree classification model for the decision "Play tennis or not?" from the 14 weeks of experience shown in the table above.


Solution

S is a collection of 14 examples with 9 positive and 5 negative examples, written as [9+,5-].

The entropy of S is:

$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

where c is the number of classes and p_i is the proportion of examples in S that belong to class i.

Entropy([9+,5-]) = - (9/14)log₂(9/14) - (5/14)log₂(5/14) = 0.94029
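The same calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `entropy` and its list-of-counts argument are choices made here for illustration.

```python
import math


def entropy(counts):
    """Entropy of a collection, given the number of examples in each class."""
    total = sum(counts)
    # Classes with zero examples contribute nothing (0 * log2(0) is taken as 0).
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


print(round(entropy([9, 5]), 5))  # 0.94029
```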

Notes (for a two-class collection such as this one):

Entropy(S) = 0 if all examples in S belong to the same class.

Entropy(S) = 1 if S contains equal numbers of positive and negative examples.

0 < Entropy(S) < 1 if the numbers of positive and negative examples in S are unequal (and both are nonzero).
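The two boundary cases can be verified directly from the definition, using the usual convention that 0 log2 0 = 0. The collection [4+,0-] is the Overcast subset of the example table; [7+,7-] is a hypothetical balanced collection chosen here only for illustration.

```latex
\begin{aligned}
\mathrm{Entropy}([4+,0-]) &= -\tfrac{4}{4}\log_2\tfrac{4}{4} - \tfrac{0}{4}\log_2\tfrac{0}{4} = 0 - 0 = 0 \\
\mathrm{Entropy}([7+,7-]) &= -\tfrac{7}{14}\log_2\tfrac{7}{14} - \tfrac{7}{14}\log_2\tfrac{7}{14} = \tfrac{1}{2} + \tfrac{1}{2} = 1
\end{aligned}
```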

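Gain(S,A) is the information gain of an attribute A relative to the collection of examples S:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where S_v is the subset of S for which attribute A has the value v.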


Values(Wind) = Weak, Strong

S_Weak = [6+,2-]

S_Strong = [3+,3-]

Gain(S,Wind) = Entropy(S) - (8/14)Entropy(S_Weak) - (6/14)Entropy(S_Strong)

= 0.94029 - (8/14)0.81128 - (6/14)1.00000

= 0.04813

Values(Humidity) = High, Normal

S_High = [3+,4-]

S_Normal = [6+,1-]

Gain(S,Humidity)= Entropy(S) - (7/14)Entropy(S_High) - (7/14)Entropy(S_Normal)

= 0.94029 - (7/14)0.98523 - (7/14)0.59167

= 0.15184

Values(Temperature) = Hot, Mild, Cool

S_Hot = [2+,2-]

S_Mild = [4+,2-]

S_Cool = [3+,1-]

Gain(S,Temperature) = Entropy(S) - (4/14)Entropy(S_Hot) -

(6/14)Entropy(S_Mild) - (4/14)Entropy(S_Cool)

= 0.94029 - (4/14)1.00000 - (6/14)0.91830 - (4/14)0.81128

= 0.02922

Values(Outlook)= Sunny, Overcast, Rain

S_Sunny = [2+,3-]

S_Overcast = [4+,0-]

S_Rain = [3+,2-]

Gain(S,Outlook)= Entropy(S) - (5/14)Entropy(S_Sunny) -

(4/14)Entropy(S_Overcast) - (5/14)Entropy(S_Rain)

= 0.94029 - (5/14)0.97095 - (4/14)0.00000 - (5/14)0.97095

= 0.24675

So the information gain of each of the 4 attributes is:

Gain(S,Wind) = 0.04813
Gain(S,Humidity) = 0.15184
Gain(S,Temperature) = 0.02922
Gain(S,Outlook) = 0.24675

Outlook has the highest gain, so it provides the best prediction of the target attribute PlayTennis and is tested at the root.
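The four gains can be recomputed directly from the training table. The sketch below is illustrative only: the names `COLUMNS`, `ROWS`, `S`, `entropy`, and `gain` are chosen here, and the rows are D1 to D14 in table order. Formatted to five decimals, the results should agree with the four gains listed above.

```python
import math
from collections import Counter

COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind", "PlayTennis")
ROWS = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
# Examples D1..D14 as dicts keyed by column name.
S = [dict(zip(COLUMNS, line.split())) for line in ROWS.strip().splitlines()]


def entropy(examples):
    """Entropy of the PlayTennis labels in a list of examples."""
    counts = Counter(ex["PlayTennis"] for ex in examples)
    return -sum((n / len(examples)) * math.log2(n / len(examples))
                for n in counts.values())


def gain(examples, attribute):
    """Information gain of splitting the examples on one attribute."""
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder


gains = {a: gain(S, a) for a in COLUMNS[:-1]}
best = max(gains, key=gains.get)
for attribute, value in gains.items():
    print(f"Gain(S,{attribute}) = {value:.5f}")
print("best attribute:", best)
```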


For the branch node Outlook=Sunny, S_Sunny = [D1, D2, D8, D9, D11]

D1   Sunny  Hot   High    Weak    No
D2   Sunny  Hot   High    Strong  No
D8   Sunny  Mild  High    Weak    No
D9   Sunny  Cool  Normal  Weak    Yes
D11  Sunny  Mild  Normal  Strong  Yes

Values(Temperature) = Hot, Mild, Cool

S_Hot = [0+,2-]

S_Mild = [1+,1-]

S_Cool = [1+,0-]

Gain(S_Sunny, Temperature)

= Entropy(S_Sunny) - (2/5)Entropy(S_Hot) - (2/5)Entropy(S_Mild) - (1/5)Entropy(S_Cool)

= 0.97095 - (2/5)0.00000 - (2/5)1.00000 - (1/5)0.00000

= 0.57095

[Figure: partial decision tree]

Outlook  [D1, D2, ... D14]  [9+,5-]
    Sunny     [D1, D2, D8, D9, D11]   [2+,3-]  ->  ?
    Overcast  [D3, D7, D12, D13]      [4+,0-]  ->  Yes
    Rain      [D4, D5, D6, D10, D14]  [3+,2-]  ->  ?


Values(Humidity) = High, Normal

S_High = [0+,3-]

S_Normal = [2+,0-]

Gain(S_Sunny, Humidity)

= Entropy(S_Sunny) - (3/5)Entropy(S_High) - (2/5)Entropy(S_Normal)

= 0.97095 - (3/5)0.00000 - (2/5)0.00000

= 0.97095

Values(Wind) = Weak, Strong

S_Weak = [1+,2-]

S_Strong = [1+,1-]

Gain(S_Sunny, Wind) = Entropy(S_Sunny) - (3/5)Entropy(S_Weak) - (2/5)Entropy(S_Strong)

= 0.97095 - (3/5)0.91830 - (2/5)1.00000

= 0.01997

The attribute Humidity provides the best prediction at this level.

[Figure: partial decision tree]

Outlook  [D1, D2, ... D14]  [9+,5-]
    Sunny     [D1, D2, D8, D9, D11]   [2+,3-]  ->  Humidity
        High    [D1, D2, D8]  [0+,3-]  ->  No
        Normal  [D9, D11]     [2+,0-]  ->  Yes
    Overcast  [D3, D7, D12, D13]      [4+,0-]  ->  Yes
    Rain      [D4, D5, D6, D10, D14]  [3+,2-]  ->  ?


For the branch node Outlook=Rain, S_Rain = [D4, D5, D6, D10, D14]

D4   Rain  Mild  High    Weak    Yes
D5   Rain  Cool  Normal  Weak    Yes
D6   Rain  Cool  Normal  Strong  No
D10  Rain  Mild  Normal  Weak    Yes
D14  Rain  Mild  High    Strong  No

Values(Temperature) = Mild, Cool

{Note: no example with Temperature=Hot remains at this point}

S_Mild = [2+,1-]

S_Cool = [1+,1-]

Gain(S_Rain, Temperature) = Entropy(S_Rain) - (3/5)Entropy(S_Mild) - (2/5)Entropy(S_Cool)

= 0.97095 - (3/5)0.91830 - (2/5)1.00000

= 0.01997

Values(Humidity) = High, Normal

S_High = [1+,1-]

S_Normal = [2+,1-]

Gain(S_Rain, Humidity) = Entropy(S_Rain) - (2/5)Entropy(S_High) - (3/5)Entropy(S_Normal)

= 0.97095 - (2/5)1.00000 - (3/5)0.91830

= 0.01997

Values(Wind) = Weak, Strong

S_Weak = [3+,0-]

S_Strong = [0+,2-]

Gain(S_Rain, Wind) = Entropy(S_Rain) - (3/5)Entropy(S_Weak) - (2/5)Entropy(S_Strong)

= 0.97095 - (3/5)0.00000 - (2/5)0.00000

= 0.97095

The attribute Wind provides the best prediction at this level.
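The second-level choices for the Outlook=Sunny and Outlook=Rain branches can be checked from the [positive, negative] counts alone. This is a small sketch under the assumption that each partition is written as a (positives, negatives) pair; the helper names `entropy` and `gain` and the two dictionaries are introduced here for illustration.

```python
import math


def entropy(pos, neg):
    """Entropy of a two-class collection with the given counts."""
    total = pos + neg
    return -sum((n / total) * math.log2(n / total) for n in (pos, neg) if n)


def gain(parent, partitions):
    """parent and every partition are (positives, negatives) count pairs."""
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(*parent) - remainder


# Counts copied from the Outlook=Sunny branch, S_Sunny = [2+,3-].
sunny = {
    "Temperature": [(0, 2), (1, 1), (1, 0)],   # Hot, Mild, Cool
    "Humidity": [(0, 3), (2, 0)],              # High, Normal
    "Wind": [(1, 2), (1, 1)],                  # Weak, Strong
}
# Counts copied from the Outlook=Rain branch, S_Rain = [3+,2-].
rain = {
    "Temperature": [(2, 1), (1, 1)],           # Mild, Cool
    "Humidity": [(1, 1), (2, 1)],              # High, Normal
    "Wind": [(3, 0), (0, 2)],                  # Weak, Strong
}

sunny_gains = {a: gain((2, 3), parts) for a, parts in sunny.items()}
rain_gains = {a: gain((3, 2), parts) for a, parts in rain.items()}
print(max(sunny_gains, key=sunny_gains.get))  # Humidity
print(max(rain_gains, key=rain_gains.get))    # Wind
```

When a split separates the classes perfectly, every partition has zero entropy, so the gain equals the entropy of the parent subset.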


Learned Rules:

IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No

IF Outlook = Sunny AND Humidity = Normal THEN PlayTennis = Yes

IF Outlook = Overcast THEN PlayTennis = Yes

IF Outlook = Rain AND Wind = Strong THEN PlayTennis = No

IF Outlook = Rain AND Wind = Weak THEN PlayTennis = Yes
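The five rules are equivalent to walking the learned tree from the root to a leaf. One possible encoding, assuming a nested tuple of the form (attribute, {value: subtree}) with plain strings as leaves, is sketched below; `TREE` and `classify` are names invented for this illustration.

```python
# The learned tree: internal nodes are (attribute, branches), leaves are labels.
TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})


def classify(tree, example):
    """Follow the branch matching the example until a leaf label is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


d1 = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(TREE, d1))  # No
```

Temperature never appears in the tree, so its value has no effect on the prediction.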

[Figure: final decision tree]

Outlook  [D1, D2, ... D14]  [9+,5-]
    Sunny     [D1, D2, D8, D9, D11]   [2+,3-]  ->  Humidity
        High    [D1, D2, D8]  [0+,3-]  ->  No
        Normal  [D9, D11]     [2+,0-]  ->  Yes
    Overcast  [D3, D7, D12, D13]      [4+,0-]  ->  Yes
    Rain      [D4, D5, D6, D10, D14]  [3+,2-]  ->  Wind
        Weak    [D4, D5, D10]  [3+,0-]  ->  Yes
        Strong  [D6, D14]      [0+,2-]  ->  No


Case Study

The examination committee of a campus meets to discuss the exam results of a number of its students.

There are 3 (three) possible evaluation outcomes; a student can:

pass (P=Pass);

be given the chance to resit (R=Resit); or

fail (F=Fail).

The meetings held to decide these outcomes often take a long time. They also often require expert education advisers who have broad experience from many similar decisions.

These experts were asked to formulate a set of guidelines, and they then compiled a collection of examples from various decision cases.

The target attribute is the evaluation result (Pass, Resit, or Fail), while the attributes are:

NFails : Number of exams failed

NMarg : Number of exams failed with a mark on the pass / fail borderline

Att : The student's attendance record

Ext : Whether or not there are extenuating circumstances, for example an illness that caused an unintended failure

Ant : The anticipated result

The decision tree was then induced. After examining the resulting decision model further, the experts decided to add several more examples to the case set, because they felt that the rules covering around 2 or 3 failed exams were not yet adequate. They also decided to modify example number 8.


Initial table of examples:

Example  NFails  NMarg  Att   Ext  Ant  Result
1        0       0      good  no   P    P
2        0       0      poor  yes  F    P
3        0       0      good  yes  F    P
4        3       0      good  no   F    F
5        3       1      poor  no   F    F
6        3       0      good  no   P    F
7        3       2      good  yes  P    R
8        2       1      poor  no   F    R
9        2       2      good  yes  P    R
10       1       0      poor  yes  P    R
11       1       1      good  yes  F    R
12       1       1      good  no   F    R
13       1       0      poor  no   F    F

The additions and the modification are as follows:

Example  NFails  NMarg  Att   Ext  Ant  Result
8        2       1      poor  no   F    F
14       3       2      good  no   P    F
15       2       2      good  no   F    R
16       2       1      good  yes  P    R
17       2       0      poor  no   F    F
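Before the tree is induced again, the revision has to be merged into the example set. A minimal sketch of that step is shown below, assuming the examples are kept in dictionaries keyed by example number; the names `FIELDS`, `initial`, `changes`, `revised`, and `class_counts` are hypothetical and chosen here for illustration.

```python
from collections import Counter

FIELDS = ("NFails", "NMarg", "Att", "Ext", "Ant", "Result")

# Initial examples, keyed by example number (copied from the first table).
initial = {
    1: (0, 0, "good", "no", "P", "P"),
    2: (0, 0, "poor", "yes", "F", "P"),
    3: (0, 0, "good", "yes", "F", "P"),
    4: (3, 0, "good", "no", "F", "F"),
    5: (3, 1, "poor", "no", "F", "F"),
    6: (3, 0, "good", "no", "P", "F"),
    7: (3, 2, "good", "yes", "P", "R"),
    8: (2, 1, "poor", "no", "F", "R"),
    9: (2, 2, "good", "yes", "P", "R"),
    10: (1, 0, "poor", "yes", "P", "R"),
    11: (1, 1, "good", "yes", "F", "R"),
    12: (1, 1, "good", "no", "F", "R"),
    13: (1, 0, "poor", "no", "F", "F"),
}

# Modified example 8 plus the new examples 14..17 (copied from the second table).
changes = {
    8: (2, 1, "poor", "no", "F", "F"),
    14: (3, 2, "good", "no", "P", "F"),
    15: (2, 2, "good", "no", "F", "R"),
    16: (2, 1, "good", "yes", "P", "R"),
    17: (2, 0, "poor", "no", "F", "F"),
}

# Later entries win, so example 8 is overwritten and 14..17 are added.
revised = {**initial, **changes}
examples = [dict(zip(FIELDS, revised[n])) for n in sorted(revised)]
class_counts = Counter(ex["Result"] for ex in examples)
print(len(examples), dict(class_counts))
```

The resulting `examples` list has the same dict-per-example shape that an ID3 implementation would typically take as its training input, with Result as the target attribute.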
