Association Rule Dengan FP-Tree dan FP Growth

(1)

Association Rule

Dengan FP-Tree dan FP Growth

Jesa Ariawan, MTI

(2)

Latar Belakang

• Algoritma Association Rule dengan Apriori kurang baik bila

terdapat banyak pola kombinasi data yang sering muncul (banyak

frequent pattern), banyak jenis item tetapi pemenuhan minimum

support rendah.

• Algoritma Association Rule dengan Apriori membutuhkan waktu

yang cukup lama karena scanning database dapat dilakukan

berulang-ulang untuk mendapatkan frequent pattern yang ideal.

• FP-Tree (Frequent Pattern – Tree) merupakan suatu algoritma

yang dirancang untuk mengatasi kendala bottleneck pada proses

penggalian data dengan algoritma Apriori (Zhao et al. 2003).

(3)

Ide gagasan FP-Tree

• Pemampatan data dengan model struktur data pohon untuk

menghindari pengulangan scanning database.

• Tanpa memerlukan candidate generation

• Dilanjutkan dengan proses algoritma FP-growth yang dapat

langsung mengekstrak frequent Itemset dari FP tree yang telah

terbentuk dengan menggunakan prinsip divide and conquer.

(4)

Definisi Frequent Pattern – Tree (FP – Tree)

• Terdiri atas sebuah root dengan label ‘null’, sekumpulan subtree

yang menjadi child dari root dan sebuah tabel frequent header.

• Setiap node dalam FP-Tree mengandung tiga informasi penting.

yaitu

label

item,

menginformasikan

jenis

item

yang

direpresentasikan node tersebut, support count, merepresentasikan

jumlah lintasan transaksi yang melalui node tesebut, dan pointer

penghubung yang menghubungkan node-node dengan label item

sama antar-lintasan, ditandai dengan garis panah putus-putus.

• Isi dari setiap baris dalam tabel frequent header terdiri atas label

item dan head of nodelink yang menunjuk ke node pertama dalam

FP-Tree yang menyimpan label item tersebut.

(5)

Proses untuk mendapatkan Frequent Itemset

dilakukan dalam 2 tahap

• Pembentukan Frequent Pattern – Tree (FP-Tree)

• Ekstrak Frequent Itemset hasil dari FP-Tree dengan

menggunakan algoritma FP-Growth

(6)

Algoritma FP Growth

dibagi dalam 3 tugas utama

1. Tahap Pembangkitan Conditional Pattern Base

Conditional Pattern Base merupakan subdatabase yang berisi prefix path

(lintasan prefix) dan suffix pattern (pola akhiran). Pembangkitan conditional

pattern base didapatkan melalui FP-tree yang telah dibangun sebelumnya.

2. Tahap Pembangkitan Conditional FP-tree

Pada tahap ini, support count dari setiap item pada setiap conditional pattern

base dijumlahkan, lalu setiap item yang memiliki jumlah support count lebih

besar sama dengan minimum support count akan dibangkitkan dengan

conditional FP-tree.

3. Tahap Pencarian frequent itemset

Apabila Conditional FP-tree merupakan lintasan tunggal (single path), maka

didapatkan frequent itemset dengan melakukan kombinasi item untuk setiap

conditional FP-tree. Jika bukan lintasan tunggal, maka dilakukan pembangkitan

(7)

Contoh Proses Asosiasi

dengan FP-Tree dan FP-Growth

• Contoh Tabel Transaksai dengan minimum support count 2

No.

Transaksi

1 a,b

2 b,c,d,g,h

3 a,c,d,e,f

4 a,d,e

5 a,b,z,c

6 a,b,c,d

7 a,r

8 a,b,c

9 a,b,d

10 b,c,e

(8)

Contoh Proses Asosiasi

dengan FP-Tree dan FP-Growth

• Frekuensi kemunculan tiap item dapat dilihat pada tabel berikut

Item

Frekuensi

a

8 b

7 c

6 d

5 e

3 f

1 r

1 z

1 g

1 h

1

(9)

Contoh Proses Asosiasi

dengan FP-Tree dan FP-Growth

• Kemunculan item yang frequent dalam setiap transaksi, diurut berdasarkan yang

frekuensinya paling tinggi.

TID

Item

1 {a,b}

2 {b,c,d}

3 {a,c,d,e}

4 {a,d,e}

5 {a,b,c}

6 {a,b,c,d}

7 {a}

8 {a,b,c}

9 {a,b,d}

10 {b,c,e}

(10)

Frequent Pattern Growth (FP-Growth) Algorithm

An Introduction

Florian Verhein

fverhein@it.usyd.edu.au

School of Information Technologies, The University of Sydney,

Australia

January 10, 2008

Outline

Introduction

FP-Tree data structure

Step 1: FP-Tree Construction

Step 2: Frequent Itemset Generation Discussion

(11)

Introduction

I Apriori: uses a generate-and-test approach generates

candidate itemsets and tests if they are frequent

I Generation of candidate itemsets is expensive (in both space

and time)

I Support counting is expensive

I Subset checking (computationally expensive)

I Multiple Database scans (I/O)

I FP-Growth: allows frequent itemset discovery without

candidate itemset generation. Two step approach:

I Step 1: Build a compact data structure called the FP-tree

I Built using 2 passes over the data-set.

I Step 2: Extracts frequent itemsets directly from the FP-tree

I Traversal through FP-Tree

Core Data Structure: FP-Tree

I Nodes correspond to items and have a

counter

I FP-Growth reads 1 transaction at a

time and maps it to a path

I Fixed order is used, so paths can

overlap when transactions share items (when they have the same prex ).

I In this case, counters are incremented

I Pointers are maintained between

nodes containing the same item, creating singly linked lists (dotted lines)

I The more paths that overlap, the

higher the compression. FP-tree may t in memory.

I Frequent itemsets extracted from the

(12)

Step 1: FP-Tree Construction (Example)

FP-Tree is constructed using 2 passes over the data-set:

I Pass 1:

I Scan data and nd support for each item. I Discard infrequent items.

I Sort frequent items in decreasing order based on their support.

I For our example: a, b, c, d, e

I Use this order when building the FP-Tree, so common prexes

can be shared.

Step 1: FP-Tree Construction (Example)

I Pass 2: construct the FP-Tree (see diagram on next slide) I Read transaction 1: {a, b}

I Create 2 nodes a and b and the path null → a → b. Set

counts of a and b to 1. I Read transaction 2: {b, c, d}

I Create 3 nodes for b, c and d and the path

null → b → c → d. Set counts to 1.

I Note that although transaction 1 and 2 share b, the paths are

disjoint as they don't share a common prex. Add the link between the b's.

I Read transaction 3: {a, c, d, e}

I It shares common prex item a with transaction 1 so the path

for transaction 1 and 3 will overlap and the frequency count for node a will be incremented by 1. Add links between the c's and d's.

I Continue until all transactions are mapped to a path in the

(13)

Step 1: FP-Tree Construction (Example)

FP-Tree size

I The FP-Tree usually has a smaller size than the uncompressed

data typically many transactions share items (and hence prexes).

I Best case scenario: all transactions contain the same set of

items.

I 1 path in the FP-tree

I Worst case scenario: every transaction has a unique set of

items (no items in common)

I Size of the FP-tree is at least as large as the original data.

I Storage requirements for the FP-tree are higher need to

store the pointers between the nodes and the counters.

I The size of the FP-tree depends on how the items are ordered

I Ordering by decreasing support is typically used but it does

(14)

Step 2: Frequent Itemset Generation

I FP-Growth extracts frequent itemsets from the FP-tree. I Bottom-up algorithm from the leaves towards the root

I Divide and conquer: rst look for frequent itemsets ending in

e, then de, etc. . . then d, then cd, etc. . .

I First, extract prex path sub-trees ending in an item(set). (hint: use

the linked lists)

↑ Complete FP-tree

→ Example: prex path sub-trees

Step 2: Frequent Itemset Generation

I Each prex path sub-tree is processed recursively to extract

the frequent itemsets. Solutions are then merged.

I E.g. the prex path sub-tree for e will be used to extract

frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, cde, etc.

I Divide and conquer approach

(15)

Example

Let minSup = 2 and extract all frequent itemsets containing e.

I 1. Obtain the prex path sub-tree for e:

I 2. Check if e is a frequent item by adding the counts along the

linked list (dotted line). If so, extract it.

I Yes, count =3 so {e} is extracted as a frequent itemset.

I 3. As e is frequent, nd frequent itemsets ending in e. i.e. de,

ce, be and ae.

I i.e. decompose the problem recursively.

I To do this, we must rst to obtain the conditional FP-tree for

e.

Conditional FP-Tree

I The FP-Tree that would be built if we only consider

transactions containing a particular itemset (and then removing that itemset from all transactions).

(16)

Conditional FP-Tree

To obtain the conditional FP-tree for e from the prex sub-tree ending in e:

I Update the support counts along the prex paths (from e) to

reect the number of transactions containing e.

I b and c should be set to 1 and a to 2.

Conditional FP-Tree

I Remove the nodes containing e information about node e is

(17)

Conditional FP-Tree

I Remove infrequent items (nodes) from the prex paths I E.g. b has a support of 1 (note this really means be has a

support of 1). i.e. there is only 1 transaction containing b and e so be is infrequent can remove b.

Question: why were c and d not removed?

Example (continued)

I 4. Use the the conditional FP-tree for e to nd frequent

itemsets ending in de, ce and ae

I Note that be is not considered as b is not in the conditional

FP-tree for e.

I For each of them (e.g. de), nd the prex paths from the

conditional tree for e, extract frequent itemsets, generate conditional FP-tree, etc... (recursive)

I Example: e → de → ade ({d, e},{a, d, e} are found to be

(18)

Example (continued)

I 4. Use the the conditional FP-tree for e to nd frequent

itemsets ending in de, ce and ae

I Example: e → ce ({c, e} is found to be frequent)

I etc... (ae, then do the whole thing for b,... etc)

Result

I Frequent itemsets found (ordered by sux and order in which