Academic year: 2019


Association Rules Outline

Goal: Provide an overview of basic Association Rule mining techniques

● Association Rules Problem Overview

• Large itemsets

● Association Rules Algorithms

• Apriori

• Sampling

• Partitioning

• Parallel Algorithms

● Comparing Techniques

● Incremental Algorithms

● Advanced AR Techniques


Market basket analysis


Example: Market Basket Data

● Items frequently purchased together: Bread ⇒ PeanutButter

● Uses:

• Placement

• Advertising

• Sales

• Coupons

● Objective: increase sales and reduce costs

Association Rule Definitions

Set of items: I = {I1, I2, …, Im}

Transactions: D = {t1, t2, …, tn}, tj ⊆ I

Itemset: {Ii1, Ii2, …, Iik} ⊆ I

Support of an itemset: percentage of transactions which contain that itemset.

Large (Frequent) itemset: itemset whose number of occurrences is above a threshold.


Association Rules Example

I = { Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread, PeanutButter} is 60%

Association Rule Definitions

Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅

Support of an AR, s(X ⇒ Y): percentage of transactions that contain X ∪ Y.

s = P(X, Y) = σ(X ∪ Y) / |T|

Confidence of an AR, α(X ⇒ Y): ratio of the number of transactions that contain X ∪ Y to the number that contain X.

α = P(Y | X) = σ(X ∪ Y) / σ(X)
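These two definitions are short enough to sketch directly in Python. The five transactions below are hypothetical (the slide deck's own transaction table for this example did not survive extraction); they are chosen only so that support({Bread, PeanutButter}) comes out to 60%, matching the example slide.

```python
# Support and confidence computed from the definitions above.
# NOTE: the transaction set D is a hypothetical illustration.
D = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """sigma(X ∪ Y) / sigma(X): how often Y appears when X does."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Bread", "PeanutButter"}, D))             # 0.6
print(round(confidence({"Bread"}, {"PeanutButter"}, D), 2))  # 0.75
```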


Association Rules Ex (cont’d)

Example

500,000 transactions
20,000 transactions contain diapers
30,000 transactions contain milk
10,000 transactions contain both diapers and milk

(Venn diagram, counts in thousands: 10 diapers only, 10 both, 20 milk only, 460 neither)

Rule             Support   Confidence
Diapers ⇒ Milk   0.02      0.50
Milk ⇒ Diapers   0.02      0.33
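The support and confidence figures in this example can be checked mechanically from the four counts given:

```python
# Recomputing the example: 500,000 transactions in total,
# sigma(diapers) = 20,000, sigma(milk) = 30,000, sigma(both) = 10,000.
n, diapers, milk, both = 500_000, 20_000, 30_000, 10_000

support = both / n                       # same for both rule directions
conf_diapers_to_milk = both / diapers    # Diapers => Milk
conf_milk_to_diapers = both / milk       # Milk => Diapers

print(support, conf_diapers_to_milk, round(conf_milk_to_diapers, 2))
# 0.02 0.5 0.33
```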

Example (cont'd)

10,000 transactions contain wipes
8,000 transactions contain wipes and diapers
220 transactions contain wipes and milk
200 transactions contain wipes, diapers, and milk

Association Rule Problem

● Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.

● Also called Link Analysis.

NOTE: The support of X ⇒ Y is the same as the support of X ∪ Y.


Association Rule Techniques

1. Find Large Itemsets.

2. Generate rules from frequent itemsets.


Association Rule Mining

● Given a set of transactions, find rules that will predict the

occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Cocoa, Eggs

3 Milk, Diaper, Cocoa, Coke

4 Bread, Milk, Diaper, Cocoa

5 Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Cocoa}
{Milk, Bread} → {Eggs, Coke}
{Cocoa, Bread} → {Milk}

Implication means co-occurrence, not causality!


Definition: Frequent Itemset

Itemset

• A collection of one or more items

• Example: {Milk, Bread, Diaper}

• k-itemset

• An itemset that contains k items

Support count (σ)

• Frequency of occurrence of an itemset

• E.g. σ({Milk, Bread,Diaper}) = 2

Support

• Fraction of transactions that contain an itemset

• E.g. s({Milk, Bread, Diaper}) = 2/5

Tan,Steinbach, Kumar


Frequent Itemset

An itemset whose support is greater than or equal to a minsup threshold.


Definition: Association Rule

Example:

Association Rule

• An implication expression of the form X → Y, where X and Y are itemsets

• Example: {Milk, Diaper} → {Cocoa}

Rule Evaluation Metrics

• Support (s)

• Fraction of transactions that contain both X and Y

• Confidence (c)

• Measures how often items in Y appear in transactions that contain X

Tan,Steinbach, Kumar



Association Rule Mining Task

● Given a set of transactions T, the goal of

association rule mining is to find all rules having

• support ≥ minsup threshold

• confidence ≥ minconf threshold

● Brute-force approach:

• List all possible association rules

• Compute the support and confidence for each rule

• Prune rules that fail the minsup and minconf

thresholds
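The brute-force approach above can be sketched directly over the five market-basket transactions from the earlier slide: enumerate every rule X → Y with disjoint, non-empty sides, compute its support and confidence, and keep only rules meeting the thresholds. The minsup/minconf values here are illustrative.

```python
from itertools import combinations

# The five transactions from the market-basket slide.
D = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*D))

def sup(S):
    """Fraction of transactions containing all items in S."""
    return sum(S <= t for t in D) / len(D)

minsup, minconf = 0.4, 0.6   # illustrative thresholds
rules = []
for k in range(2, len(items) + 1):
    for itemset in combinations(items, k):
        s = sup(set(itemset))
        if s < minsup:
            continue
        # Every binary partition of the itemset is a candidate rule.
        for i in range(1, k):
            for X in combinations(itemset, i):
                Y = set(itemset) - set(X)
                c = s / sup(set(X))
                if c >= minconf:
                    rules.append((set(X), Y, s, c))

# e.g. {Milk, Diaper} -> {Cocoa} with s = 0.4, c ~ 0.67 survives pruning
print(len(rules))
```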


Mining Association Rules

Example of Rules:

{Milk, Diaper} → {Cocoa}   (s=0.4, c=0.67)
{Milk, Cocoa} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Cocoa} → {Milk}   (s=0.4, c=0.67)
{Cocoa} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Cocoa}   (s=0.4, c=0.5)
{Milk} → {Diaper, Cocoa}   (s=0.4, c=0.5)

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Cocoa}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements

Tan,Steinbach, Kumar



Mining Association Rules

● Two-step approach:

– Frequent Itemset Generation: generate all itemsets whose support ≥ minsup

– Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

● Frequent itemset generation is still

computationally expensive


Frequent Itemset Generation

null
A  B  C  D  E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets.


Frequent Itemset Generation

● Brute-force approach:

• Each itemset in the lattice is a candidate frequent itemset

• The number of candidates is M = 2^d − 1 (all non-empty itemsets)

• Match each of the N transactions (maximum width w) against every candidate

• Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!

Tan, Steinbach, Kumar


Computational Complexity

● Given d unique items:

• Total number of itemsets = 2^d

• Total number of possible association rules: R = 3^d − 2^(d+1) + 1 (e.g. R = 602 for d = 6)
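The rule count follows from assigning each of the d items to the antecedent, the consequent, or neither (3^d assignments), then removing the assignments in which the antecedent or consequent ends up empty. A quick check:

```python
# R = 3^d - 2^(d+1) + 1: each item goes to X, to Y, or to neither
# (3^d assignments), minus those where X or Y is empty.
def num_rules(d):
    return 3**d - 2**(d + 1) + 1

print(num_rules(6))  # 602
print(num_rules(2))  # 2  (only A -> B and B -> A)
```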


Frequent Itemset Generation

Strategies

● Reduce the number of candidates (M)

• Complete search: M=2d

• Use pruning techniques to reduce M

● Reduce the number of transactions (N)

• Reduce size of N as the size of itemset increases

• Used by DHP and vertical-based mining algorithms

● Reduce the number of comparisons (NM)

• Use efficient data structures to store the candidates or transactions

• No need to match every candidate against every transaction


Apriori

Large Itemset Property:

Any subset of a large itemset is large.

● Contrapositive:

If an itemset is not large,

none of its supersets are large.


Reducing Number of Candidates

● Apriori principle:

If an itemset is frequent, then all of its subsets must also

be frequent

● The Apriori principle holds due to the following property of the support measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Support of an itemset never exceeds the support of its subsets

• This is known as the anti-monotone property of support


Large Itemset Property


Apriori Ex (cont’d)

s=30% α = 50%


Illustrating Apriori Principle

null
A  B  C  D  E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Once an itemset in the lattice is found to be infrequent, all of its supersets can be pruned without being counted.


Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):

Item     Count
Bread    4
Cocoa    3
Milk     4
Diaper   4
Eggs     1
Coke     2

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:

Itemset           Count
{Bread, Milk}     3
{Bread, Cocoa}    2
{Bread, Diaper}   3
{Milk, Cocoa}     2
{Milk, Diaper}    3
{Cocoa, Diaper}   3

Triplets (3-itemsets):

Itemset                 Count
{Bread, Milk, Diaper}   3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41
With support-based pruning: 6 + 6 + 1 = 13

Tan, Steinbach, Kumar

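The two candidate counts quoted on this slide are easy to verify:

```python
from math import comb

# Without pruning: every subset of the 6 items up to size 3.
unpruned = comb(6, 1) + comb(6, 2) + comb(6, 3)  # 6 + 15 + 20 = 41

# With support-based pruning, the slide counts 6 + 6 + 1 candidates.
pruned = 6 + 6 + 1

print(unpruned, pruned)  # 41 13
```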


The Apriori Algorithm—An Example



Apriori Algorithm

C1 = itemsets of size one in I;
Determine all large itemsets of size 1, L1;
i = 1;
repeat
    i = i + 1;
    Ci = Apriori-Gen(Li−1);
    count Ci to determine Li;
until no more large itemsets are found;


Apriori Algorithm

● Method:

• Let k = 1

• Generate frequent itemsets of length 1

• Repeat until no new frequent itemsets are identified:

   • Generate length (k+1) candidate itemsets from length-k frequent itemsets

   • Prune candidate itemsets containing any length-k subset that is infrequent

   • Count the support of each candidate by scanning the DB

   • Eliminate candidates that are infrequent, leaving only those that are frequent
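A minimal, self-contained Python sketch of the loop above, with candidate generation via self-join plus subset pruning. The transaction set is taken from the earlier market-basket slides; the minsup value is illustrative.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: L1 from single items, then repeated
    candidate generation / counting until no new frequent itemsets."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def freq(cands):
        # Keep only candidates whose support meets minsup.
        return {c for c in cands
                if sum(c <= t for t in transactions) / n >= minsup}

    L = freq({frozenset([i]) for i in items})   # frequent 1-itemsets
    all_frequent = set(L)
    k = 1
    while L:
        k += 1
        # Apriori-Gen: join L_{k-1} with itself, keep size-k unions...
        cands = {a | b for a in L for b in L if len(a | b) == k}
        # ...whose (k-1)-subsets are all frequent (pruning step).
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = freq(cands)
        all_frequent |= L
    return all_frequent

D = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Cocoa", "Eggs"},
     {"Milk", "Diaper", "Cocoa", "Coke"},
     {"Bread", "Milk", "Diaper", "Cocoa"},
     {"Bread", "Milk", "Diaper", "Coke"}]
F = apriori([frozenset(t) for t in D], minsup=0.4)
print(len(F))  # 17 frequent itemsets at minsup = 0.4
```

{Bread, Milk, Diaper} appears among the results, while {Eggs} (support 1/5) is eliminated in the first pass.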


● Problems with support and confidence: a rule may have high support and high confidence simply because it is an obvious rule. Someone who buys potato chips is highly likely to buy a soft drink → not surprising.

● Confidence totally ignores P(B)

P(A, B) = σ(A ∪ B) / |T|


Interestingness Measure: Correlations (Lift)

● Measure of dependent/correlated events:

lift(A ⇒ B) = P(A ∪ B) / (P(A) · P(B)) = confidence(A ⇒ B) / P(B)

lift > 1: positively correlated; lift = 1: independent

● Lift(Wipes, Milk ⇒ Diapers) is 22.73

● People who purchased Wipes & Milk are 22.73 times more likely to also purchase Diapers than the population at large.

Han & Kamber
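The 22.73 figure can be reproduced from the counts in the earlier wipes example:

```python
# lift(Wipes, Milk => Diapers) = confidence / P(Diapers), using the
# counts from the earlier example (500,000 transactions in total).
n = 500_000
wipes_and_milk = 220        # transactions with wipes and milk
wipes_milk_diapers = 200    # ... that also contain diapers
diapers = 20_000

conf = wipes_milk_diapers / wipes_and_milk   # ~0.909
lift = conf / (diapers / n)                  # ~0.909 / 0.04

print(round(lift, 2))  # 22.73
```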


play basketball ⇒ eat cereal [40%, 66.7%] is misleading

• The overall % of students eating cereal is 75% > 66.7%.

play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col)    3000         2000             5000


Conviction: conv(X ⇒ Y) = P(X) · P(¬Y) / P(X, ¬Y). Conviction = 1 means X and Y are independent (not related).

(36)
(37)

37

Example

(2×2 contingency table of osteoporosis incidence; only the "No osteoporosis" counts, 76 and 38, survive)


● DOF (df) = (2-1)(2-1) = 1

● significance level, α = 0.01

● χ2 = 6.63 (tabulated)

● Null hypothesis: A & B are not correlated

● Calculate expected values

● Calculate chi-square value

● Compare calculated & tabulated values
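The steps above can be sketched end to end on a hypothetical 2×2 table (the slide's own osteoporosis counts did not fully survive extraction, so the observed values here are purely illustrative):

```python
# Chi-square test of independence, following the slide's steps.
# NOTE: these observed counts are hypothetical, for illustration only.
observed = [[30, 70],   # row 1, e.g. group A: condition / no condition
            [6, 44]]    # row 2, e.g. group B: condition / no condition

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
total = sum(row_totals)

# Expected cell values under the null hypothesis of no correlation,
# then the chi-square statistic.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected

# df = (2-1)(2-1) = 1; tabulated critical value at alpha = 0.01 is 6.63.
print(round(chi2, 2), chi2 > 6.63)  # 5.92 False
```

Here the calculated value (5.92) does not exceed the tabulated value (6.63), so for this particular table the null hypothesis of no correlation cannot be rejected at α = 0.01.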

