
Goal: Provide an overview of basic Association


Academic year: 2019


Association Rules Outline

Goal: Provide an overview of basic Association Rule mining techniques

● Association Rules Problem Overview

• Large itemsets

● Association Rules Algorithms

• Apriori

• Sampling

• Partitioning

• Parallel Algorithms

● Comparing Techniques

● Incremental Algorithms

● Advanced AR Techniques


Market basket analysis


Example: Market Basket Data

● Items frequently purchased together:

Bread ⇒ PeanutButter

● Uses:

• Placement

• Advertising

• Sales

• Coupons

● Objective: increase sales and reduce costs


Association Rule Definitions

Set of items: $I = \{I_1, I_2, \ldots, I_m\}$

Transactions: $D = \{t_1, t_2, \ldots, t_n\}$, $t_j \subseteq I$

Itemset: $\{I_{i_1}, I_{i_2}, \ldots, I_{i_k}\} \subseteq I$

Support of an itemset: Percentage of transactions which contain that itemset.

Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.


Association Rules Example

I = { Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread, PeanutButter} is 60%


Association Rule Definitions

Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅

Support of AR (s) X ⇒ Y: Percentage of transactions that contain X ∪ Y.

P(X,Y) = σ(X ∪ Y) / |T|

Confidence of AR (α) X ⇒ Y: Ratio of number of transactions that contain X ∪ Y to the number that contain X.

P(Y|X) = σ(X ∪ Y) / σ(X)


Association Rules Ex (cont’d)


Example

500,000 transactions

20,000 transactions contain diapers

30,000 transactions contain milk

10,000 transactions contain both diapers & milk

[Venn diagram, counts in thousands: diapers only 10, both 10, milk only 20, neither 460]

Rule              Support   Confidence
Diapers ⇒ Milk    0.02      0.50
Milk ⇒ Diapers    0.02      0.33
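The support and confidence values for this example follow directly from the four counts given on the slide. As a worked check, using the definitions s = σ(X ∪ Y) / |T| and α = σ(X ∪ Y) / σ(X):

```latex
s(\text{Diapers} \Rightarrow \text{Milk}) = s(\text{Milk} \Rightarrow \text{Diapers})
  = \frac{10{,}000}{500{,}000} = 0.02 \\
\alpha(\text{Diapers} \Rightarrow \text{Milk}) = \frac{10{,}000}{20{,}000} = 0.50 \\
\alpha(\text{Milk} \Rightarrow \text{Diapers}) = \frac{10{,}000}{30{,}000} \approx 0.33
```

Both rules share the same support because they are built from the same itemset; only the denominator of the confidence changes.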


Example

10,000 transactions contain wipes

8,000 transactions contain wipes & diapers

220 transactions contain wipes & milk

200 transactions contain wipes & diapers & milk

[Venn diagram: milk, diapers and wipes]


Association Rule Problem

● Given a set of items $I = \{I_1, I_2, \ldots, I_m\}$ and a database of transactions $D = \{t_1, t_2, \ldots, t_n\}$ where $t_i = \{I_{i1}, I_{i2}, \ldots, I_{ik}\}$ and $I_{ij} \in I$, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.

● Link Analysis

NOTE: Support of X ⇒ Y is same as support of X ∪ Y.


Association Rule Techniques

1. Find Large Itemsets.

2. Generate rules from frequent itemsets.


Association Rule Mining

● Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Cocoa, Eggs
3    Milk, Diaper, Cocoa, Coke
4    Bread, Milk, Diaper, Cocoa
5    Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Cocoa}
{Milk, Bread} → {Eggs, Coke}
{Cocoa, Bread} → {Milk}

Implication means co-occurrence, not causality!


Definition: Frequent Itemset

Itemset

• A collection of one or more items

• Example: {Milk, Bread, Diaper}

• k-itemset

• An itemset that contains k items

Support count (σ)

• Frequency of occurrence of an itemset

• E.g. σ({Milk, Bread,Diaper}) = 2

Support

• Fraction of transactions that contain an itemset

• E.g. s({Milk, Bread, Diaper}) = 2/5

Tan, Steinbach, Kumar

Frequent Itemset

• An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

Association Rule

• An implication expression of the form X → Y, where X and Y are itemsets

• Example: {Milk, Diaper} → {Cocoa}

Rule Evaluation Metrics

• Support (s)

• Fraction of transactions that contain both X and Y

• Confidence (c)

• Measures how often items in Y appear in transactions that contain X
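The metrics above can be checked against the five market-basket transactions used throughout these slides. The following is a minimal Python sketch; the function names `support_count`, `support` and `confidence` are illustrative choices and do not come from the slides.

```python
# Market-basket transactions from the slides (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    both = support_count(set(x) | set(y), transactions)
    return both / support_count(x, transactions)


print(support_count({"Milk", "Bread", "Diaper"}, transactions))           # 2
print(support({"Milk", "Diaper", "Cocoa"}, transactions))                 # 0.4
print(round(confidence({"Milk", "Diaper"}, {"Cocoa"}, transactions), 2))  # 0.67
```

For the rule {Milk, Diaper} → {Cocoa}, this gives s = 2/5 = 0.4 and c = 2/3 ≈ 0.67.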


Association Rule Mining Task

● Given a set of transactions T, the goal of association rule mining is to find all rules having

• support ≥ minsup threshold

• confidence ≥ minconf threshold

● Brute-force approach:

• List all possible association rules

• Compute the support and confidence for each rule

• Prune rules that fail the minsup and minconf thresholds


Mining Association Rules

Example of Rules:

{Milk,Diaper} → {Cocoa} (s=0.4, c=0.67)
{Milk,Cocoa} → {Diaper} (s=0.4, c=1.0)
{Diaper,Cocoa} → {Milk} (s=0.4, c=0.67)
{Cocoa} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Cocoa} (s=0.4, c=0.5)
{Milk} → {Diaper,Cocoa} (s=0.4, c=0.5)

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Cocoa}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements
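The observations can be reproduced by enumerating every binary partition of {Milk, Diaper, Cocoa} over the five market-basket transactions. This is a small sketch only; the helper name `rules_from_itemset` is hypothetical and not part of the slides.

```python
from itertools import combinations

# Market-basket transactions from the slides (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def rules_from_itemset(itemset):
    """Yield (X, Y, support, confidence) for each binary partition X -> Y."""
    items = sorted(itemset)
    n = len(transactions)
    full = support_count(items)  # identical numerator for every rule
    for r in range(1, len(items)):
        for x in combinations(items, r):
            y = tuple(i for i in items if i not in x)
            x_count = support_count(x)
            if x_count == 0:  # antecedent never occurs, confidence undefined
                continue
            yield x, y, full / n, full / x_count


rules = list(rules_from_itemset({"Milk", "Diaper", "Cocoa"}))
for x, y, s, c in rules:
    print(f"{sorted(x)} -> {sorted(y)} (s={s:.1f}, c={c:.2f})")
```

Every rule reports s = 0.4 because the numerator σ(X ∪ Y) is the same for all partitions; only σ(X) in the confidence denominator varies.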


Mining Association Rules

● Two-step approach:

– Frequent Itemset Generation

  – Generate all itemsets whose support ≥ minsup

– Rule Generation

  – Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

● Frequent itemset generation is still computationally expensive


Frequent Itemset Generation

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given d items, there are $2^d$ possible candidate itemsets


Frequent Itemset Generation

● Brute-force approach:

• Each itemset in the lattice is a candidate frequent itemset

• Match each transaction against every candidate

• Complexity ~ O(NMw) => Expensive since M = $2^d$ !!! (more precisely, M = $2^d - 1$ non-empty itemsets)

[Figure: N transactions, each of width up to w, matched against a list of M candidates]
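To make the cost of the brute-force approach concrete, the sketch below enumerates every non-empty itemset over the six items of the market-basket example and matches each one against every transaction. The function name `brute_force_frequent` and the minimum support count of 3 are assumptions chosen for illustration.

```python
from itertools import combinations

# Market-basket transactions from the slides (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))  # d = 6 distinct items


def brute_force_frequent(transactions, items, min_count):
    """Count every candidate itemset against every transaction."""
    frequent = {}
    candidates = 0   # M: number of candidate itemsets generated
    comparisons = 0  # N * M: transaction/candidate matches performed
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            candidates += 1
            count = 0
            for t in transactions:
                comparisons += 1
                if set(cand) <= t:
                    count += 1
            if count >= min_count:
                frequent[cand] = count
    return frequent, candidates, comparisons


frequent, m, comparisons = brute_force_frequent(transactions, items, min_count=3)
print(m)              # 63 candidates, i.e. 2**6 - 1
print(comparisons)    # 315 matches, i.e. N * M = 5 * 63
print(len(frequent))  # 8 frequent itemsets
```

Even on this toy data, 63 candidates are counted to find 8 frequent itemsets; each extra item doubles the number of candidates.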


Computational Complexity

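● Given d unique items:

• Total number of itemsets = $2^d$

• Total number of possible association rules:

$R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$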
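The number of possible rules over d items has the closed form 3^d − 2^(d+1) + 1, a standard counting result for pairs of disjoint, non-empty itemsets X and Y. The sketch below checks the closed form against direct enumeration for small d; the function name `count_rules` is illustrative.

```python
from itertools import combinations


def count_rules(d):
    """Count rules X -> Y (X, Y non-empty and disjoint) over d items."""
    items = range(d)
    total = 0
    for k in range(1, d):  # size of the antecedent X
        for x in combinations(items, k):
            rest = [i for i in items if i not in x]
            # every non-empty subset of the remaining items is a valid Y
            for j in range(1, len(rest) + 1):
                total += sum(1 for _ in combinations(rest, j))
    return total


for d in (2, 3, 6):
    print(d, 2 ** d, count_rules(d), 3 ** d - 2 ** (d + 1) + 1)
```

For d = 6 this gives 64 itemsets but 602 possible rules, which is why listing all rules is impractical for realistic item counts.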


Frequent Itemset Generation Strategies

● Reduce the number of candidates (M)

• Complete search: M = $2^d$

• Use pruning techniques to reduce M

● Reduce the number of transactions (N)

• Reduce size of N as the size of itemset increases

• Used by DHP and vertical-based mining algorithms

● Reduce the number of comparisons (NM)

• Use efficient data structures to store the candidates or transactions

• No need to match every candidate against every transaction


Apriori

Large Itemset Property:

Any subset of a large itemset is large.

● Contrapositive:

If an itemset is not large, none of its supersets are large.


Reducing Number of Candidates

● Apriori principle:

If an itemset is frequent, then all of its subsets must also be frequent

● Apriori principle holds due to the following property of the support measure:

$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$

• Support of an itemset never exceeds the support of its subsets

• This is known as the anti-monotone property of support


Large Itemset Property


Apriori Ex (cont’d)

s=30% α = 50%


Illustrating Apriori Principle

[Figure: itemset lattice over items A to E (null; A … E; AB … DE; ABC … CDE; ABCD … BCDE; ABCDE). Once an itemset is found to be infrequent, all of its supersets are pruned.]


Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets)

Item     Count
Bread    4
Coke     2
Milk     4
Cocoa    3
Diaper   4
Eggs     1

Pairs (2-itemsets) (No need to generate candidates involving Coke or Eggs)

Itemset           Count
{Bread,Milk}      3
{Bread,Cocoa}     2
{Bread,Diaper}    3
{Milk,Cocoa}      2
{Milk,Diaper}     3
{Cocoa,Diaper}    3

Triplets (3-itemsets)

Itemset               Count
{Bread,Milk,Diaper}   2

If every subset is considered, 6C1 + 6C2 + 6C3 = 41

With support-based pruning, 6 + 6 + 1 = 13
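The two candidate totals on this slide can be checked by hand. Without pruning, every 1-, 2- and 3-itemset over the six items is a candidate. With support-based pruning at a minimum support count of 3, only the four frequent items (Bread, Milk, Diaper, Cocoa) feed the pair level, and only one triplet has all of its pairs frequent:

```latex
\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41
\qquad \text{versus} \qquad
6 + \binom{4}{2} + 1 = 6 + 6 + 1 = 13
```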


The Apriori Algorithm—An Example

[Figure: Database TDB, 1st scan]


Apriori Algorithm

C_1 = Itemsets of size one in I;
Determine all large itemsets of size 1, L_1;
i = 1;
Repeat
    i = i + 1;
    C_i = Apriori-Gen(L_{i-1});
    Count C_i to determine L_i;
until no more large itemsets found;


Apriori Algorithm

● Method:

• Let k=1

• Generate frequent itemsets of length 1

• Repeat until no new frequent itemsets are identified

  • Generate length (k+1) candidate itemsets from length k frequent itemsets

  • Prune candidate itemsets containing subsets of length k that are infrequent

  • Count the support of each candidate by scanning the DB

  • Eliminate candidates that are infrequent, leaving only those that are frequent
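The method above can be sketched in a few lines of Python. This is a simplified illustration rather than a tuned implementation: candidates are generated by taking pairwise unions of frequent k-itemsets (a simplification of the usual prefix-based join), and the function name `apriori` and the `min_count` parameter are assumptions of this sketch.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(cands):
        # One scan of the database per level
        return {c: sum(1 for t in transactions if c <= t) for c in cands}

    # k = 1: frequent itemsets of length 1
    singles = {frozenset([i]) for t in transactions for i in t}
    level = {c: n for c, n in count(singles).items() if n >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        prev = set(level)
        # Generate length (k+1) candidates from frequent k-itemsets
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent subset of length k
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k))}
        # Count support and eliminate infrequent candidates
        level = {c: n for c, n in count(cands).items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


# Market-basket transactions from the slides (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(transactions, min_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 3, the run finds four frequent items and four frequent pairs; the only surviving 3-item candidate, {Bread, Milk, Diaper}, occurs in two transactions and is eliminated.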


● Problems with support and confidence: a rule may have high support and high confidence simply because it is obvious. For example, someone who buys potato chips is highly likely to buy a soft drink, which is not surprising.

● Confidence of a rule A ⇒ B totally ignores P(B)

P(A,B) = σ(A ∪ B) / |T|


Interestingness Measure: Correlations (Lift)

● Lift (Wipes, Milk ⇒ Diapers) is 22.73

● People who purchased Wipes & Milk are 22.73 times more likely to also purchase Diapers than customers overall.

● Measure of dependent/correlated events: lift (Han & Kamber)

lift(A ⇒ B) = P(A,B) / (P(A) P(B))

• > 1: positively correlated

• = 1: independent
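A minimal sketch of the lift computation for the Wipes, Milk ⇒ Diapers rule, assuming the usual definition lift(X ⇒ Y) = P(X,Y) / (P(X) P(Y)) and the counts from the earlier example slides (500,000 transactions, 20,000 with diapers, 220 with wipes & milk, 200 with all three). The function name `lift` and its parameter names are illustrative.

```python
def lift(n_xy, n_x, n_y, n_total):
    """lift(X -> Y) = P(X, Y) / (P(X) * P(Y)), computed from counts."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return p_xy / (p_x * p_y)


# X = {Wipes, Milk}, Y = {Diapers}
value = lift(n_xy=200, n_x=220, n_y=20_000, n_total=500_000)
print(round(value, 2))  # 22.73
```

Equivalently, lift is the rule's confidence (200 / 220 ≈ 0.909) divided by the baseline probability of the consequent (20,000 / 500,000 = 0.04). A value below 1 would indicate that X and Y occur together less often than expected under independence.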


● play basketball ⇒ eat cereal [40%, 66.7%] is misleading

• The overall % of students eating cereal is 75% > 66.7%.

● play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
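Lift makes the point numerically. The percentages on this slide imply 5000 students, 3000 of whom play basketball (2000 / 0.667), 3750 who eat cereal and 1250 who do not. Assuming the usual definition lift(A ⇒ B) = P(A,B) / (P(A) P(B)), a worked check gives:

```latex
\mathrm{lift}(\text{basketball} \Rightarrow \text{cereal})
  = \frac{2000/5000}{(3000/5000)\,(3750/5000)}
  = \frac{0.40}{0.60 \times 0.75} \approx 0.89 \\
\mathrm{lift}(\text{basketball} \Rightarrow \text{not cereal})
  = \frac{1000/5000}{(3000/5000)\,(1250/5000)}
  = \frac{0.20}{0.60 \times 0.25} \approx 1.33
```

The first rule has lift below 1 despite its higher confidence, so playing basketball and eating cereal are negatively correlated; the second rule has lift above 1.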


Conviction = 1, independent (not related)


Example

[Table: contingency table for the osteoporosis example; surviving entries: 76 No osteoporosis, 38 No osteoporosis]


● DOF (df) = (2-1)(2-1) = 1

● Significance level, α = 0.01

● χ² = 6.63 (tabulated)

● Null hypothesis: A & B are not correlated

● Calculate expected values

● Calculate chi-square value

● Compare calculated & tabulated values
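The three calculation steps can be sketched as follows. Because the contingency table for the slide's own example did not survive, this sketch uses an illustrative 2×2 table (an assumption): the basketball/cereal counts implied by the percentages quoted earlier in the deck. The function name `chi_square` is hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Illustrative data (assumption): rows = cereal / not cereal,
# columns = basketball / not basketball.
observed = [[2000, 1750],
            [1000, 250]]

chi2 = chi_square(observed)
critical = 6.63  # tabulated value for df = 1, alpha = 0.01 (from the slide)
print(round(chi2, 2))   # 277.78
print(chi2 > critical)  # True: reject the null hypothesis of no correlation
```

The expected counts here are 2250, 1500, 750 and 500, so the statistic is 62500/2250 + 62500/1500 + 62500/750 + 62500/500 ≈ 277.78, far above the tabulated 6.63.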
