Association Rules Outline
Goal: Provide an overview of basic Association Rule mining techniques
● Association Rules Problem Overview
• Large itemsets
● Association Rules Algorithms
• Apriori
• Sampling
• Partitioning
• Parallel Algorithms
● Comparing Techniques
● Incremental Algorithms
● Advanced AR Techniques
Market basket analysis
Example: Market Basket Data
● Items frequently purchased together:
Bread ⇒ PeanutButter
● Uses:
• Placement
• Advertising
• Sales
• Coupons
● Objective: increase sales and reduce costs
Association Rule Definitions
● Set of items: I = {I1, I2, …, Im}
● Transactions: D = {t1, t2, …, tn}, tj ⊆ I
● Itemset: {Ii1, Ii2, …, Iik} ⊆ I
● Support of an itemset: Percentage of
transactions which contain that itemset.
● Large (Frequent) itemset: Itemset whose
number of occurrences is above a threshold.
Association Rules Example
I = { Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
Association Rule Definitions
● Association Rule (AR): implication X ⇒ Y
where X, Y ⊆ I and X ∩ Y = ∅
● Support of AR (s) X ⇒ Y: Percentage of transactions that contain X ∪ Y.
• P(X, Y) = σ(X ∪ Y) / |T|
● Confidence of AR (α) X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X.
• P(Y|X) = σ(X ∪ Y) / σ(X)
Association Rules Ex (cont’d)
Example
● 500,000 transactions
● 20,000 transactions contain diapers
● 30,000 transactions contain milk
● 10,000 transactions contain both diapers & milk
(Venn diagram, counts in thousands: milk only = 20, diapers only = 10, both = 10, neither = 460)
Diapers ⇒ Milk: confidence = 0.50, support = 0.02
Milk ⇒ Diapers: confidence = 0.33, support = 0.02
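The arithmetic behind these figures, as a minimal Python sketch (counts hardcoded from this example; the variable names are ours):

```python
# Counts from the example above
total = 500_000     # all transactions
diapers = 20_000    # transactions containing diapers
milk = 30_000       # transactions containing milk
both = 10_000       # transactions containing diapers and milk

# Support of X => Y is P(X, Y) = sigma(X u Y) / |T|
support = both / total                # 0.02 for both rules

# Confidence of X => Y is P(Y | X) = sigma(X u Y) / sigma(X)
conf_diapers_milk = both / diapers    # 0.50 for Diapers => Milk
conf_milk_diapers = both / milk       # ~0.33 for Milk => Diapers

print(support, conf_diapers_milk, conf_milk_diapers)
```

Note how the two rules share the same support but differ in confidence because the antecedent counts differ.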
Example
● 10,000 transactions contain wipes
● 8,000 transactions contain wipes & diapers
● 220 transactions contain wipes & milk
● 200 transactions contain wipes & diapers & milk
(Venn diagram of milk, diapers, and wipes)
Association Rule Problem
● Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
● Link Analysis
● NOTE: The support of X ⇒ Y is the same as the support of X ∪ Y.
Association Rule Techniques
1. Find Large Itemsets.
2. Generate rules from frequent itemsets.
Association Rule Mining
● Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Cocoa}
{Milk, Bread} → {Eggs, Coke}
{Cocoa, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
● Itemset
• A collection of one or more items
• Example: {Milk, Bread, Diaper}
• k-itemset
• An itemset that contains k items
● Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread, Diaper}) = 2
● Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
Tan, Steinbach, Kumar
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
● Association Rule
• An implication expression of the form X → Y, where X and Y are itemsets
• Example: {Milk, Diaper} → {Cocoa}
● Rule Evaluation Metrics
• Support (s)
• Fraction of transactions that contain both X and Y
• Confidence (c)
• Measures how often items in Y appear in transactions that contain X
• For the example: s = σ({Milk, Diaper, Cocoa}) / |T| = 2/5 = 0.4, c = σ({Milk, Diaper, Cocoa}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
Tan, Steinbach, Kumar
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Association Rule Mining Task
● Given a set of transactions T, the goal of
association rule mining is to find all rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold
● Brute-force approach (see the sketch below):
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
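A brute-force sketch of this approach in Python (illustrative only; it enumerates rules over the 5-transaction market-basket table used throughout these slides, with hypothetical minsup/minconf values):

```python
from itertools import combinations

# The 5-transaction market-basket table from the slides
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))
N = len(transactions)

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

minsup, minconf = 0.4, 0.6  # hypothetical thresholds

# Enumerate every itemset, then every binary partition X -> Y of it
for k in range(2, len(items) + 1):
    for itemset in map(set, combinations(items, k)):
        s = sigma(itemset) / N
        if s < minsup:
            continue  # rule support = itemset support, so prune here
        for i in range(1, k):
            for X in map(set, combinations(sorted(itemset), i)):
                c = sigma(itemset) / sigma(X)
                if c >= minconf:
                    print(sorted(X), "->", sorted(itemset - X),
                          f"(s={s:.2f}, c={c:.2f})")
```

Even on six items this scans the database for every candidate, which is exactly the cost the rest of the lecture works to avoid.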
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Cocoa} (s=0.4, c=0.67)
{Milk, Cocoa} → {Diaper} (s=0.4, c=1.0)
{Diaper, Cocoa} → {Milk} (s=0.4, c=0.67)
{Cocoa} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Cocoa} (s=0.4, c=0.5)
{Milk} → {Diaper, Cocoa} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Cocoa}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Tan, Steinbach, Kumar
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Mining Association Rules
● Two-step approach:
– Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
– Rule Generation (see the sketch after this list)
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
● Frequent itemset generation is still
computationally expensive
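A sketch of the rule-generation step in Python (it assumes the frequent itemsets and their support counts are already available, e.g. from the Apriori algorithm described later; function and variable names are ours):

```python
from itertools import combinations

def generate_rules(freq_counts, n_transactions, minconf):
    """Yield (X, Y, support, confidence) from frequent itemsets.

    freq_counts: dict {frozenset: support count}, assumed to contain
    every frequent itemset together with all of its subsets.
    """
    for itemset, count in freq_counts.items():
        if len(itemset) < 2:
            continue
        s = count / n_transactions
        # Each rule is a binary partition X -> (itemset - X)
        for i in range(1, len(itemset)):
            for X in map(frozenset, combinations(itemset, i)):
                c = count / freq_counts[X]
                if c >= minconf:
                    yield X, itemset - X, s, c

# Support counts taken from the 5-transaction table above
counts = {
    frozenset({"Milk"}): 4, frozenset({"Diaper"}): 4, frozenset({"Cocoa"}): 3,
    frozenset({"Milk", "Diaper"}): 3, frozenset({"Milk", "Cocoa"}): 2,
    frozenset({"Diaper", "Cocoa"}): 3,
    frozenset({"Milk", "Diaper", "Cocoa"}): 2,
}
for X, Y, s, c in generate_rules(counts, 5, minconf=0.6):
    print(sorted(X), "->", sorted(Y), f"(s={s:.2f}, c={c:.2f})")
```

Because every rule from the same itemset shares that itemset's support, the support test happens once per itemset and only confidence varies across its partitions.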
Frequent Itemset Generation
(Itemset lattice over d = 5 items A–E: null at the top; 1-itemsets A … E; 2-itemsets AB … DE; 3-itemsets ABC … CDE; 4-itemsets ABCD … BCDE; ABCDE at the bottom)
Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
● Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Match each transaction against every candidate
• Complexity ~ O(NMw): N transactions, M candidates, w = maximum transaction width
• Expensive since M = 2^d !!! (2^d − 1 non-empty candidates)
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Tan, Steinbach, Kumar
Computational Complexity
● Given d unique items:
• Total number of itemsets = 2^d
• Total number of possible association rules (checked in the sketch below):
R = 3^d − 2^(d+1) + 1
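A quick brute-force check of this closed form in Python (illustrative; it enumerates every rule X → Y with non-empty, disjoint X and Y):

```python
from itertools import combinations

def rule_count(d):
    """Brute-force count of rules X -> Y over d items,
    with X and Y non-empty and disjoint (X u Y is the rule's itemset)."""
    count = 0
    for i in range(2, d + 1):                    # size of X u Y
        for itemset in combinations(range(d), i):
            for j in range(1, i):                # size of X
                count += len(list(combinations(itemset, j)))
    return count

for d in range(2, 7):
    assert rule_count(d) == 3**d - 2**(d + 1) + 1
print(rule_count(6))  # 602 rules for d = 6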
Frequent Itemset Generation Strategies
● Reduce the number of candidates (M)
• Complete search: M = 2^d
• Use pruning techniques to reduce M
● Reduce the number of transactions (N)
• Reduce size of N as the size of itemset increases
• Used by DHP and vertical-based mining algorithms
● Reduce the number of comparisons (NM)
• Use efficient data structures to store the candidates or transactions
• No need to match every candidate against every transaction
Apriori
● Large Itemset Property:
Any subset of a large itemset is large.
● Contrapositive:
If an itemset is not large,
none of its supersets are large.
Reducing Number of Candidates
● Apriori principle:
• If an itemset is frequent, then all of its subsets must also
be frequent
● Apriori principle holds due to the following property
of the support measure:
• Support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Large Itemset Property
Apriori Ex (cont’d)
s = 30%, α = 50%
Illustrating Apriori Principle
(Itemset lattice over items A–E, shown twice: in the first, an itemset, e.g. {A, B}, is found to be infrequent; in the second, all of its supersets have been pruned)
Illustrating Apriori Principle
Minimum Support = 3
Items (1-itemsets):
Item Count
Bread 4
Cocoa 3
Milk 4
Coke 2
Diaper 4
Eggs 1
(No need to generate candidates involving Coke or Eggs)
Pairs (2-itemsets):
Itemset Count
{Bread, Milk} 3
{Bread, Cocoa} 2
{Bread, Diaper} 3
{Milk, Cocoa} 2
{Milk, Diaper} 3
{Cocoa, Diaper} 3
Triplets (3-itemsets):
Itemset Count
{Bread, Milk, Diaper} 3
If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning: 6 + 6 + 1 = 13
Tan, Steinbach, Kumar
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
The Apriori Algorithm—An Example
(Figure: step-by-step run on database TDB; the 1st scan counts candidate 1-itemsets, then the algorithm alternates candidate generation and counting)
Apriori Algorithm
C1 = itemsets of size one in I;
Determine all large itemsets of size 1, L1;
i = 1;
repeat
    i = i + 1;
    Ci = Apriori-Gen(Li−1);
    Count Ci to determine Li;
until no more large itemsets found;
Apriori Algorithm
● Method (see the sketch below):
• Let k = 1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified:
• Generate length (k+1) candidate itemsets from length-k frequent itemsets
• Prune candidate itemsets containing subsets of length k that are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that are frequent
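A compact, illustrative Python sketch of this method (a simplified Apriori-Gen join/prune; function names are ours, and the data is the 5-transaction table from earlier):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support count} for all frequent itemsets."""
    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    # L1: frequent 1-itemsets
    singles = {frozenset({i}) for t in transactions for i in t}
    Lk = {c: s for c, s in count(singles).items() if s >= minsup_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets...
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # ...then prune candidates with an infrequent (k-1)-subset
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One DB scan per level to count the surviving candidates
        Lk = {c: s for c, s in count(cands).items() if s >= minsup_count}
        frequent.update(Lk)
    return frequent

transactions = [frozenset(t) for t in [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]]
for itemset, s in sorted(apriori(transactions, 3).items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), s)
```

With minsup count 3 this reproduces the pruning illustrated above: Coke and Eggs drop out at level 1, only {Bread, Milk, Diaper} survives at level 3.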
● Problems with support and confidence: a rule may have high support and high confidence simply because it is an obvious rule. Someone who buys potato chips is highly likely to buy a soft drink → not surprising.
● Confidence totally ignores P(B)
• P(A, B) = σ(A ∪ B) / |T|
Interestingness Measure: Correlations (Lift)
● Lift(Wipes, Milk ⇒ Diapers) is 22.73
● People who purchased Wipes & Milk are 22.73 times more likely to also purchase Diapers than people who do not purchase Wipes & Milk.
● Measure of dependent/correlated events: lift
lift(A, B) = P(A ∪ B) / (P(A) P(B))
(Han & Kamber)
lift > 1: positively correlated; lift = 1: independent
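Checking the 22.73 figure against the wipes/milk/diapers counts from the earlier example (a minimal sketch with the counts hardcoded):

```python
# Counts from the wipes/milk/diapers example
total = 500_000
diapers = 20_000          # sigma(Diapers)
wipes_milk = 220          # sigma(Wipes, Milk)
wipes_milk_diapers = 200  # sigma(Wipes, Milk, Diapers)

# lift(A, B) = P(A u B) / (P(A) * P(B)), with A = {Wipes, Milk}, B = {Diapers}
lift = (wipes_milk_diapers / total) / ((wipes_milk / total) * (diapers / total))
print(round(lift, 2))  # 22.73
```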
● play basketball ⇒ eat cereal [40%, 66.7%] is misleading
• The overall % of students eating cereal is 75% > 66.7%.
● play basketball ⇒ not eat cereal [20%, 33.3%] is more
accurate, although with lower support and confidence
              Basketball   Not basketball   Sum (row)
Cereal            2000           1750          3750
Not cereal        1000            250          1250
Sum (col)         3000           2000          5000
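Lift makes this quantitative. A minimal sketch using the counts from the table above:

```python
# Counts from the basketball/cereal contingency table
total = 5000
basketball = 3000
cereal = 3750
both = 2000

lift = (both / total) / ((basketball / total) * (cereal / total))
print(round(lift, 2))  # 0.89 < 1: basketball and cereal are negatively correlated
```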
Conviction = 1: independent (not related)
Example
(2×2 contingency table: osteoporosis vs. no osteoporosis; the surviving counts are 76 and 38 with no osteoporosis)
● Degrees of freedom (df) = (2−1)(2−1) = 1
● Significance level α = 0.01
● χ² = 6.63 (tabulated)
● Null hypothesis: A & B are not correlated
● Calculate expected values
● Calculate the χ² value
● Compare calculated & tabulated values (see the sketch below)
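A worked sketch of these steps in Python, applied to the basketball/cereal contingency table shown earlier:

```python
# Observed 2x2 counts: rows = cereal / not cereal,
# columns = basketball / not basketball
obs = [[2000, 1750],
       [1000,  250]]

row = [sum(r) for r in obs]             # [3750, 1250]
col = [sum(c) for c in zip(*obs)]       # [3000, 2000]
total = sum(row)                        # 5000

# Expected counts under independence: E_ij = row_i * col_j / total
exp = [[r * c / total for c in col] for r in row]

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(obs, exp)
           for o, e in zip(orow, erow))

# df = (2-1)(2-1) = 1; tabulated chi2 at alpha = 0.01 is 6.63
print(round(chi2, 2), "> 6.63 -> reject the null hypothesis: correlated")
```

The statistic comes out near 277.8, far above the tabulated 6.63, so the null hypothesis of independence is rejected: playing basketball and eating cereal are (negatively) correlated.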