Association Rules Outline
Goal: Provide an overview of basic Association Rule mining techniques
● Association Rules Problem Overview
• Large itemsets
● Association Rules Algorithms
• Apriori
• Sampling
• Partitioning
• Parallel Algorithms
● Comparing Techniques
● Incremental Algorithms
● Advanced AR Techniques
Market basket analysis
Example: Market Basket Data
● Items frequently purchased together:
Bread ⇒ PeanutButter
● Uses:
• Placement
• Advertising
• Sales
• Coupons
● Objective: increase sales and reduce costs
Association Rule Definitions
● Set of items: I = {I1, I2, …, Im}
● Transactions: D = {t1, t2, …, tn}, tj ⊆ I
● Itemset: {Ii1, Ii2, …, Iik} ⊆ I
● Support of an itemset: Percentage of
transactions which contain that itemset.
● Large (Frequent) itemset: Itemset whose
number of occurrences is above a threshold.
Association Rules Example
I = { Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
Association Rule Definitions
● Association Rule (AR): implication X ⇒ Y
where X, Y ⊆ I and X ∩ Y = ∅
● Support of AR (s) X ⇒ Y: Percentage of transactions that contain X ∪ Y.
• P(X, Y) = σ(X ∪ Y) / |T|
● Confidence of AR (α) X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X.
• P(Y|X) = σ(X ∪ Y) / σ(X)
Association Rules Ex (cont’d)
Example
● 500,000 transactions
● 20,000 transactions contain diapers
● 30,000 transactions contain milk
● 10,000 transactions contain both diapers & milk
(Venn diagram, counts in thousands: milk only = 20, diapers only = 10, both = 10, neither = 460)
Diapers ⇒ Milk: confidence = 0.50, support = 0.02
Milk ⇒ Diapers: confidence = 0.33, support = 0.02
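The arithmetic behind these figures, as a minimal Python sketch (counts hardcoded from this example; the variable names are ours):

```python
# Counts from the example above
total = 500_000     # all transactions
diapers = 20_000    # transactions containing diapers
milk = 30_000       # transactions containing milk
both = 10_000       # transactions containing diapers and milk

# Support of X => Y is P(X, Y) = sigma(X u Y) / |T|
support = both / total                # 0.02 for both rules

# Confidence of X => Y is P(Y | X) = sigma(X u Y) / sigma(X)
conf_diapers_milk = both / diapers    # 0.50 for Diapers => Milk
conf_milk_diapers = both / milk       # ~0.33 for Milk => Diapers

print(support, conf_diapers_milk, conf_milk_diapers)
```

Note how the two rules share the same support but differ in confidence because the antecedent counts differ.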
Example
● 10,000 transactions contain wipes
● 8,000 transactions contain wipes & diapers
● 220 transactions contain wipes & milk
● 200 transactions contain wipes & diapers & milk
(Venn diagram of milk, diapers, and wipes)
Association Rule Problem
● Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
● Link Analysis
● NOTE: The support of X ⇒ Y is the same as the support of X ∪ Y.
Association Rule Techniques
1. Find Large Itemsets.
2. Generate rules from frequent itemsets.
Association Rule Mining
● Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Cocoa}
{Milk, Bread} → {Eggs, Coke}
{Cocoa, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
● Itemset
• A collection of one or more items
• Example: {Milk, Bread, Diaper}
• k-itemset
• An itemset that contains k items
● Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread, Diaper}) = 2
● Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
Tan, Steinbach, Kumar
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
● Association Rule
• An implication expression of the form X → Y, where X and Y are itemsets
• Example: {Milk, Diaper} → {Cocoa}
● Rule Evaluation Metrics
• Support (s)
• Fraction of transactions that contain both X and Y
• Confidence (c)
• Measures how often items in Y appear in transactions that contain X
• For the example: s = σ({Milk, Diaper, Cocoa}) / |T| = 2/5 = 0.4, c = σ({Milk, Diaper, Cocoa}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
Tan, Steinbach, Kumar
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Association Rule Mining Task
● Given a set of transactions T, the goal of
association rule mining is to find all rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold
● Brute-force approach (see the sketch below):
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
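A brute-force sketch of this approach in Python (illustrative only; it enumerates rules over the 5-transaction market-basket table used throughout these slides, with hypothetical minsup/minconf values):

```python
from itertools import combinations

# The 5-transaction market-basket table from the slides
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))
N = len(transactions)

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

minsup, minconf = 0.4, 0.6  # hypothetical thresholds

# Enumerate every itemset, then every binary partition X -> Y of it
for k in range(2, len(items) + 1):
    for itemset in map(set, combinations(items, k)):
        s = sigma(itemset) / N
        if s < minsup:
            continue  # rule support = itemset support, so prune here
        for i in range(1, k):
            for X in map(set, combinations(sorted(itemset), i)):
                c = sigma(itemset) / sigma(X)
                if c >= minconf:
                    print(sorted(X), "->", sorted(itemset - X),
                          f"(s={s:.2f}, c={c:.2f})")
```

Even on six items this scans the database for every candidate, which is exactly the cost the rest of the lecture works to avoid.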
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Cocoa} (s=0.4, c=0.67)
{Milk, Cocoa} → {Diaper} (s=0.4, c=1.0)
{Diaper, Cocoa} → {Milk} (s=0.4, c=0.67)
{Cocoa} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Cocoa} (s=0.4, c=0.5)
{Milk} → {Diaper, Cocoa} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Cocoa}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Tan, Steinbach, Kumar
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Mining Association Rules
● Two-step approach:
– Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
– Rule Generation (see the sketch after this list)
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
● Frequent itemset generation is still
computationally expensive
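A sketch of the rule-generation step in Python (it assumes the frequent itemsets and their support counts are already available, e.g. from the Apriori algorithm described later; function and variable names are ours):

```python
from itertools import combinations

def generate_rules(freq_counts, n_transactions, minconf):
    """Yield (X, Y, support, confidence) from frequent itemsets.

    freq_counts: dict {frozenset: support count}, assumed to contain
    every frequent itemset together with all of its subsets.
    """
    for itemset, count in freq_counts.items():
        if len(itemset) < 2:
            continue
        s = count / n_transactions
        # Each rule is a binary partition X -> (itemset - X)
        for i in range(1, len(itemset)):
            for X in map(frozenset, combinations(itemset, i)):
                c = count / freq_counts[X]
                if c >= minconf:
                    yield X, itemset - X, s, c

# Support counts taken from the 5-transaction table above
counts = {
    frozenset({"Milk"}): 4, frozenset({"Diaper"}): 4, frozenset({"Cocoa"}): 3,
    frozenset({"Milk", "Diaper"}): 3, frozenset({"Milk", "Cocoa"}): 2,
    frozenset({"Diaper", "Cocoa"}): 3,
    frozenset({"Milk", "Diaper", "Cocoa"}): 2,
}
for X, Y, s, c in generate_rules(counts, 5, minconf=0.6):
    print(sorted(X), "->", sorted(Y), f"(s={s:.2f}, c={c:.2f})")
```

Because every rule from the same itemset shares that itemset's support, the support test happens once per itemset and only confidence varies across its partitions.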
Frequent Itemset Generation
(Itemset lattice over d = 5 items A–E: null at the top; 1-itemsets A … E; 2-itemsets AB … DE; 3-itemsets ABC … CDE; 4-itemsets ABCD … BCDE; ABCDE at the bottom)
Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
● Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Match each transaction against every candidate
• Complexity ~ O(NMw): N transactions, M candidates, w = maximum transaction width
• Expensive since M = 2^d !!! (2^d − 1 non-empty candidates)
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
Tan, Steinbach, Kumar
Computational Complexity
● Given d unique items:
• Total number of itemsets = 2^d
• Total number of possible association rules (checked in the sketch below):
R = 3^d − 2^(d+1) + 1
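A quick brute-force check of this closed form in Python (illustrative; it enumerates every rule X → Y with non-empty, disjoint X and Y):

```python
from itertools import combinations

def rule_count(d):
    """Brute-force count of rules X -> Y over d items,
    with X and Y non-empty and disjoint (X u Y is the rule's itemset)."""
    count = 0
    for i in range(2, d + 1):                    # size of X u Y
        for itemset in combinations(range(d), i):
            for j in range(1, i):                # size of X
                count += len(list(combinations(itemset, j)))
    return count

for d in range(2, 7):
    assert rule_count(d) == 3**d - 2**(d + 1) + 1
print(rule_count(6))  # 602 rules for d = 6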
Frequent Itemset Generation Strategies
● Reduce the number of candidates (M)
• Complete search: M = 2^d
• Use pruning techniques to reduce M
● Reduce the number of transactions (N)
• Reduce size of N as the size of itemset increases
• Used by DHP and vertical-based mining algorithms
● Reduce the number of comparisons (NM)
• Use efficient data structures to store the candidates or transactions
• No need to match every candidate against every transaction
Apriori
● Large Itemset Property:
Any subset of a large itemset is large.
● Contrapositive:
If an itemset is not large,
none of its supersets are large.
Reducing Number of Candidates
● Apriori principle:
• If an itemset is frequent, then all of its subsets must also
be frequent
● Apriori principle holds due to the following property
of the support measure:
• Support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Large Itemset Property
Apriori Ex (cont’d)
s = 30%, α = 50%
Illustrating Apriori Principle
(Itemset lattice over items A–E, shown twice: in the first, an itemset, e.g. {A, B}, is found to be infrequent; in the second, all of its supersets have been pruned)
Illustrating Apriori Principle
Minimum Support = 3
Items (1-itemsets):
Item Count
Bread 4
Cocoa 3
Milk 4
Coke 2
Diaper 4
Eggs 1
(No need to generate candidates involving Coke or Eggs)
Pairs (2-itemsets):
Itemset Count
{Bread, Milk} 3
{Bread, Cocoa} 2
{Bread, Diaper} 3
{Milk, Cocoa} 2
{Milk, Diaper} 3
{Cocoa, Diaper} 3
Triplets (3-itemsets):
Itemset Count
{Bread, Milk, Diaper} 3
If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning: 6 + 6 + 1 = 13
Tan, Steinbach, Kumar
TID Items
1 Bread, Milk
2 Bread, Diaper, Cocoa, Eggs
3 Milk, Diaper, Cocoa, Coke
4 Bread, Milk, Diaper, Cocoa
5 Bread, Milk, Diaper, Coke
The Apriori Algorithm—An Example
(Figure: step-by-step run on database TDB; the 1st scan counts candidate 1-itemsets, then the algorithm alternates candidate generation and counting)
Apriori Algorithm
C1 = itemsets of size one in I;
Determine all large itemsets of size 1, L1;
i = 1;
repeat
    i = i + 1;
    Ci = Apriori-Gen(Li−1);
    Count Ci to determine Li;
until no more large itemsets found;
Apriori Algorithm
● Method (see the sketch below):
• Let k = 1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified:
• Generate length (k+1) candidate itemsets from length-k frequent itemsets
• Prune candidate itemsets containing subsets of length k that are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that are frequent
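A compact, illustrative Python sketch of this method (a simplified Apriori-Gen join/prune; function names are ours, and the data is the 5-transaction table from earlier):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support count} for all frequent itemsets."""
    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    # L1: frequent 1-itemsets
    singles = {frozenset({i}) for t in transactions for i in t}
    Lk = {c: s for c, s in count(singles).items() if s >= minsup_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets...
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # ...then prune candidates with an infrequent (k-1)-subset
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One DB scan per level to count the surviving candidates
        Lk = {c: s for c, s in count(cands).items() if s >= minsup_count}
        frequent.update(Lk)
    return frequent

transactions = [frozenset(t) for t in [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]]
for itemset, s in sorted(apriori(transactions, 3).items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), s)
```

With minsup count 3 this reproduces the pruning illustrated above: Coke and Eggs drop out at level 1, only {Bread, Milk, Diaper} survives at level 3.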
● Problems with support and confidence: a rule may have high support and high confidence simply because it is an obvious rule. Someone who buys potato chips is highly likely to buy a soft drink → not surprising.
● Confidence totally ignores P(B)
• P(A, B) = σ(A ∪ B) / |T|
Interestingness Measure: Correlations (Lift)
● Lift(Wipes, Milk ⇒ Diapers) is 22.73
● People who purchased Wipes & Milk are 22.73 times more likely to also purchase Diapers than people who do not purchase Wipes & Milk.
● Measure of dependent/correlated events: lift
lift(A, B) = P(A ∪ B) / (P(A) P(B))
(Han & Kamber)
lift > 1: positively correlated; lift = 1: independent
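Checking the 22.73 figure against the wipes/milk/diapers counts from the earlier example (a minimal sketch with the counts hardcoded):

```python
# Counts from the wipes/milk/diapers example
total = 500_000
diapers = 20_000          # sigma(Diapers)
wipes_milk = 220          # sigma(Wipes, Milk)
wipes_milk_diapers = 200  # sigma(Wipes, Milk, Diapers)

# lift(A, B) = P(A u B) / (P(A) * P(B)), with A = {Wipes, Milk}, B = {Diapers}
lift = (wipes_milk_diapers / total) / ((wipes_milk / total) * (diapers / total))
print(round(lift, 2))  # 22.73
```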
● play basketball ⇒ eat cereal [40%, 66.7%] is misleading
• The overall % of students eating cereal is 75% > 66.7%.
● play basketball ⇒ not eat cereal [20%, 33.3%] is more
accurate, although with lower support and confidence
              Basketball   Not basketball   Sum (row)
Cereal            2000           1750          3750
Not cereal        1000            250          1250
Sum (col)         3000           2000          5000
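Lift makes this quantitative. A minimal sketch using the counts from the table above:

```python
# Counts from the basketball/cereal contingency table
total = 5000
basketball = 3000
cereal = 3750
both = 2000

lift = (both / total) / ((basketball / total) * (cereal / total))
print(round(lift, 2))  # 0.89 < 1: basketball and cereal are negatively correlated
```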
Conviction = 1: independent (not related)
Example
(2×2 contingency table: osteoporosis vs. no osteoporosis; the surviving counts are 76 and 38 with no osteoporosis)
● Degrees of freedom (df) = (2−1)(2−1) = 1
● Significance level α = 0.01
● χ² = 6.63 (tabulated)
● Null hypothesis: A & B are not correlated
● Calculate expected values
● Calculate the χ² value
● Compare calculated & tabulated values (see the sketch below)
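A worked sketch of these steps in Python, applied to the basketball/cereal contingency table shown earlier:

```python
# Observed 2x2 counts: rows = cereal / not cereal,
# columns = basketball / not basketball
obs = [[2000, 1750],
       [1000,  250]]

row = [sum(r) for r in obs]             # [3750, 1250]
col = [sum(c) for c in zip(*obs)]       # [3000, 2000]
total = sum(row)                        # 5000

# Expected counts under independence: E_ij = row_i * col_j / total
exp = [[r * c / total for c in col] for r in row]

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(obs, exp)
           for o, e in zip(orow, erow))

# df = (2-1)(2-1) = 1; tabulated chi2 at alpha = 0.01 is 6.63
print(round(chi2, 2), "> 6.63 -> reject the null hypothesis: correlated")
```

The statistic comes out near 277.8, far above the tabulated 6.63, so the null hypothesis of independence is rejected: playing basketball and eating cereal are (negatively) correlated.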