Association Rules
Lecture 4/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia
Objectives
- Introduction
- What is Association Mining?
- Mining Association Rules
- Algorithms for Association Rules Mining
Introduction
- You sell more if customers can see the product.
- Customers that purchase one type of product are likely to be interested in other particular products.
- Market-basket analysis = studying the composition of the shopping basket of products purchased during a single shopping event.
- Market-basket data = the transactional list of purchases by customer. It is challenging, because of:
  - Very large number of records (often millions of transactions/day)
  - Sparseness (each market basket contains only a small portion of items carried)
  - Heterogeneity (those with different tastes tend to purchase a specific subset of items)
Introduction (2)
- Product presentations can be more intelligently planned for specific times of day, days of the week, or holidays.
- Can also involve sequential relationships.
- Market-basket analysis is an undirected (along with clustering) DM operation, seeking patterns that were previously unknown.
- Cross-selling
  - The propensity for the purchaser of a specific item to purchase a different item.
  - Can be maximized by locating those products that tend to be purchased by the same consumer in places where both can be seen.
What is Association Mining?
- Association rule mining (ARM): finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Frequent pattern: a pattern that occurs frequently in a database.
- Motivation: finding regularities in data
  - What products were often purchased together? — Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
Why is Frequent Pattern or Association Mining an Essential Task in DM?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sale campaign analysis
What is Association Mining? (2)
- Examples:
  - Rule form: "A → B [support, confidence]"
  - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
- A support of 0.5% for an association rule means that 0.5% of all the transactions show that diapers and beers are purchased together.
- A confidence of 60% means that 60% of the customers who purchased diapers also bought beers.
- Rules that satisfy both a minimum support and a minimum confidence threshold are called strong.
What is Association Mining? (3)
- A set of items is referred to as an itemset.
- An itemset that contains k items is a k-itemset.
  - {beer, diaper} is a 2-itemset.
- If an itemset satisfies minimum support, then it is a frequent itemset.
- The set of frequent k-itemsets is commonly denoted by Lk.
- ARM is a two-step process:
  - Find all frequent itemsets.
  - Generate strong association rules from the frequent itemsets.
- The second step is the easier of the two; the overall performance of mining association rules is determined by the first step.
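The second step, generating strong rules from frequent itemsets, can be sketched as follows (illustrative code, not from the lecture; the support values are back-computed from the slides' diaper/beer example, where support({diaper, beer}) = 0.5% and confidence(diaper → beer) = 60% imply support({diaper}) ≈ 0.833%, and support({beer}) = 1% is assumed):

```python
from itertools import combinations

def generate_rules(support, min_conf):
    """support: dict mapping frozenset(itemset) -> support fraction.
    Assumes every frequent subset is present (Apriori guarantees this)."""
    rules = []
    for itemset in support:
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as an antecedent.
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = support[itemset] / support[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

support = {frozenset(["diaper"]): 0.00833,
           frozenset(["beer"]): 0.01,
           frozenset(["diaper", "beer"]): 0.005}
for lhs, rhs, conf in generate_rules(support, min_conf=0.5):
    print(set(lhs), "->", set(rhs), round(conf, 2))
```

With min_conf = 50%, both diaper → beer (≈60%) and beer → diaper (50%) qualify as strong rules.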
Association Mining: Research Timeline
Mining association rules (Agrawal et al., SIGMOD '93)
- Better algorithms:
  - Fast algorithm (Agrawal et al., VLDB '94)
  - Partitioning (Navathe et al., VLDB '95)
  - Hash-based (Park et al., SIGMOD '95)
  - Direct Itemset Counting (Brin et al., SIGMOD '97)
  - Parallel mining (Agrawal et al., TKDE '96)
  - Distributed mining (Cheung et al., PDIS '96)
  - Incremental mining (Cheung et al., ICDE '96)
- Problem extensions:
  - Generalized A.R. (Srikant et al.; Han et al., VLDB '95)
  - Quantitative A.R. (Srikant et al., SIGMOD '96)
  - N-dimensional A.R. (Lu et al., DMKD '98)
  - Meta-rule-guided mining
Many Kinds of Association Rules
- Boolean association rule:
  - A rule concerning associations between the presence or absence of items.
  - Example: buys(x, "diapers") → buys(x, "beers") [0.5%, 60%] (R1)
- Quantitative association rule:
  - Describes associations between quantitative items. Quantitative values for items are partitioned into intervals.
  - Example: age(X, "30-39") ∧ income(X, "42K-48K") → buys(X, "LCD TV") (R2)
  - Age and income have been discretized.
- Single-dimensional association rule: R1
Many Kinds of Association Rules (2)
- Single-level association rule
  - Example: age(X, "30-39") → buys(X, "laptop computer")
- Multilevel association rule
  - Example: age(X, "30-39") → buys(X, "computer")
  - Computer is a higher-level abstraction of laptop.
- Various extensions
  - Mining maximal frequent patterns.
    - If p is a maximal frequent pattern, then any superpattern of p is not frequent.
    - Used to substantially reduce the number of frequent itemsets generated in mining.
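The definition of maximal frequent patterns can be stated compactly in code (a sketch, not from the slides): a frequent itemset is maximal exactly when no frequent proper superset of it exists.

```python
def maximal_patterns(frequent_itemsets):
    # Keep only the frequent itemsets that have no frequent proper superset.
    frequent = [frozenset(s) for s in frequent_itemsets]
    return [s for s in frequent
            if not any(s < other for other in frequent)]

freq = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
print(maximal_patterns(freq))  # only {a, b, c} is maximal
```

Here the six frequent itemsets collapse to a single maximal one, which is how reporting only maximal patterns shrinks the mining output.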
Mining Single-Dimensional Boolean Association Rules
- Given: a database of customer transactions
  - Each transaction is a list of items (purchased by a customer in a visit)
- Find all rules that correlate the presence of one set of items with that of another set of items
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done
  - Any number of items may appear in the consequent/antecedent of a rule
Application Examples
- Market-basket analysis
  - * → Fanta: what the store should do to boost Fanta sales
  - Bodrex → *: what other products the store should stock up on if it has a sale on Bodrex
- Attached mailing in direct marketing
Rule Measures: Support and Confidence
(Venn diagram: customers who buy diaper, customers who buy beer, and customers who buy both.)
- Find all the rules X ∧ Y ⇒ Z with minimum confidence and support
  - support, s: probability that a transaction contains {X, Y, Z}
  - confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support = 50% and minimum confidence = 50%; then we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
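The numbers above for A ⇒ C and C ⇒ A can be checked with a short sketch (illustrative code, not from the lecture):

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # Conditional probability: support(lhs ∪ rhs) / support(lhs).
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support(db, {"A", "C"}))       # 0.5  -> 50%
print(confidence(db, {"A"}, {"C"}))  # 0.666... -> 66.6%
print(confidence(db, {"C"}, {"A"}))  # 1.0  -> 100%
```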
Mining Association Rules: Example

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support 50%, min. confidence 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.
Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} should be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
The Apriori Algorithm

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
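The pseudocode above can be turned into a compact runnable sketch (assumptions: min_support is an absolute count, and candidate generation joins Lk with itself followed by subset pruning via the Apriori principle):

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items, keep those meeting min_support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step (Apriori principle): drop candidates with an
        # infrequent k-subset before scanning the database.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count step: one database scan per level.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent.update((c, counts[c]) for c in Lk)
        k += 1
    return frequent

# The slides' four-transaction database; min_support = 2 is 50% of 4.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = apriori(db, min_support=2)
```

On this database the result reproduces the earlier example: {A} (3), {B} (2), {C} (2), and {A, C} (2).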
The Apriori Algorithm: Example 1
(Figure: a trace of Apriori on a small database D. Each pass scans D to count a candidate set Ck and derive the frequent set Lk; the trace ends with the 3-itemset {2, 3, 5}.)
The Apriori Algorithm: Example 2
(Figure: a five-transaction database — Tid 1: {3, 4, 5, 6, 7, 9}, Tid 2: {1, 3, 4, 5, 13}, Tid 3: {1, 2, 4, 5, 7, 11}, Tid 4: {1, 3, 4, 8}, plus one more — with the frequent patterns at MinSupport = 40%, e.g. item 1 with support 80% (4 tids), and the association rules at MinConf = 70%, e.g. 7 ⇒ 5 with support 40% (2) and confidence 100%.)
Major Drawbacks of Apriori
- Apriori has to read the database many times to test the support of candidate patterns. To find a frequent pattern X of length 50, it has to traverse the database 50 times.
- On dense datasets with long patterns, as the length of the pattern increases, the performance of Apriori drops rapidly due to the explosion of candidate patterns.
TreeProjection Algorithm (Agarwal, Aggarwal & Prasad 2000)
(Figure: a lexicographical tree of itemsets over items 7, 5, 1, 3, 4, levels 0-3, with transactions projected onto each node, and a triangular matrix for counting frequent patterns of length two.)
Eclat Algorithm (Zaki et al. 1997)
(Figure: the database in vertical layout; the support of an itemset is obtained by intersecting the tid-lists of its items.)
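Eclat's tid-list intersection idea can be sketched as follows (illustrative code, not the lecture's): the database is stored vertically as item → tid-list, the support of an itemset is the size of the intersection of its items' tid-lists, and the search proceeds depth-first.

```python
def vertical(transactions):
    # Convert horizontal transactions (tid -> items) to item -> tid-list.
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def eclat(prefix, items, min_support, out):
    # items: list of (item, tid-list of prefix + [item]) pairs.
    for i, (item, tids) in enumerate(items):
        if len(tids) >= min_support:
            out[frozenset(prefix + [item])] = len(tids)
            # Tid-list intersection extends the prefix by one item.
            suffix = [(j, tids & t) for j, t in items[i + 1:]]
            eclat(prefix + [item], suffix, min_support, out)
    return out

# Same four-transaction example database as before, keyed by tid.
db = {1: {"A", "B", "C"}, 2: {"A", "C"}, 3: {"A", "D"}, 4: {"B", "E", "F"}}
result = eclat([], sorted(vertical(db).items()), min_support=2, out={})
```

With min_support = 2 this finds the same frequent itemsets as Apriori on that database, but each support comes from a set intersection rather than a database scan.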
FP-Growth (Han, Pei & Yin 2000)
(Figure: an FP-tree with root {}, node-links from a header table, and nodes labeled item:count, e.g. 1:3, 5:1, 7:1.)
CT-PRO (Sucahyo & Gopalan 2004)
(Figure: (c) the Global CFP-Tree, a compressed FP-tree data structure.)
Mining Very Large Databases
- Partition algorithm (Savasere, Omiecinski & Navathe 1995)
- Projection (Pei 2002)
  (Figure: a sample transaction {3, 7, 5, 1, 4} distributed into projected databases — Projection 7, Projection 5, Projection 3, Projection 1 — under (a) parallel projection and (b) partition projection.)
Presentation of Association Rules (Table Form)

Visualization of Association Rules Using Rule Graph

Visualization of Association Rules Using Plane Graph
Conclusion
- Association rule mining
  - Probably the most significant contribution from the database community to KDD
  - A large number of papers have been published
  - Many interesting issues have been explored
- An interesting research direction
  - Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
References
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
- David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007.
- Agarwal, R. C., Aggarwal, C. C. & Prasad, V. V. V. 2001, 'A Tree Projection Algorithm for Generation of Frequent Item Sets', Journal of Parallel and Distributed Computing (Special Issue on High-Performance Data Mining), vol. 61, no. 3, pp. 350-371.
- Han, J., Pei, J. & Yin, Y. 2000, 'Mining Frequent Patterns without Candidate Generation', in Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, pp. 1-12.
- Savasere, A., Omiecinski, E. & Navathe, S. 1995, 'An Efficient Algorithm for Mining Association Rules in Large Databases', in Proceedings of the 21st VLDB Conference.
References (2)
- Pei, J. 2002, Pattern-growth Methods for Frequent Pattern Mining, PhD Thesis, Simon Fraser University, Canada.
- Zaki, M. J. 1997, 'Parallel Algorithms for Fast Discovery of Association Rules', Data Mining and Knowledge Discovery: An International Journal, vol. 1, no. 4, pp. 343-373.
- Sucahyo, Y. G. & Gopalan, R. P. 2004, 'CT-PRO: A Bottom-Up Non-Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure', in Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Brighton, UK.