Association Rules
Lecture 4/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia
Objectives
- Introduction
- What is Association Mining?
- Mining Association Rules
- Algorithms for Association Rules Mining
Introduction
- You sell more if customers can see the product.
- Customers that purchase one type of product are likely to be interested in other particular products.
- Market-basket analysis = studying the composition of the shopping basket of products purchased during a single shopping event.
- Market-basket data = the transactional list of purchases by customer. It is challenging, because of:
  - Very large number of records (often millions of transactions/day)
  - Sparseness (each market basket contains only a small portion of items carried)
  - Heterogeneity (those with different tastes tend to purchase a specific subset of items)
Introduction (2)
- Product presentations can be more intelligently planned for specific times of day, days of the week, or holidays.
- Can also involve sequential relationships.
- Market-basket analysis is an undirected (along with clustering) DM operation, seeking patterns that were previously unknown.
- Cross-selling
  - The propensity for the purchaser of a specific item to purchase a different item.
  - Can be maximized by locating those products that tend to be purchased by the same consumer in places where both can be seen.
What is Association Mining?
- Association rule mining (ARM): finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Frequent pattern: a pattern that occurs frequently in a database.
- Motivation: finding regularities in data
  - What products were often purchased together? — Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
Why is Frequent Pattern or Association Mining an Essential Task in DM?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sale campaign analysis
What is Association Mining? (2)
- Examples:
  - Rule form: "A → B [support, confidence]"
  - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
- A support of 0.5% for an association rule means that 0.5% of all the transactions show that diapers and beers are purchased together.
- A confidence of 60% means that 60% of the customers who purchased diapers also bought beers.
- Rules that satisfy both a minimum support and a minimum confidence threshold are called strong.
What is Association Mining? (3)
- A set of items is referred to as an itemset.
- An itemset that contains k items is a k-itemset.
  - {beer, diaper} is a 2-itemset.
- If an itemset satisfies minimum support, then it is a frequent itemset.
- The set of frequent k-itemsets is commonly denoted by Lk.
- ARM is a two-step process:
  - Find all frequent itemsets.
  - Generate strong association rules from the frequent itemsets.
- The second step is the easier of the two; the overall performance of mining association rules is determined by the first step.
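The second step, generating strong rules from frequent itemsets, can be sketched as follows (illustrative code, not from the lecture; the support values are back-computed from the slides' diaper/beer example, where support({diaper, beer}) = 0.5% and confidence(diaper → beer) = 60% imply support({diaper}) ≈ 0.833%, and support({beer}) = 1% is assumed):

```python
from itertools import combinations

def generate_rules(support, min_conf):
    """support: dict mapping frozenset(itemset) -> support fraction.
    Assumes every frequent subset is present (Apriori guarantees this)."""
    rules = []
    for itemset in support:
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as an antecedent.
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = support[itemset] / support[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

support = {frozenset(["diaper"]): 0.00833,
           frozenset(["beer"]): 0.01,
           frozenset(["diaper", "beer"]): 0.005}
for lhs, rhs, conf in generate_rules(support, min_conf=0.5):
    print(set(lhs), "->", set(rhs), round(conf, 2))
```

With min_conf = 50%, both diaper → beer (≈60%) and beer → diaper (50%) qualify as strong rules.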
Association Mining: Research Timeline
Mining association rules (Agrawal et al., SIGMOD '93)
- Better algorithms:
  - Fast algorithm (Agrawal et al., VLDB '94)
  - Partitioning (Navathe et al., VLDB '95)
  - Hash-based (Park et al., SIGMOD '95)
  - Direct Itemset Counting (Brin et al., SIGMOD '97)
  - Parallel mining (Agrawal et al., TKDE '96)
  - Distributed mining (Cheung et al., PDIS '96)
  - Incremental mining (Cheung et al., ICDE '96)
- Problem extensions:
  - Generalized A.R. (Srikant et al.; Han et al., VLDB '95)
  - Quantitative A.R. (Srikant et al., SIGMOD '96)
  - N-dimensional A.R. (Lu et al., DMKD '98)
  - Meta-rule-guided mining
Many Kinds of Association Rules
- Boolean association rule:
  - A rule concerning associations between the presence or absence of items.
  - Example: buys(x, "diapers") → buys(x, "beers") [0.5%, 60%] (R1)
- Quantitative association rule:
  - Describes associations between quantitative items. Quantitative values for items are partitioned into intervals.
  - Example: age(X, "30-39") ∧ income(X, "42K-48K") → buys(X, "LCD TV") (R2)
  - Age and income have been discretized.
- Single-dimensional association rule: R1
Many Kinds of Association Rules (2)
- Single-level association rule
  - Example: age(X, "30-39") → buys(X, "laptop computer")
- Multilevel association rule
  - Example: age(X, "30-39") → buys(X, "computer")
  - Computer is a higher-level abstraction of laptop.
- Various extensions
  - Mining maximal frequent patterns.
    - If p is a maximal frequent pattern, then any superpattern of p is not frequent.
    - Used to substantially reduce the number of frequent itemsets generated in mining.
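The definition of maximal frequent patterns can be stated compactly in code (a sketch, not from the slides): a frequent itemset is maximal exactly when no frequent proper superset of it exists.

```python
def maximal_patterns(frequent_itemsets):
    # Keep only the frequent itemsets that have no frequent proper superset.
    frequent = [frozenset(s) for s in frequent_itemsets]
    return [s for s in frequent
            if not any(s < other for other in frequent)]

freq = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
print(maximal_patterns(freq))  # only {a, b, c} is maximal
```

Here the six frequent itemsets collapse to a single maximal one, which is how reporting only maximal patterns shrinks the mining output.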
Mining Single-Dimensional Boolean Association Rules
- Given: a database of customer transactions
  - Each transaction is a list of items (purchased by a customer in a visit)
- Find all rules that correlate the presence of one set of items with that of another set of items
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done
  - Any number of items may appear in the consequent/antecedent of a rule
Application Examples
- Market-basket analysis
  - * → Fanta: what the store should do to boost Fanta sales
  - Bodrex → *: what other products the store should stock up on if it has a sale on Bodrex
- Attached mailing in direct marketing
Rule Measures: Support and Confidence
(Venn diagram: customers who buy diaper, customers who buy beer, and customers who buy both.)
- Find all the rules X ∧ Y ⇒ Z with minimum confidence and support
  - support, s: probability that a transaction contains {X, Y, Z}
  - confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support = 50% and minimum confidence = 50%; then we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
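The numbers above for A ⇒ C and C ⇒ A can be checked with a short sketch (illustrative code, not from the lecture):

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # Conditional probability: support(lhs ∪ rhs) / support(lhs).
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support(db, {"A", "C"}))       # 0.5  -> 50%
print(confidence(db, {"A"}, {"C"}))  # 0.666... -> 66.6%
print(confidence(db, {"C"}, {"A"}))  # 1.0  -> 100%
```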
Mining Association Rules: Example

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support 50%, min. confidence 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.
Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} should be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
The Apriori Algorithm

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
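The pseudocode above can be turned into a compact runnable sketch (assumptions: min_support is an absolute count, and candidate generation joins Lk with itself followed by subset pruning via the Apriori principle):

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items, keep those meeting min_support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step (Apriori principle): drop candidates with an
        # infrequent k-subset before scanning the database.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count step: one database scan per level.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent.update((c, counts[c]) for c in Lk)
        k += 1
    return frequent

# The slides' four-transaction database; min_support = 2 is 50% of 4.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = apriori(db, min_support=2)
```

On this database the result reproduces the earlier example: {A} (3), {B} (2), {C} (2), and {A, C} (2).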
The Apriori Algorithm: Example 1
(Figure: a trace of Apriori on a small database D. Each pass scans D to count a candidate set Ck and derive the frequent set Lk; the trace ends with the 3-itemset {2, 3, 5}.)
The Apriori Algorithm: Example 2
(Figure: a five-transaction database — Tid 1: {3, 4, 5, 6, 7, 9}, Tid 2: {1, 3, 4, 5, 13}, Tid 3: {1, 2, 4, 5, 7, 11}, Tid 4: {1, 3, 4, 8}, plus one more — with the frequent patterns at MinSupport = 40%, e.g. item 1 with support 80% (4 tids), and the association rules at MinConf = 70%, e.g. 7 ⇒ 5 with support 40% (2) and confidence 100%.)
Major Drawbacks of Apriori
- Apriori has to read the database many times to test the support of candidate patterns. To find a frequent pattern X of length 50, it has to traverse the database 50 times.
- On dense datasets with long patterns, as the length of the pattern increases, the performance of Apriori drops rapidly due to the explosion of candidate patterns.
TreeProjection Algorithm (Agarwal, Aggarwal & Prasad 2000)
(Figure: a lexicographical tree of itemsets over items 7, 5, 1, 3, 4, levels 0-3, with transactions projected onto each node, and a triangular matrix for counting frequent patterns of length two.)
Eclat Algorithm (Zaki et al. 1997)
(Figure: the database in vertical layout; the support of an itemset is obtained by intersecting the tid-lists of its items.)
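Eclat's tid-list intersection idea can be sketched as follows (illustrative code, not the lecture's): the database is stored vertically as item → tid-list, the support of an itemset is the size of the intersection of its items' tid-lists, and the search proceeds depth-first.

```python
def vertical(transactions):
    # Convert horizontal transactions (tid -> items) to item -> tid-list.
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def eclat(prefix, items, min_support, out):
    # items: list of (item, tid-list of prefix + [item]) pairs.
    for i, (item, tids) in enumerate(items):
        if len(tids) >= min_support:
            out[frozenset(prefix + [item])] = len(tids)
            # Tid-list intersection extends the prefix by one item.
            suffix = [(j, tids & t) for j, t in items[i + 1:]]
            eclat(prefix + [item], suffix, min_support, out)
    return out

# Same four-transaction example database as before, keyed by tid.
db = {1: {"A", "B", "C"}, 2: {"A", "C"}, 3: {"A", "D"}, 4: {"B", "E", "F"}}
result = eclat([], sorted(vertical(db).items()), min_support=2, out={})
```

With min_support = 2 this finds the same frequent itemsets as Apriori on that database, but each support comes from a set intersection rather than a database scan.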
FP-Growth (Han, Pei & Yin 2000)
(Figure: an FP-tree with root {}, node-links from a header table, and nodes labeled item:count, e.g. 1:3, 5:1, 7:1.)
CT-PRO (Sucahyo & Gopalan 2004)
(Figure: (c) the Global CFP-Tree, a compressed FP-tree data structure.)
Mining Very Large Databases
- Partition algorithm (Savasere, Omiecinski & Navathe 1995)
- Projection (Pei 2002)
  (Figure: a sample transaction {3, 7, 5, 1, 4} distributed into projected databases — Projection 7, Projection 5, Projection 3, Projection 1 — under (a) parallel projection and (b) partition projection.)
Presentation of Association Rules (Table Form)

Visualization of Association Rules Using Rule Graph

Visualization of Association Rules Using Plane Graph
Conclusion
- Association rule mining
  - Probably the most significant contribution from the database community to KDD
  - A large number of papers have been published
  - Many interesting issues have been explored
- An interesting research direction
  - Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
References
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
- David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007.
- Agarwal, R. C., Aggarwal, C. C. & Prasad, V. V. V. 2001, 'A Tree Projection Algorithm for Generation of Frequent Item Sets', Journal of Parallel and Distributed Computing (Special Issue on High-Performance Data Mining), vol. 61, no. 3, pp. 350-371.
- Han, J., Pei, J. & Yin, Y. 2000, 'Mining Frequent Patterns without Candidate Generation', in Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, pp. 1-12.
- Savasere, A., Omiecinski, E. & Navathe, S. 1995, 'An Efficient Algorithm for Mining Association Rules in Large Databases', in Proceedings of the 21st VLDB Conference.
References (2)
- Pei, J. 2002, Pattern-growth Methods for Frequent Pattern Mining, PhD Thesis, Simon Fraser University, Canada.
- Zaki, M. J. 1997, 'Parallel Algorithms for Fast Discovery of Association Rules', Data Mining and Knowledge Discovery: An International Journal, vol. 1, no. 4, pp. 343-373.
- Sucahyo, Y. G. & Gopalan, R. P. 2004, 'CT-PRO: A Bottom-Up Non-Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure', in Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Brighton, UK.