• Tidak ada hasil yang ditemukan

2 Distributed Association Rule Mining

N/A
N/A
Protected

Academic year: 2024

Membagikan "2 Distributed Association Rule Mining"

Copied!
2
0
0

Teks penuh

(1)

An Abstraction Based Communication Efficient Distributed Association Rule Mining

P. Santhi Thilagam1and V.S. Ananthanarayana2

1 Sr. Lecturer, Dept. of Computer Engineering, NITK-Surathkal, India-575 025 santhi [email protected]

2 Professor, Dept. of Information Technology, NITK-Surathkal, India-575 025 [email protected]

Abstract. Association rule mining is one of the most researched areas because of its applicability in various fields. We propose a novel data structure called Sequence Pattern Count, SPC, tree which stores the database compactly and completely and requires only one scan of the database for its construction. The completeness property of the SPC tree with respect to the database makes it more suitable for mining as- sociation rules in the context of changing data and changing supports without rebuilding the tree. A performance study shows that SPC tree is efficient and scalable. We also propose a Doubly Logarithmic-depth Tree, DLT, algorithm which uses SPC tree to efficiently mine the huge amounts of geographically distributed datasets in order to minimize the communication and computation costs. DLT requires only O(n) mes- sages for support count exchange and it takes onlyO(log log n) time for exchange of messages, which increases its efficiency.

Keywords: Association rule mining, Distributed databases, Sequence Pattern Count Tree, Incremental mining, Doubly Logarithmic-depth Tree.

1 Introduction

Due to the explosive growth in the number, size and complexity of databases, many geographically distributed organizations, the ever-growing number of ap- plications and the high scalability of distributed systems, there is a need for mining distributed databases [4]. Association Rule Mining (ARM), the mining of frequent patterns in large transaction databases and many other types of databases has been studied popularly in data mining research. The most impor- tant component affecting the performance of any ARM algorithm is the number of disk accesses required [1]. Recently alternative data structures were employed in order to improve the efficiency of existing and new algorithms [5]. This mo- tivated us to propose an approach that employs an abstraction called Sequence Pattern Count, SPC, tree which is more compact and complete, and suitable for incremental mining and changing support [10]. A SPC tree is constructed using a single database scan and can be updated dynamically. Algorithm based on this structure does not require any more database scans to generate frequent itemsets [6].

S. Rao et al. (Eds.): ICDCN 2008, LNCS 4904, pp. 251–256, 2008.

c Springer-Verlag Berlin Heidelberg 2008

(2)

252 P. Santhi Thilagam and V.S. Ananthanarayana

All proposed algorithms for mining association rules in distributed databases focus on reduction of communication [3], efficient usage of memory, processing power, ability to scale up the number of processors and associated data, ability to increase the size of the database, decrease in response time with addition of processors[4][8]. However, the majority of the parallel mining algorithms suffer from high communication and synchronization overhead [2][7]. In this work, a new distributed association rule mining approach is proposed that decreases communication costs by introducing a new message exchange procedure and a new computational efficient technique that reduces computation time. Our main contributions reported in this paper are:

1. Communication optimization: Doubly Logarithmic-depth Tree algorithm is used to minimize the communication costs in terms of number of messages for support count exchange and time taken for exchange of these messages.

2. Minimization of the number of database scans: SPC tree structure is used to make the mining process more efficient in terms of database scans in the distributed environment, which requires only one scan of the database for mining frequent itemsets.

2 Distributed Association Rule Mining

Let ‘DB’ be a database and ‘n’ be the processors of nodes namelyP1, P2, . . . , Pn

which are connected over a computer network. Each processor has a local mem- ory and a local disk. The processor can communicate only by passing messages and we assume that there is no loss of message during communication and net- work is completely reliable. Let the database DB be partitioned into n non- overlapping blocks DB1, DB2, . . . , DBn where n is the number or processors available and each partitionDBi has the same schema. Let the size of DB and the partitionsDBi beD andDi respectively. For a given itemsetX, letX.sup andX.supibe the respective support counts ofX in DB andDBi. The problem is to find all frequent itemsets in DB[9].

2.1 Solution Approach

Distributed Association Rule Mining is explained in Algorithm 1 which consists of the following phases:

SPC tree Construction – This preprocessing step allows us to store the DBi compactly in main memory. The frequent-1 itemsets can be generated during the construction of the SPC tree which will be used in mining process later.

SPC tree construction is explained in Algorithm 2. SPC tree Growth – This algorithm adopts pattern-growth approach to mine all frequent itemsets from the SPC tree which requires no more database scans.Communication between processors – It uses DLT algorithm to exchange the local counts between the processors in order to calculate the global count of each itemsets which is ex- plained in Algorithm 3.

Referensi

Dokumen terkait

The study conducted consists of three major steps (Figure 3), namely: the spatial data pre-process, application of apriori algorithm to obtain the association pattern of

They are the Constitutional Court Ruling regarding the examination of provisions for the determination of mining zones in Law on Mineral and Coal Mining (Law No. 4

The cross-sectional study was performed on 240 subjects (135 patients of ST-elevation MI below 45 years and 105 age-matched controls). Association rule mining was used to detect

This paper shows a process of generating Multidimensional Fuzzy Association Rules mining from a normalized database of medical record patients.. The process consists of two

Membership Membership shall be open to any individual or institution possessing an interest in mining history and/or the preservation of mining sites and materials!. Fees Fees are

PO3 UNIT I: Lectures: 7 Data Mining, Data Mining task primitives, Integration of Data Mining system with the database, Major issues in Data Mining, Data Pre-processing, Descriptive

Plant Survival Rates of the 10 Local Tree Species in the Post-Coal Mining Land The results showed that seven of the ten local plant species planted on the post-coal mining area had

During the next step, candidate 2-item sets are created using frequent 1-item sets as the Apriori rule confirms that superset of non-frequent 1- Item Count Scala Codebook 4 Learning