Gunawan dan I Ketut Eddy Purnama ICGC RCICT 2010


ICGC-RCICT

2010

2-3 March 2010

ISSN: 2086-4868

Department of Electrical Engineering

and Information Technology

Faculty of Engineering

Gadjah Mada University

Jalan Grafika no 2 Kampus UGM

Yogyakarta, 55281, Indonesia

PROCEEDINGS OF

THE FIRST INTERNATIONAL CONFERENCE

ON GREEN COMPUTING

AND

THE SECOND AUN/SEED-NET

REGIONAL CONFERENCE ON ICT



Efficiency Comparison on Eclat, FP-Tree, and

Top-Down Algorithms in Search L-Itemsets

Gunawan *,**), I.K.E Purnama *)

Email: [email protected] [email protected]

*) Department of Electrical Engineering, Faculty of Industrial Technology, Institut Teknologi Sepuluh Nopember, Kampus ITS Sukolilo, Surabaya 60111, East Java, Indonesia.

**) Department of Computer Science, Sekolah Tinggi Teknik Surabaya, Ngagel Jaya Tengah 73-77, Surabaya 60284, East Java, Indonesia.

Abstract— We compare the efficiency of three well-known algorithms, Eclat, FP-Tree, and Top-Down, in searching for large itemsets in a huge database. Eclat reduces the number of database scans. It starts from the minimal itemsets and proceeds by intersecting items; the process is repeated until an infrequent node is found (one whose support is less than the minimum support). The Top-Down algorithm is almost the same as Eclat, except that the items in the database are first joined into one itemset, and new combinations are then formed by deleting one item from the itemset in turn. The process is repeated recursively until a frequent node is found. The FP-Tree algorithm mines without generating candidate itemsets. The original database is stored in tree form, and the mining process is then carried out one by one for every frequent item in the database. Each of the three algorithms has its own strengths and weaknesses. Eclat and Top-Down have an advantage in how they create branches of the tree: Eclat creates a branch only after checking whether the combination is present in the database, whereas Top-Down checks whether the branch has already been created by another node. The advantage of FP-Tree is that, unlike Apriori, it generates very few candidate itemsets. The weakness of Eclat and Top-Down depends on the characteristics of the database: if the database has many frequent items and the minimum support value is small, the combination process wastes time. The weakness of FP-Tree is that it stores the original database in tree form, so it uses a lot of memory.

Keywords: data mining; association rule mining; Eclat; FP-Tree; Top-Down

I. INTRODUCTION

Data mining has contributed greatly to the development of various disciplines. It aims to find patterns in large collections of data, and it plays an important role in decision-making at large-scale companies, whose transaction data is processed to discover such patterns. Before a pattern can be found, the data must pass through several steps: data selection, data cleaning and preprocessing, data transformation, data reduction (elimination of unnecessary data), selection of the data mining task and algorithm, post-processing, and finally application of the data mining techniques.

One of the tasks of data mining is to discover frequent itemsets. A frequent itemset is a set of data items that often occur together. Frequent itemsets play an important role in data mining applications that look for specific patterns in a database, such as association rules, correlations, sequences, episodes, classifiers, and clusters. Of all these patterns, one of the most prominent is the association rule. Association rule mining was motivated by the analysis of transactions in supermarkets. From the analysis of these transactions, customer shopping behavior can be studied. An association rule expresses how often a product is purchased together with other products.
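To make the idea of "how often a product is purchased together with other products" concrete, the following minimal Python sketch computes the support and confidence of one rule. The baskets, the item names, and the function names `support` and `confidence` are illustrative assumptions and do not come from the data used in this study.

```python
# Hypothetical supermarket baskets (illustrative only, not from this study).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by the support of its antecedent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Rule "bread -> milk": how often milk is bought when bread is bought.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here bread and milk occur together in two of the four baskets, and bread occurs in three, so the rule has a support of 0.5 and a confidence of two thirds.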

Figure 1 shows that the data mining process starts from the preparation of the data set. The filtered data is then passed to the next step, the selection of an appropriate algorithm to search for L-itemsets (large itemsets).

Various algorithms can be used to discover large itemsets. This study discusses only the Eclat, FP-Tree, and Top-Down algorithms.

II. RELATED THEORIES

A. FP-Tree Algorithm

This section discusses the search for frequent patterns using the Frequent Pattern Tree (FP-Tree) algorithm. The input to the FP-Tree algorithm consists of two fields, an index field and a target field.

The index field is used only to sort the data. The FP-Tree algorithm has no candidate generation step, because candidate generation is a factor that causes delays. The large itemset search process using the FP-Tree algorithm can be seen in Figure 2.

Figure 2. Large Itemset Search Process with FP-Tree

Figure 3. First Step of FP-Tree

The FP-Tree algorithm mines the database while minimizing the candidate search process: candidates are sought only among the items in the FP-Tree that occur frequently. In addition, FP-Tree uses a tree structure so that it can avoid reading the database repeatedly. This is different from the Apriori algorithm, which reads the database repeatedly and therefore wastes time.

1) Tree Construction

The data structure of the FP-Tree must be designed so that it accurately holds all the information required by the mining process. It is designed on the basis of the following observations:

• The database should be scanned only once. This scan must be able to identify the items that belong to frequent itemsets. The frequency of occurrence of each item must be stored along with the name of the item. This is done because mining is performed on frequent items only; infrequent items are removed.

• If the frequent items of each transaction are stored in a suitable structure, repeated scans of the database can be avoided.

• Transactions that contain the same set of frequent items can be merged into one, with only the frequency increased. To determine whether two sets of items are the same, the items of each transaction must be sorted, in either ascending or descending order.

• If two transactions share a common prefix under a particular ordering of the frequent items, the shared part can be merged, with its count increased accordingly. If the frequent items are sorted in descending order of frequency, shared prefixes are more likely to occur.

2) Example of Tree Formation

A transaction database is given below to clarify the construction of the FP-Tree. The database contains the transaction id and the items purchased, as shown in the first two columns of Table 1, and the minimum support is 3.

The first step of the FP-Tree algorithm is to scan the database to obtain all frequent 1-itemsets: the frequency of each item is recorded and compared with the minimum support. If the frequency of an item is below the minimum support, the item is not used in the subsequent steps.

The frequent items obtained are then sorted in descending order of frequency. The next step is to create the root of the tree, which is initialized with NIL. The database is then scanned a second time, and the items of each transaction are sorted according to the item order produced by the previous step. The sorted transactions can be seen in Table 1.

Figure 4. Large Itemset Search Process with Eclat (flowchart: Data set, Search and Sort Frequent Item, Frequent Item, Combination Process for each Frequent Item, Large Itemsets)

of all rows equal to one. The FP-Tree formed from the first transaction can be seen in Figure 3. The process is carried out in the same way up to the fifth transaction.
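The two database scans and the insertion of sorted transactions into the tree can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: Table 1 is not reproduced in this text, so the five transactions below are assumed to be the well-known example from Han, Pei, and Yin [5], chosen because it yields the paths quoted later for item p. The names `Node`, `build_fp_tree`, and the `tie_break` parameter are hypothetical; `tie_break` only fixes the order of items that have equal frequency.

```python
from collections import Counter

class Node:
    """One FP-Tree node: an item, its count, its parent, and its children."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support, tie_break=()):
    tie_break = list(tie_break)
    # First scan: count every item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    # Sort by descending frequency; equal frequencies follow `tie_break`
    # (a parameter invented for this sketch), then alphabetical order.
    def sort_key(item):
        pos = tie_break.index(item) if item in tie_break else len(tie_break)
        return (-frequent[item], pos, item)

    order = sorted(frequent, key=sort_key)
    rank = {item: pos for pos, item in enumerate(order)}

    root = Node(None, None)              # the NIL root
    header = {item: [] for item in order}  # header table: item -> its nodes
    # Second scan: insert the frequent items of each transaction in rank
    # order, sharing prefixes that already exist in the tree.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order

# Assumed example database (minimum support 3).
db = [
    list("facdgimp"),
    list("abcflmo"),
    list("bfhjo"),
    list("bcksp"),
    list("afcelpmn"),
]
root, header, order = build_fp_tree(db, 3, tie_break="fcabmp")

# Read the counts along the branch f -> c -> a -> m -> p.
node, branch_counts = root, []
for item in "fcamp":
    node = node.children[item]
    branch_counts.append(node.count)
print(order, branch_counts)  # ['f', 'c', 'a', 'b', 'm', 'p'] [4, 3, 3, 2, 2]
```

The header table keeps a list of the nodes of each item, which plays the role of the head links followed during mining.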

3) Mining Process with FP-Tree

In general, mining with the FP-Tree can be divided into four steps:

1. Mining of the FP-Tree starts from the item with the smallest support in the header table.

2. Find the frequent patterns that contain the item being mined.

3. Find the k-conditional pattern base for the item k being mined.

4. Build the conditional FP-Tree from the conditional pattern base formed in the previous step.

4) Conditional Pattern Base Search

Once the tree has been formed, the next step, following the mining steps above, is to search for the conditional pattern base of each item listed in the example of Table 1. The following example mines the conditional pattern base of p.

Mining in this example starts from node p. Following the head link of p in the header table shows where the nodes of p are located. This yields the paths that contain p, namely (f: 4, c: 3, a: 3, m: 2, p: 2) and (c: 1, b: 1, p: 1). The first path shows that (f, c, a, m, p) appears twice in the database. The path (f, c, a) appears three times and (f) alone appears four times, but they appear together with p only twice. Therefore, the path is counted as (f: 2, c: 2, a: 2, m: 2, p: 2), and the prefix path of p is (f: 2, c: 2, a: 2, m: 2).

For the path (c: 1, b: 1, p: 1), the prefix path used is (c: 1, b: 1). These two prefix paths form the sub-pattern base of p, termed the conditional pattern base of p.

The conditional pattern base of p is the sub-pattern base that holds under the condition that p occurs.

From the prefix paths obtained for node p, the support of each item contained in them is counted. The resulting support of each item can be seen in Table 2.

The minimum support in the example of Table 1 is 3. Table 2 therefore shows that the only eligible item is c, so c is the only frequent item. The next step is to form the conditional FP-Tree of p. Because only item c meets the requirement, the conditional FP-Tree of p consists of a single branch.
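The search for the conditional pattern base of p and the support count of its items can be reproduced with a short Python sketch. It is a simplification under stated assumptions: instead of following head links in the tree, it reads the prefix paths directly from the sorted transactions, which gives the same prefix paths and counts. The sorted transactions are assumed to be those of the example from Han, Pei, and Yin [5], since Table 1 is not reproduced in this text, and the function names are hypothetical.

```python
from collections import Counter

# Frequent items of each transaction, already sorted in the order
# f, c, a, b, m, p (assumed example data).
ordered = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_base(ordered_transactions, item):
    """Map each prefix path that precedes `item` to how often it occurs."""
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = tuple(t[: t.index(item)])
            if prefix:
                base[prefix] += 1
    return base

def frequent_in_base(base, min_support):
    """Count item support inside the base and keep the frequent items."""
    item_support = Counter()
    for prefix, count in base.items():
        for item in prefix:
            item_support[item] += count
    return {i: c for i, c in item_support.items() if c >= min_support}

base = conditional_pattern_base(ordered, "p")
kept = frequent_in_base(base, 3)
print(dict(base))  # {('f', 'c', 'a', 'm'): 2, ('c', 'b'): 1}
print(kept)        # {'c': 3}
```

Only c reaches the minimum support of 3 (two occurrences from the first prefix path and one from the second), which matches the single-branch conditional FP-Tree described above.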

B. Eclat Algorithm

The Eclat algorithm was first introduced by Mohammed Javeed Zaki in 1997. Its emergence cannot be separated from Apriori. Unlike Apriori, the Eclat algorithm does not read the database repeatedly.

The input to the Eclat algorithm consists of two fields, an index field and a target field. The large itemset search process using the Eclat algorithm can be seen in Figure 4.

1) Mining Process with Eclat

Mining with the Eclat algorithm is basically done by intersecting the items that meet the minimum support value. Intersecting the frequent items speeds up the mining process. The longer the frequent itemset, the fewer intersection operations are performed, and so the fewer join operations as well.

The Eclat algorithm does not require a complex hash-tree structure, so it does not use much memory. Eclat also has the advantage of allocating memory well, because it needs only a linear scan of the itemsets. In the Eclat algorithm, a larger minimum support value means that infrequent itemsets are reached sooner, so the mining process also finishes faster. This is very beneficial because it reduces operational costs.
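The intersection-based search can be sketched in Python as follows. This is a minimal illustration, not the program tested in this study: it assumes that each item is represented by the set of ids of the transactions that contain it, and it stops extending a branch as soon as an intersection falls below the minimum support. The function name `eclat` and the example database (the same assumed five transactions from Han, Pei, and Yin [5]) are hypothetical choices for this sketch.

```python
def eclat(transactions, min_support):
    """Return {frozenset(itemset): support count} by intersecting id sets."""
    # One scan builds the vertical layout: item -> ids of its transactions.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    items = sorted(
        ((i, tids) for i, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    result = {}

    def extend(prefix, candidates):
        for k, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            result[itemset] = len(tids)
            # Intersect with every later item; an infrequent intersection
            # ends that branch, so it is simply not kept.
            extensions = []
            for other, other_tids in candidates[k + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    extensions.append((other, common))
            if extensions:
                extend(itemset, extensions)

    extend(frozenset(), items)
    return result

# Assumed example database (minimum support 3).
db = [
    set("facdgimp"),
    set("abcflmo"),
    set("bfhjo"),
    set("bcksp"),
    set("afcelpmn"),
]
found = eclat(db, 3)
print(len(found), found[frozenset("fcam")])  # 18 3
```

After the first scan, every support count comes from a set intersection, which is why the database itself does not have to be read again.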

C. Top-Down Algorithm

The previous sections discussed the FP-Tree and Eclat algorithms. This section discusses the last algorithm, Top-Down, and how it mines frequent itemsets from a given database. The large itemset search process using the Top-Down algorithm can be seen in Figure 5.

In the Top-Down algorithm, mining starts from the maximal itemset. A maximal itemset is an itemset that is not part of any other itemset. First, the database is searched for the items that meet the minimum support. All the frequent items found are then merged into one itemset, and the process of combining items one by one begins.

Figure 5. Large Itemset Search Process with Top-Down

1) Mining Process with Top-Down

Mining with the Top-Down algorithm is basically done by forming combinations of the itemset being mined. A combination is formed by removing one item from the itemset in turn. Each new itemset produced is mined again, repeatedly, until a frequent itemset is discovered or the search reaches the last level.

The principle of the Top-Down algorithm is that mining is carried out recursively until the nodes being traced are found to be frequent. Once a frequent itemset has been found, the next step is to form its possible combinations; the new nodes are automatically frequent itemsets. In principle the Top-Down algorithm is very efficient, because whenever it forms a combination it checks the other branches to see whether the combination has already been formed.
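A top-down search of this kind can be sketched in Python as follows. The sketch rests on several assumptions that go beyond the description above: it processes the candidates level by level rather than recursively, so that every frequent itemset it reports is maximal; it uses a set per level to avoid forming a combination twice; and it skips any itemset that is already covered by a frequent itemset found earlier, since every subset of a frequent itemset is itself frequent. The function name `top_down` and the example database are hypothetical.

```python
def top_down(transactions, min_support):
    """Return the maximal frequent itemsets found by a top-down search."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Join every frequent item into one starting itemset.
    all_items = set().union(*transactions)
    start = frozenset(
        i for i in all_items if support(frozenset([i])) >= min_support
    )
    if not start:
        return []

    maximal = []
    level = {start}
    while level:
        next_level = set()  # a set, so no combination is formed twice
        for itemset in level:
            if any(itemset <= m for m in maximal):
                continue  # covered by a frequent itemset found earlier
            if support(itemset) >= min_support:
                maximal.append(itemset)
            elif len(itemset) > 1:
                # Infrequent: remove one item in turn to form combinations.
                for item in itemset:
                    next_level.add(itemset - {item})
        level = next_level
    return maximal

# Assumed example database (minimum support 3).
db = [
    set("facdgimp"),
    set("abcflmo"),
    set("bfhjo"),
    set("bcksp"),
    set("afcelpmn"),
]
maximal_sets = top_down(db, 3)
print(sorted("".join(sorted(m)) for m in maximal_sets))  # ['acfm', 'b', 'cp']
```

Because the search starts from the largest itemset, the number of combinations examined before the first frequent itemset appears grows quickly when there are many frequent items, which is consistent with the weakness of Top-Down noted in this paper.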

III. TESTING AND RESULT

This section compares the time and memory used when mining with the Eclat, FP-Tree, and Top-Down algorithms. The test program mines several databases with different support values, so each database is mined with several support values using each of the three algorithms.

The tests were run on a computer with a 1.8 GHz Pentium 4 processor and 256 MB of memory, running Microsoft Windows 2000. The databases used are the BMS-WebView-1 database, which has a total of 22,225 transactions, and the Movie database, which has a total of 5,888 transactions.

This section is divided into two subsections: the first compares the time and the second compares the memory used by the three algorithms.

A. Time Comparison

The time needed for mining varies between the algorithms. On some databases Eclat may mine faster than FP-Tree, while on others Top-Down may be faster than the other algorithms. Mining time is also influenced by the support value used.

TABLE III. USING BMS-WEBVIEW-1 DATABASE (HH:MM:SS)

Min. Support    Algorithms
                Eclat       FP-Tree     Top-Down
0.02            00:00:53    00:01:49    00:02:43
0.03            00:00:33    00:01:40    00:54:39
0.04            00:00:24    00:01:48    00:00:34
0.05            00:00:12    00:01:55    00:00:01

Table 3 shows that the Top-Down algorithm has the fastest time compared with Eclat and FP-Tree when the minimum support is 0.04 and 0.05. At those support values, Eclat has a better mining time than FP-Tree. However, as the support given for mining the BMS-WebView-1 database becomes smaller, the mining time of Top-Down becomes worse than that of the other algorithms. In this case, the mining time of FP-Tree is stable compared with the other algorithms.

TABLE IV. USING MOVIE DATABASE (HH:MM:SS)

Min. Support    Algorithms
                Eclat       FP-Tree     Top-Down
0.03            00:00:30    00:01:24    12:11:05
0.04            00:00:10    00:00:51    02:41:54
0.05            00:00:02    00:00:09    00:00:00

In Table 4, the minimum supports used in the experiments with the Movie database are 0.3, 0.4, and 0.5. With a support of 0.5, Top-Down has the fastest mining time of the three algorithms. However, as the minimum support is reduced, the time required for mining with Top-Down becomes much greater. For the Movie database, Eclat has a short and stable mining time compared with FP-Tree and Top-Down.

B. Memory Comparison

The Eclat, FP-Tree, and Top-Down algorithms are similar in their mining process: each of them builds a tree that is used in mining the database. Building the tree uses the memory available in the computer. This section compares the memory used in the mining process by the Eclat, FP-Tree, and Top-Down algorithms.


has a change in the amount of memory was stable between the three algorithms.

TABLE V. USING BMS-WEBVIEW-1 DATABASE

Min. Support    Algorithms
                Eclat       FP-Tree     Top-Down
0.02            0.2M        7.3M        3.8M
0.03            0.07M       5.5M        2.6M
0.04            0.05M       5.0M        0.04M
0.05            0.02M       4.6M        0.002M

TABLE VI. USING MOVIE DATABASE

Min. Support    Algorithms
                Eclat       FP-Tree     Top-Down
0.03            0.02M       10.0M       20.0M
0.04            0.007M      6.3M        4.7M
0.05            0.002M      1.9M        0.0008M

Table 6 shows that mining with the Eclat algorithm needs less memory than the other algorithms. Meanwhile, the Top-Down algorithm needs considerably more memory at a minimum support of 0.4 than at 0.5, and the same holds between supports of 0.3 and 0.4.

IV. CONCLUSION

From this study, the authors draw the following conclusions:

• Removing infrequent items makes the formation of the FP-Tree faster. Eliminating these items produces a new table that contains only frequent items, which accelerates the formation of the tree. The tree is then built from the new table without accessing the original database again.

• FP-Tree has a fast mining time even on larger datasets. Its weakness is that it uses a large amount of memory.

• The Top-Down algorithm does not need to perform pruning as FP-Tree does. Its weakness appears when there are many kinds of items, many frequent items, and a small minimum support: it then wastes time, because Top-Down continues searching until it finds a frequent itemset.

• When forming a branch, the Eclat algorithm always checks in advance whether the combination is contained in the database. This avoids forming branches that are not needed. Its weakness is that it keeps calculating support even when the itemset is already frequent, because the algorithm stops only when it finds an infrequent itemset.

REFERENCES

[1] R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules”, San Jose: IBM Almaden Research Centre.

[2] R. Agrawal, R. Srikant, “Mining sequential patterns: Generalization and performance improvements”, In Proceedings 1st International Conference Knowledge Discovery and Data Mining, 1993.

[3] C. Borgelt, “Efficient Implementation of Apriori and Eclat”, Unpublished, Magdeburg: Department of Knowledge Processing and Language Engineering, School of Computer Science, Otto-von-Guericke-University of Magdeburg.

[4] B. Goethals, “Memory Issues in Frequent Itemset Mining”, Unpublished, Helsinki: Department of Computer Science, University of Helsinki.

[5] J. Han, J. Pei, Y. Yin, “Mining Frequent Pattern Without Candidate Generation”, Unpublished, School of Computer Science, Simon Fraser University.

[6] J. Han, M. Kamber, “Data Mining: Concepts and Techniques”, 1999.

[7] S. Parthasarathy, M.J. Zaki, M. Ogihara, W. Li, “New Algorithms for Fast Discovery of Association Rules”, New York: Computer Science Department, University of Rochester, July 1997.

[8] M.J. Zaki, “Scalable Data Mining for Rules”, New York: Department of Computer Science, University of Rochester, 1998.
