International Journal of Advanced Computer Engineering and Communication Technology (IJACECT) _______________________________________________________________________________________________
_______________________________________________________________________________________________
ISSN (Print): 2278-5140, Volume-5, Issue-1, 2016 17
A Study of Efficient Algorithms for Fast Recovery of Frequent Itemset Mining
S. Vimala, D. Kerana Hanirex, K. P. Kaliyamurthie
CSE Department, Bharath University, Chennai, Tamil Nadu.
Abstract: Data mining refers to extracting, or mining, knowledge from large amounts of data; it could therefore more accurately be called knowledge mining, which emphasizes discovery from large volumes of data. It is the process of finding patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall objective of the data mining process is to extract information from a data set and transform it into an understandable form for further use. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.
Other pattern detection problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry errors. This paper presents a literature review of different techniques for mining frequent itemsets.
Keywords: Data Mining, Frequent Itemset Mining
I. INTRODUCTION
Data Mining is a process of discovering various models, summaries, and derived values from a given collection of data. Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. These tools can answer business questions that traditionally were too time consuming to resolve. They search databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line.
FREQUENT ITEMSET MINING (FIM) is one of the most fundamental problems in data mining. It has practical importance in a wide range of application areas such as decision support, Web usage mining, and bioinformatics. Given a database in which each transaction contains a set of items, FIM finds the itemsets that occur in transactions more frequently than a given threshold.
II. PROBLEM STUDY
Frequent itemset (or pattern) mining is well established in the data mining field because of its broad applications in mining association rules, correlations, constraint-based graph patterns, sequential patterns, and many other data mining tasks. Efficient algorithms for mining frequent itemsets are crucial for mining association rules as well as for many other data mining tasks. The major challenge in frequent pattern mining is the large number of result patterns: as the minimum support threshold is lowered, an exponentially large number of itemsets is generated. Effectively pruning unimportant patterns during the mining process has therefore become one of the main topics in frequent pattern mining. Consequently, the main aim is to optimize the process of finding patterns so that it is efficient and scalable and can detect the important patterns, which can then be used in various ways.
III. LITERATURE SURVEY
The goal of frequent itemset mining is to discover sets of items that frequently co-occur in the data. The problem is non-trivial because datasets can be very large, consist of many distinct items, and contain interesting itemsets of high cardinality. Existing approaches can be classified into three major categories.
First, bottom-up algorithms such as the well-known Apriori algorithm [1, 2] repeatedly scan the database to build itemsets of increasing cardinality. They exploit monotonicity properties between frequent itemsets of different cardinalities and are simple to implement, but they suffer from a large number of expensive database scans as well as the costly generation and storage of candidate itemsets. Second, in top-down algorithms the largest frequent itemset is built first and itemsets of smaller cardinality are constructed afterwards, again using repeated scans over the database. Finally, prefix-tree algorithms operate in two phases. In the first phase, the database is transformed into a prefix tree designed for efficient mining. The second phase extracts the frequent itemsets from this tree without further base data
access. Algorithms of this class require only a fixed number of database scans, but may require large amounts of memory. The major emphasis has been placed on finding efficient algorithms for discovering the large itemsets. In [9], the authors proposed a genetic-algorithm-based method for finding frequent itemsets. In [10], the authors presented the TDTR approach for mining frequent itemsets, which reduces the number of transactions from the original database based on the minimum threshold value, thus improving performance.
The core algorithms and their performance characteristics discussed in this section are as follows.
a. Apriori Algorithm
Apriori uses a "bottom-up" approach, in which frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. The algorithm [1, 2] terminates when no further successful extensions are found. It uses breadth-first search and a hash-tree structure to count candidate itemsets efficiently. It employs an iterative approach known as level-wise search, in which k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support; the resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database.
A two-step process consisting of join and prune actions is followed at each level of Apriori.
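The level-wise search and the join/prune steps can be sketched as follows. This is an illustrative Python rendering, not the original implementation; the function and variable names are chosen purely for exposition, and a production version would use the hash-tree structure mentioned above for counting.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: L_k seeds the candidate (k+1)-itemsets,
    and each level requires one full pass over the transactions."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # First scan: collect frequent 1-itemsets (L1)
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets into k-candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # One full database scan counts the surviving candidates
        current = {c for c in candidates if support(c) >= min_support}
        frequent |= current
        k += 1
    return frequent
```

With transactions {a,b}, {a,c}, {a,b,c}, {b} and a minimum support of 0.5, the result contains {a}, {b}, {c}, {a,b}, and {a,c}.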
Advantages
It is the basic algorithm for mining frequent patterns, and it is suitable for both sparse and dense databases.
Disadvantages
It requires a large number of database scans, and its space and time complexity are high.
b. Direct Hashing and Pruning Algorithm
Direct Hashing and Pruning (DHP) [5] differs from the Apriori algorithm in that it uses a hashing technique to remove unpromising itemsets from the generation of the next set of candidate itemsets. In the Apriori algorithm, determining all the large itemsets from the set of candidate large itemsets in a particular pass by scanning the database becomes very expensive. In DHP the size of the candidate set is reduced, so it performs much faster than the Apriori algorithm.
However, DHP generates a number of false positives, so Perfect Hashing and Pruning (PHP) has been suggested as an improvement over DHP. In this algorithm, during the first pass a hash table with size equal to the number of individual items in the database is created, and each individual item is mapped to a distinct location in the hash table. After the first pass, the hash table contains the exact number of occurrences of each item in the database. The algorithm then removes all transactions that contain no frequent items and trims the infrequent items from the remaining transactions.
Simultaneously, candidate k-itemsets are generated and their occurrences are counted. Thus, at the completion of each pass, the algorithm yields a pruned database, the occurrence counts of the candidate k-itemsets, and the set of frequent k-itemsets; it terminates when no new frequent k-itemsets are found. PHP is therefore an improvement over DHP with no false positives, which eliminates the extra time required to verify the frequency of each itemset. PHP also reduces the size of the database.
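The hashing idea behind DHP can be sketched as below. This is an assumption-laden illustration rather than the published algorithm: the hash function and bucket count are arbitrary choices for exposition, and only the pass that filters candidate 2-itemsets is shown.

```python
from itertools import combinations
from collections import Counter

def dhp_candidate_pairs(transactions, min_count, n_buckets=7):
    """DHP-style sketch: during the pass that counts single items, hash every
    2-subset of each transaction into a small bucket table. A pair survives as
    a candidate only if (a) both of its items are frequent and (b) its bucket
    count reaches min_count. Bucket collisions can admit false positives,
    which is exactly what PHP's perfect (collision-free) table eliminates."""
    item_counts = Counter()
    buckets = [0] * n_buckets

    def h(pair):
        # Illustrative hash function, an assumption of this sketch
        return hash(frozenset(pair)) % n_buckets

    for t in transactions:
        item_counts.update(t)
        for pair in combinations(sorted(t), 2):
            buckets[h(pair)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= min_count}
    return {frozenset(p)
            for p in combinations(sorted(frequent_items), 2)
            if buckets[h(p)] >= min_count}
```

Because a pair's bucket count is at least its true count, every truly frequent pair passes the filter; collisions may let some infrequent pairs through, to be discarded by the counting pass that follows.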
Advantages
Performs better than Apriori on small and medium databases
Disadvantages
Not well suited to large databases
c. Partition Algorithm
The Partition algorithm is a variation of Apriori. Apriori and DHP require repeated passes over the database; in contrast, the Partition algorithm [4] needs only two. It logically partitions the database D into n partitions, mines each partition in memory for locally frequent itemsets, and reads the entire database at most twice to generate the association rules.
Advantages
Reduce the number of database scans, Suitable for large databases
Disadvantages
More time may be required because locally frequent itemsets are found first and then checked for global frequency.
d. Sampling Algorithm
This algorithm [6] overcomes the I/O overhead of repeated scans by not considering the whole database when checking frequencies. The idea is to pick a random sample R of transactions from the database D instead of using the whole database, choosing the sample so that it fits entirely in main memory.
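A minimal sketch of this sampling strategy follows. The lowered-threshold factor (`slack`), sample fraction, and brute-force sample miner are all assumptions chosen for illustration; the published algorithm [6] additionally checks the negative border to detect itemsets the sample may have missed, a refinement omitted here.

```python
import random
from itertools import combinations

def _frequent_in(transactions, min_support):
    """Tiny brute-force miner for the memory-resident sample (sketch only)."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sum(1 for t in transactions if set(c) <= t) / n >= min_support}

def sampling_mine(transactions, min_support, sample_frac=0.5, slack=0.9, seed=1):
    """Mine a random in-memory sample at a slightly lowered threshold
    (slack < 1 guards against sampling variance), then verify the
    resulting candidates with a single scan of the full database."""
    transactions = [frozenset(t) for t in transactions]
    rng = random.Random(seed)
    k = max(1, int(len(transactions) * sample_frac))
    sample = rng.sample(transactions, k)
    candidates = _frequent_in(sample, min_support * slack)
    n = len(transactions)
    # Verification scan: keep only candidates that are frequent in D itself
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) / n >= min_support}
```

The verification pass guarantees no false positives in the output; what sampling can lose, as noted below, is completeness, since an itemset infrequent in the sample is never proposed as a candidate.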
Advantages
Good memory utilization and less time required; suitable for any kind of dataset
Disadvantages
May not give accurate results
e. Eclat Algorithm
The Eclat algorithm [7, 8] is basically a depth-first search algorithm using set intersection. It uses a vertical database layout: instead of explicitly listing all transactions, each item is stored together with its cover (the set of transactions containing it)
and the support of an itemset is computed by an intersection-based approach. In this way, the support of an itemset X can be easily computed by simply intersecting the covers of any two subsets Y, Z ⊆ X such that Y ∪ Z = X. That is, when the database is stored in the vertical layout, the support of a set can be counted much more easily by intersecting the covers of two of its subsets that together give the set itself.
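The vertical layout and cover intersection can be sketched as below; this is an expository Python rendering (names and the depth-first extension order are illustrative), where each recursive step intersects the running prefix cover with one more item's cover.

```python
def eclat(transactions, min_count):
    """Eclat sketch on a vertical layout: each item maps to its cover (the
    set of transaction ids containing it), and the support of a prefix
    extended by an item is the size of the intersection of their covers."""
    # Build the vertical layout: item -> cover (set of transaction ids)
    covers = {}
    for tid, t in enumerate(transactions):
        for item in t:
            covers.setdefault(item, set()).add(tid)

    frequent = {}

    def dfs(prefix, prefix_cover, items):
        for i, (item, cover) in enumerate(items):
            # Support of prefix + {item} via one set intersection
            new_cover = prefix_cover & cover if prefix else cover
            if len(new_cover) >= min_count:
                new_prefix = prefix + (item,)
                frequent[frozenset(new_prefix)] = len(new_cover)
                # Depth-first: extend only with items later in the order
                dfs(new_prefix, new_cover, items[i + 1:])

    dfs((), set(), sorted(covers.items()))
    return frequent
```

For transactions {a,b}, {a,c}, {a,b,c} with a minimum count of 2, the search finds {a} with support 3 and {b}, {c}, {a,b}, {a,c} with support 2.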
Advantages
Suitable for medium-size and dense datasets; faster than the Apriori algorithm
Disadvantages
Not suitable for small datasets
f. FP-Growth Algorithm
The generation of frequent patterns in the FP-growth (frequent pattern growth) algorithm [3] comprises two sub-processes: first, the construction of the FP-tree, and second, the generation of frequent patterns from the FP-tree. To store the database in a compressed form, it uses an extended prefix-tree (FP-tree) structure. FP-growth uses a divide-and-conquer approach to decompose both the mining tasks and the database. The FP-tree overcomes the two main disadvantages of Apriori: it requires only two scans of the database and no candidate generation, so FP-growth is faster than the Apriori algorithm. It is more effective on dense databases than on sparse ones. Its major cost is the recursive construction of the FP-trees.
Advantages
Only two scans of the database; suitable for large and medium datasets
Disadvantages
The recursive construction of the FP-trees and the complex data structure require a large amount of space
IV. CONCLUSION
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. This paper gives a brief survey of different approaches for mining frequent itemsets. Frequent itemset mining is the generation of all frequent itemsets that exist in market-basket-like data with respect to minimal thresholds for support and confidence. The performance of the algorithms reviewed in this paper depends on the storage structure, the technique used, and the nature and size of the datasets.
Among these algorithms, the FP-growth algorithm performs best and is applicable to large and medium datasets, such as those arising in supply chain management and query recommendation for search engines.
REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Int'l Conf. on Management of Data, May 1993.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th VLDB Conference, Santiago, Chile, 1994.
[3] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 1–12, 2000.
[4] A. Savasere, E. Omiecinski, and S. Navathe, "An efficient algorithm for mining association rules in large databases," Proc. Int'l Conf. on Very Large Data Bases (VLDB), pp. 432–443, 1995.
[5] J. S. Park, M. S. Chen, and P. S. Yu, "An effective hash-based algorithm for mining association rules," Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 175–186, 1995.
[6] H. Toivonen, "Sampling large databases for association rules," Proc. Int'l Conf. on Very Large Data Bases (VLDB), pp. 134–145, 1996.
[7] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New Algorithms for Fast Discovery of Association Rules," Proc. 3rd Int'l Conf. on Knowledge Discovery and Data Mining (KDD'97), AAAI Press, Menlo Park, CA, USA, pp. 283–296, 1997.
[8] C. Borgelt, "Efficient Implementations of Apriori and Eclat," Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL), CEUR Workshop Proceedings 90, Aachen, Germany, 2003.
[9] D. Kerana Hanirex and K. P. Kaliyamurthie, "Mining Frequent Itemsets Using Genetic Algorithm," Middle-East Journal of Scientific Research, vol. 19, no. 6, pp. 807–810, 2014, ISSN 1990-9233, IDOSI Publications.
[10] D. Kerana Hanirex and K. P. Kaliyamurthie, "An Adaptive Transaction Reduction Approach for Mining Frequent Itemsets: A Comparative Study on Dengue Virus Type 1," Int. J. Pharm. Bio. Sci., vol. 6, no. 2(B), pp. 336–340, April 2015.
[11] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of association rules," Proc. 8th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp. 217–228, 2002.
[12] M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi, "Anonymity preserving pattern discovery," VLDB J., vol. 17, no. 4, pp. 703–727, 2008.
[13] R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta, "Discovering frequent patterns in sensitive data," Proc. 16th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp. 503–512, 2010.
[14] N. Li, W. Qardaji, D. Su, and J. Cao, "PrivBasis: Frequent itemset mining with differential privacy," Proc. VLDB Endowment, vol. 5, no. 11, pp. 1340–1351, 2012.
[15] F. McSherry and K. Talwar, "Mechanism design via differential privacy," Proc. 48th Annu. IEEE Symp. on Foundations of Computer Science, pp. 94–103, 2007.