
行政院國家科學委員會專題研究計畫成果報告 (National Science Council Project Research Report) - CHUR

Academic year: 2023


The BTP-tree can still distribute the load among different nodes and reduce the computation time. The parallel algorithms we designed can, and do, exploit the computing resources of a cluster system to reduce computation time.

Fig. 1. Examples of feature vectors

August 26-28, 2008, Hangzhou, China

Published in the Proceedings of the 2008 IEEE International Conference on Granular Computing.

A Weighted Load-Balancing Parallel Apriori Algorithm for Association Rule Mining

Abstract

Introduction

Related Work

When the support and confidence of a rule are greater than or equal to the predetermined minimum support and minimum confidence, the association rule is considered valid. Even though the Apriori algorithm spends considerable time computing combinations of itemsets, the design of its data structure makes the algorithm easy to parallelize.
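As a concrete illustration of this validity test, consider the following sketch; the four-transaction database and item names are invented for the example:

```python
# Invented four-transaction database; item names are illustrative only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(A + B) / support(A) for the rule A -> B."""
    return support(antecedent | consequent, db) / support(antecedent, db)

# Rule {milk} -> {bread}: support = 2/4 = 0.5, confidence = 0.5/0.75 = 0.667
s = support({"milk", "bread"}, transactions)
c = confidence({"milk"}, {"bread"}, transactions)
```

A rule is kept only when `s` and `c` both clear the predetermined minimum support and confidence thresholds.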

Weighted Distributed Parallel Apriori (WDPA) Algorithm

Ye's algorithm distributes the computing workload using a Trie structure to speed up computation, but this causes significant differences between the sizes of the candidate itemsets distributed among processors. By storing the TIDs of itemsets and precisely calculating and distributing the computational workload, WDPA effectively accelerates itemset computation, reduces the number of database scans required, and balances the load, thereby significantly reducing processor idle time.
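The effect of weighting candidates by TID length can be sketched as follows. The greedy least-loaded assignment and the toy TID lists are illustrative assumptions, not WDPA's exact weighting formula:

```python
def distribute(itemset_tids, p):
    """Assign each itemset, heaviest TID list first, to the currently
    least-loaded of p processors; the load unit is the TID-list length."""
    loads = [0] * p
    parts = [[] for _ in range(p)]
    for itemset, tids in sorted(itemset_tids.items(), key=lambda kv: -len(kv[1])):
        i = loads.index(min(loads))        # least-loaded processor so far
        parts[i].append(itemset)
        loads[i] += len(tids)
    return parts, loads

# Toy TID lists: itemset "A" occurs in 4 transactions, "B" in 3, ...
tids = {"A": [1, 2, 3, 4], "B": [1, 2, 3], "C": [2, 3], "D": [4]}
parts, loads = distribute(tids, 2)         # balanced: loads == [5, 5]
```

Distributing by the number of candidates alone would give each processor two itemsets but unequal work (7 vs. 3 TIDs here); weighting by TID length is what keeps the loads even.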


  • Step 1. Each processor reads the database DB.
  • Step 2. Each processor scans DB and creates the transaction identification (TID) sets.
  • Step 3. Each processor computes the candidate k-itemset counts; when a count is greater than s, the itemset is added to freq_k.
  • Step 4. MN equally divides freq_k into p disjoint partitions and assigns itemsets_i to processor p_i.
  • Step 5. Each processor receives its itemsets_i and generates the candidate (k+1)-itemsets.
  • Step 6. Each processor counts its candidate itemsets from the TID tables.
  • Step 7. When an itemset's count is greater than s, it is a frequent itemset.
  • Step 8. The SPs send their frequent itemsets to MN.

Experiments

Each processor computes the candidate k-itemset counts; when a count is greater than s, let freq_k be the frequent k-itemsets. Step 9. MN receives the SPs' itemsets and repeats steps 4 to 9 until there are no more frequent itemsets.
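The TID-table counting behind these steps can be sketched as follows, with a toy four-transaction database; the actual algorithm maintains such tables per processor:

```python
# Toy database: TID -> set of items.
db = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"b", "c"},
}

# One scan builds the TID set of every 1-itemset.
tid = {}
for t, items in db.items():
    for item in items:
        tid.setdefault(frozenset([item]), set()).add(t)

# A candidate 2-itemset {a, b} is counted by intersecting TID sets,
# with no further scan of the database.
count = len(tid[frozenset(["a"])] & tid[frozenset(["b"])])
```

Because every candidate count is a set intersection over stored TIDs, the database itself is scanned only once.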

The datasets T10I4D50N100K, T10I4D100N100K, and T10I4D200N100K were used to examine the performance of WDPA.

Conclusion

In this paper, a weighted distributed parallel Apriori algorithm (WDPA) is proposed in which the TIDs of itemsets are stored in a table to calculate their occurrences. WDPA effectively reduces the number of database scans required as well as speeding up itemset computation.

Most frequent pattern mining algorithms can be classified into two categories: the generate-and-test approach (Apriori-like) and the pattern-growth approach (FP-tree). In recent years, many techniques have been proposed for frequent pattern mining based on the FP-tree approach, since it requires only two database scans. In this paper, two parallel mining algorithms are proposed: the Tidset-based Parallel FP-tree (TPFP-tree) and the Balanced Tidset-based Parallel FP-tree (BTP-tree), for frequent pattern mining on PC clusters and multi-cluster grids.

The basic problem in frequent pattern searching is finding the number of times a particular pattern appears in a database. For example, 2^50 (approximately 10^15) candidate itemsets may need to be verified to determine whether a set appears frequently in a 50-item database. Han et al. (2004) proposed a new data structure and method for mining frequent patterns: the Frequent Pattern (FP) tree, which stores only the compressed information necessary for mining. But even though the FP-tree performed better, the execution time still increased significantly when the database was large.


The objective of the proposed Tidset-based Parallel FP-tree (TPFP-tree) algorithm was to reduce both communication and tree insertion costs, thereby reducing the execution time. The experimental results show that the proposed algorithms – TPFP-tree and BTP-tree – could reduce the execution time for different datasets on a PC-Cluster and a multi-cluster grid, respectively. The execution time increases significantly if the database size is large or the given support is small.

Figs. 6 and 7 show the execution times of the BTP-tree, TPFP-tree and PFP-tree for different numbers of processors on the 50k dataset at the given threshold. Moreover, it can be seen that in a heterogeneous computing environment (three types of CPUs in this case), balancing the workload can reduce the execution time. However, the execution time increases significantly with an increase in database size and a decrease in the given threshold.

Fig. 1 is an example of a header and Tidset table for four processors. Fig. 1a shows the database equally partitioned into four parts, with each transaction's local identity (TID).

Parallel Branch-and-Bound Approach with MPI Technology in Inferring Chemical Compounds with Path Frequency

  • The BB-CIPF algorithm
  • Message Passing Interface
  • Parallel BB-CIPF (PB-CIPF)
    • The procedure of PB-CIPF
  • Produce candidate compounds: run BFS(T_target).
  • Gather candidate compounds as tasks; block the tasks by the number of computing nodes.
  • Assign each block of tasks to a slave computing node.

The algorithm assigns each chemical compound a feature vector based on the frequency of small fragments. If K is large, the number of dimensions of a feature vector will be large (exponential in K). In the BFS stage, the first step, the master node loads the target compound and calculates its feature vector.
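A minimal sketch of such a path-based feature vector, assuming a toy vertex-labeled graph; details (for example whether label paths may revisit vertices) may differ from the encoding actually used:

```python
from collections import Counter

def path_frequency(adj, labels, K):
    """Count the label sequence of every walk of at most K edges
    (a simple stand-in for a path-frequency feature vector)."""
    freq = Counter()
    def walk(v, seq):
        freq[tuple(seq)] += 1
        if len(seq) - 1 < K:               # fewer than K edges so far
            for w in adj[v]:
                walk(w, seq + [labels[w]])
    for v in adj:
        walk(v, [labels[v]])
    return freq

# Toy C-C-O chain: vertices 0-1-2 with labels C, C, O.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
f = path_frequency(adj, labels, 1)
```

Already for this three-atom chain the vector has five nonzero dimensions; the dimension count grows exponentially with K, which is why large K is expensive.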

After inserting an atom, the candidate compound's feature vector is compared to the target compound's. If the candidate compound's feature vector matches part of the target compound's structure, the atom is retained. If a candidate compound has the same feature vector as the target structure, a solution has been found.

Figure 1 shows that, given a target x, φ(x) is found with a kernel method and the compound is then inferred.


Figure 5: An example of the BFS stage of PB-CIPF

After the tasks are assigned by the master computing node, each slave computing node uses the Depth-First-Search (DFS) approach to insert an atom into a candidate compound. If the candidate's feature vector no longer matches part of the target compound's structure, the atom is dropped, and the DFS approach continues by inserting another atom into the candidate compound. The procedure of PB-CIPF is listed below; the implemented code contains more details.

  • Step 1. Store all computed candidate connections in Tqueue, a queue holding all candidate connections.
  • Step 2. Run the DFS approach.
  • Step 3. Send the result to a queue that stores all matched results.
  • Step 4. If more tasks remain to be processed, go to Step 2; otherwise, end.
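The prune-and-extend DFS on the slave side can be sketched as follows. For brevity the feature vector is simplified to plain atom counts; the real algorithm compares path frequencies, but the pruning logic (drop any branch that is no longer part of the target) is the same:

```python
from collections import Counter

def dfs_infer(target, atoms, i=0, cand=None, out=None):
    """Enumerate atom multisets equal to `target`, pruning any branch
    whose counts already exceed the target's ("not part of the target")."""
    cand = Counter() if cand is None else cand
    out = [] if out is None else out
    if cand == target:
        out.append(Counter(cand))          # a matched candidate compound
        return out
    for j in range(i, len(atoms)):         # non-decreasing order: no duplicates
        a = atoms[j]
        cand[a] += 1
        if cand[a] <= target[a]:           # keep the atom only while it fits
            dfs_infer(target, atoms, j, cand, out)
        cand[a] -= 1
        if cand[a] == 0:
            del cand[a]
    return out

# Hypothetical target "compound": two carbons and one oxygen.
sols = dfs_infer(Counter({"C": 2, "O": 1}), ["C", "O"])
```

Every branch that exceeds the target is cut immediately, which is what keeps the exponential search tractable.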

Experimental results

i) Hydrogen atoms are added only if the frequencies of the other atoms already match the frequencies in the target feature vector. ii) When calculating f_next from T_next and f_cur, only the paths starting and ending at the new node are calculated. iii) Benzene rings can be added as whole units, where structural information about benzene is used to calculate the feature vectors. iv) A benzene ring is given as the initial structure when the compound is small and contains a benzene ring. We scaled up the number of compute nodes to test whether doing so can reduce the computing time.

If the computation time on one node is t0 and on two nodes is t1, the speedup is t0/t1. When the number of computing nodes is increased to four, the average speedup ratio is about 1.9. Our experiments show that the proposed algorithm reduces the computation time.
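The speedup measure referred to here is simply the single-node time divided by the multi-node time; the timing values below are invented for illustration:

```python
def speedup(t_single, t_multi):
    """Speedup ratio: time on one node divided by time on several nodes."""
    return t_single / t_multi

# Hypothetical timings: 100 s on one node vs. 52 s on four nodes,
# giving a ratio of about 1.9.
ratio = speedup(100.0, 52.0)
```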

Figures 7 to 10 show the makespans for different target compounds of different sizes.

Chemical Compounds with Path Frequency Using Multi-Core Technology

1 Introduction

In these approaches, chemical compounds are mapped to feature vectors, and then SVMs [9, 10] are used to learn the rules for classifying these feature vectors. Several feature vector mapping methods have been proposed; among them are mappings based on the frequency of labeled paths [6, 7] or the frequency of small fragments in chemical compounds [4, 5]. In kernel methods, an object in the input space can be mapped to a point (or feature vector) in a space called the feature space.

Using an appropriate function, a given point y in the feature space is mapped back to an object in the input space. The problem arises when mapping a given y in feature space back to an object x in input space such that y = φ(x) is satisfied, since such an x may not exist. In this study, we consider that, as chemical compounds become more and more complex, the computational time required to infer preimages from their feature vectors increases much faster.

2 Related Work

In [1], a feature vector g is a multiset of label strings of length at most K that represents the path frequency. In previous works [1, 2], a graph can be inferred from the numbers of occurrences of vertex-labeled paths. For example, three different objects a, b, and c may all correspond to the same feature vector v.

Therefore, an important issue in this problem is how to produce all possible compounds that are derived from the same feature vector but differ in their molecular structures. In this paper, we extend the inference algorithm of [3] to obtain all such compounds.

3 Multi-Core Chemical Compound Inference from Path Frequency (MC-CIPF)

However, extracting more chemical compounds also means that the algorithm consumes more computation time. In the first step of MC-CIPF, the algorithm loads a target compound into the main core for the enumeration of all other chemical compounds that share the same feature vector. The main core uses the Breadth-First-Search (BFS) algorithm to analyze the target compound and obtain its path frequency for later job distribution.

Each job is initiated based on the atoms that exist in the target compound (Fig. 1). Each core applies the Depth-First-Search (DFS) algorithm to insert an atom into a candidate compound. After inserting an atom, the candidate compound is compared with the target compound; if the feature vector of the candidate compound matches part of the target compound's structure, the inserted atom is retained.
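The per-atom job fan-out can be sketched with a standard-library pool; `expand` is a placeholder for the per-atom DFS, and a thread-backed pool is used only to keep the sketch self-contained (CPU-bound DFS work would run on a process-based pool to use separate cores):

```python
from multiprocessing.dummy import Pool    # thread-backed Pool, same API

def expand(atom):
    # Placeholder: a real worker would DFS-extend candidates from `atom`.
    return [atom]

target_atoms = ["C", "N", "O"]            # atoms present in the target compound
with Pool(4) as pool:
    results = pool.map(expand, target_atoms)   # one job per starting atom
```

`Pool.map` returns the per-atom results in submission order, so the main core can merge them deterministically.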

Fig. 2. An example of balancing the load in each core in MC-CIPF.

4 Experimental Results

In each case, we found that the computing time decreased as the number of cores increased. As a result, MC-CIPF spends less computing time searching for combinations of the target compound, since fewer variations are allowed. However, as K grows the path frequency becomes longer, so MC-CIPF needs more execution time to recalculate it.

Consequently, the shortest computation time occurs when K equals 2 in the experiment, since the number of constraints does not increase too much and the path frequency is not too long. More importantly, we compare the speedup ratios of MC-CIPF against the baseline used in the experiments. In these figures, the speedup ratios increase from 1 core to 4 cores, with the best ratio close to 3.

Table 1. Computing time of MC-CIPF for various chemical compounds.

5 Conclusions

Due to the exponential growth of the global information supply, companies have to deal with an increasing amount of digital information. One of the most important challenges in data mining is finding the relationships between data quickly and correctly. Calculating only the lattice number and ignoring the TID lengths of the itemsets creates an uneven workload distribution.

To evaluate the performance of the proposed algorithm, WDPA was implemented alongside the algorithm proposed by Ye [12]. The experimental results show that our method balances the workload among processors and saves processor idle time thanks to the way CWT distributes itemsets. As the database size increases, the TID lengths of the itemsets in the table increase.

Figure 1. Block partitioning


Fig. 2. Procedure diagram of PB-CIPF
Table 1. Computation time of each PFP-tree phase (t20.i04.d200k.n100k, threshold = 0.0005)
Table 3. Hardware and Software Specification of Multi-cluster Grid (Cluster 1, Cluster 2, Cluster 3)
