• Tidak ada hasil yang ditemukan

Common Association Rules for Dispersed Information Systems

N/A
N/A
Protected

Academic year: 2024

Membagikan "Common Association Rules for Dispersed Information Systems"

Copied!
9
0
0

Teks penuh

(1)

Common Association Rules for Dispersed Information Systems

Item Type Conference Paper

Authors Moshkov, Mikhail;Zielosko, Beata;Tetteh, Evans Teiko Citation Moshkov, M., Zielosko, B., & Tetteh, E. T. (2022). Common

Association Rules for Dispersed Information Systems. Procedia Computer Science, 207, 4613–4620. https://doi.org/10.1016/

j.procs.2022.09.525 Eprint version Publisher's Version/PDF

DOI 10.1016/j.procs.2022.09.525

Publisher Elsevier BV

Rights Archived with thanks to Elsevier BV under a Creative Commons license, details at: http://creativecommons.org/licenses/by-nc- nd/4.0/

Download date 2024-01-16 22:27:30

Item License http://creativecommons.org/licenses/by-nc-nd/4.0/

Link to Item http://hdl.handle.net/10754/686399

(2)

ScienceDirect

Procedia Computer Science 207 (2022) 4613–4620

1877-0509 © 2022 The Authors. Published by Elsevier B.V.

This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)

Peer-review under responsibility of the scientific committee of the 26th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2022)

10.1016/j.procs.2022.09.525

10.1016/j.procs.2022.09.525 1877-0509

© 2022 The Authors. Published by Elsevier B.V.

This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)

Peer-review under responsibility of the scientific committee of the 26th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2022)

Procedia Computer Science 00 (2022) 000–000

www.elsevier.com/locate/procedia

26th International Conference on Knowledge-Based and Intelligent Information &

Engineering Systems (KES 2022)

Common Association Rules for Dispersed Information Systems

Mikhail Moshkov

a

, Beata Zielosko

b,

, Evans Teiko Tetteh

c

aComputer, Electrical and Mathematical Sciences and Engineering Division and Computational Bioscience Research Center King Abdullah University of Science and Technology (KAUST),

Thuwal 23955-6900, Saudi Arabia

bInstitute of Computer Science, University of Silesia in Katowice, B¸edzińska 39, 41-200 Sosnowiec, Poland

cDoctoral School, University of Silesia in Katowice, Bankowa 14, 40-007 Katowice, Poland

Abstract

Association rules are popular form for knowledge discovery domain. They are used for finding interesting relationships and patterns hidden in large data sets and in the area of associative classification, where usually rules with one item in the right-hand side are considered. There are many different approaches and algorithms for mining association rules. One of the most popular group are methods which are based on mining frequent itemsets, usually applied for data in transaction format. Such data can be transformed to binary information system which corresponds to matrix data format.

Technological development means that we are dealing with an increasing amount of data that can be heterogeneous, taking into account their format and location. In this paper, we assume that dispersed data is represented by a finite setSof information systems with equal sets of attributes. We discuss one of the possible ways to the study association rules common to all information systems from the setS: building a joint information system for which the set of true association rules that are realizable for a given rowr and have given attribute f on the right-hand side coincides with the set of association rules that are true for all information systems fromS, are realizable for the rowr, and have the attributef on the right-hand side. We show how to build a joint information system in a polynomial time.

When we build such an information system, we can apply to it various association rule learning algorithms.

© 2022 The Authors. Published by Elsevier B.V.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/) Peer-review under responsibility of the scientific committee of the KES International.

Keywords: Dispersed data; Set of information systems; Association rules

Corresponding author

E-mail address:[email protected]

1877-0509 © 2022 The Authors. Published by Elsevier B.V.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/) Peer-review under responsibility of the scientific committee of the KES International.

Procedia Computer Science 00 (2022) 000–000

www.elsevier.com/locate/procedia

26th International Conference on Knowledge-Based and Intelligent Information &

Engineering Systems (KES 2022)

Common Association Rules for Dispersed Information Systems

Mikhail Moshkov

a

, Beata Zielosko

b,

, Evans Teiko Tetteh

c

aComputer, Electrical and Mathematical Sciences and Engineering Division and Computational Bioscience Research Center King Abdullah University of Science and Technology (KAUST),

Thuwal 23955-6900, Saudi Arabia

bInstitute of Computer Science, University of Silesia in Katowice, B¸edzińska 39, 41-200 Sosnowiec, Poland

cDoctoral School, University of Silesia in Katowice, Bankowa 14, 40-007 Katowice, Poland

Abstract

Association rules are popular form for knowledge discovery domain. They are used for finding interesting relationships and patterns hidden in large data sets and in the area of associative classification, where usually rules with one item in the right-hand side are considered. There are many different approaches and algorithms for mining association rules. One of the most popular group are methods which are based on mining frequent itemsets, usually applied for data in transaction format. Such data can be transformed to binary information system which corresponds to matrix data format.

Technological development means that we are dealing with an increasing amount of data that can be heterogeneous, taking into account their format and location. In this paper, we assume that dispersed data is represented by a finite setSof information systems with equal sets of attributes. We discuss one of the possible ways to the study association rules common to all information systems from the setS: building a joint information system for which the set of true association rules that are realizable for a given rowr and have given attribute f on the right-hand side coincides with the set of association rules that are true for all information systems fromS, are realizable for the rowr, and have the attributef on the right-hand side. We show how to build a joint information system in a polynomial time.

When we build such an information system, we can apply to it various association rule learning algorithms.

© 2022 The Authors. Published by Elsevier B.V.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/) Peer-review under responsibility of the scientific committee of the KES International.

Keywords: Dispersed data; Set of information systems; Association rules

Corresponding author

E-mail address:[email protected]

1877-0509 © 2022 The Authors. Published by Elsevier B.V.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/) Peer-review under responsibility of the scientific committee of the KES International.

(3)

4614 Mikhail Moshkov et al. / Procedia Computer Science 207 (2022) 4613–4620

1. Introduction

One of the consequences of technological development is a large set of data that logically belong to the same system but physically exists scattered taking into account their format and location. These data sources are collected and processed in order to extract knowledge supporting decision-making processes. Dispersed data is often represented by a finite set of decision tables [1, 2]. However, data can also be represented by information systems [3,4].

Association rules are known and popular data mining technique. Mainly they are used for finding correla- tions and interesting relationships among data. The most popular application of association rules is market basket analysis (MBA) which leads to finding associations between different items that customers place in their shopping baskets. The are many approaches and algorithms for mining association rules. One of the first, popular approach, on the basis of which many modifications of algorithms were created, concern fre- quent itemsets, i.e., set of items, that appear frequently together in a transaction data set. Other approaches are based on vertical data format, GUHA method or mining association rules based on information system, including dispersed data [5].

In this paper, we assume that we have a finite set S = {I1, . . . , Ik} of information systems in which columns are labeled with the same attributesf1, . . . , fn. We fix a rowrfrom one of the information systems from S and an attribute fj ∈ {f1, . . . , fn}, and we consider the setArules(S, r, fj) of association rules of the form (fi1 =b1)∧ · · · ∧(fim =bm)(fj =b) that are true for each information system fromS and are realizable for the rowr (cover rowr). Our aim is to create tools for the work with association rules from this set.

To apply different algorithms to the set of information systemsS, we need to build an information system J (joint information system forS, r, and fj) such that Arules({J}, r, fj) =Arules(S, r, fj). In this paper, we show how to build a joint information systems in a polynomial time.

The rest of the paper is organized as follows. In Sect.2background information about heterogeneous and distributed data are presented. It also includes basic terms related to association rules and approaches for their construction. Section3, contains description of main notions related to common association rules and approach for study joint information systems. Section4contains short conclusions.

2. Background Information

In this section some basic information related to dispersed and distributed data and association rules are provided.

2.1. Heterogeneous and Distributed Data

Technological development means that we are dealing with an increasing amount of data. It is collected and processed in order to extract knowledge supporting decision-making processes. Data collected by differ- ent companies and institutions are often stored in heterogeneous and distributed systems [6]. Heterogeneity means that such systems can have different structures and functionality and use various data models. In addition, the data comes from geographic dispersed different data sources. One of the possible solution in order to provide decision makers access to heterogeneous and distributed data are integrating systems which ensure data transformation into common model and storage in a central system, called a data ware- house [7,8]. The ETL (Extraction Transformation Loading) software is one of the main elements of a data warehouse, which is carried out, among others, for transformation data into a common format, consolidation, cleaning and data aggregation [9]. Then, unified data is loaded into a central data warehouse and analytical applications are used for knowledge discovery by analyzing trends, anomalies or searching patterns.

A domain which is important in the context of dispersed data is DDM (Distributed Data Mining) which deals with the problem of finding data patterns in an environment with distributed data and computation [10, 11]. It can be said that DDM offers an alternative approach to the analysis of distributed data that requires minimal data communication. Usually DDM algorithms involve local data analysis and generation of a

(4)

1. Introduction

One of the consequences of technological development is a large set of data that logically belong to the same system but physically exists scattered taking into account their format and location. These data sources are collected and processed in order to extract knowledge supporting decision-making processes. Dispersed data is often represented by a finite set of decision tables [1,2]. However, data can also be represented by information systems [3, 4].

Association rules are known and popular data mining technique. Mainly they are used for finding correla- tions and interesting relationships among data. The most popular application of association rules is market basket analysis (MBA) which leads to finding associations between different items that customers place in their shopping baskets. The are many approaches and algorithms for mining association rules. One of the first, popular approach, on the basis of which many modifications of algorithms were created, concern fre- quent itemsets, i.e., set of items, that appear frequently together in a transaction data set. Other approaches are based on vertical data format, GUHA method or mining association rules based on information system, including dispersed data [5].

In this paper, we assume that we have a finite set S = {I1, . . . , Ik} of information systems in which columns are labeled with the same attributesf1, . . . , fn. We fix a rowrfrom one of the information systems from S and an attribute fj ∈ {f1, . . . , fn}, and we consider the set Arules(S, r, fj) of association rules of the form (fi1=b1)∧ · · · ∧(fim =bm)(fj=b) that are true for each information system fromS and are realizable for the row r (cover rowr). Our aim is to create tools for the work with association rules from this set.

To apply different algorithms to the set of information systemsS, we need to build an information system J (joint information system forS, r, andfj) such that Arules({J}, r, fj) =Arules(S, r, fj). In this paper, we show how to build a joint information systems in a polynomial time.

The rest of the paper is organized as follows. In Sect.2background information about heterogeneous and distributed data are presented. It also includes basic terms related to association rules and approaches for their construction. Section3, contains description of main notions related to common association rules and approach for study joint information systems. Section 4contains short conclusions.

2. Background Information

In this section some basic information related to dispersed and distributed data and association rules are provided.

2.1. Heterogeneous and Distributed Data

Technological development means that we are dealing with an increasing amount of data. It is collected and processed in order to extract knowledge supporting decision-making processes. Data collected by differ- ent companies and institutions are often stored in heterogeneous and distributed systems [6]. Heterogeneity means that such systems can have different structures and functionality and use various data models. In addition, the data comes from geographic dispersed different data sources. One of the possible solution in order to provide decision makers access to heterogeneous and distributed data are integrating systems which ensure data transformation into common model and storage in a central system, called a data ware- house [7,8]. The ETL (Extraction Transformation Loading) software is one of the main elements of a data warehouse, which is carried out, among others, for transformation data into a common format, consolidation, cleaning and data aggregation [9]. Then, unified data is loaded into a central data warehouse and analytical applications are used for knowledge discovery by analyzing trends, anomalies or searching patterns.

A domain which is important in the context of dispersed data is DDM (Distributed Data Mining) which deals with the problem of finding data patterns in an environment with distributed data and computation [10, 11]. It can be said that DDM offers an alternative approach to the analysis of distributed data that requires minimal data communication. Usually DDM algorithms involve local data analysis and generation of a

global data model by combining the results of the local analysis. Distributed Data Mining processes can be applied on either homogeneous or heterogeneous local levels [12] and can be classified into two main approaches: based on meta-learning system and based on multi-agent system [13]. Meta-learning system constitutes a learning method at local level. The predicted method from the local level is communicated to the global level. Learning at the meta-level is based on accumulating experience on the performance of multiple applications of a learning system [14]. Multi-agent system provides heterogeneous environment mining by collaborative working. The agent behavior in distributed environment is based on the data from different distributed sources. In such environment interoperability and performance measures are indicated as important features. Interoperability concerns working collaboratively with other agents in the entire system. Performance measures can be improved or impaired by the data distribution in local level.

The approach proposed in the paper allows to create a joint information system and study so-called common association rules based on distributed, homogeneous i.e, with equal sets of attributes, data sources.

2.2. Association Rules

Association rules are popular form for knowledge discovery domain [15]. Such rules allow to find interesting relationships and patterns hidden in large data sets. Therefore mining of association rules has many different applications, as for example, medicine, sales planning, banking, stock trading, telecommunication and others.

It can be said that the application relates to areas, domains where customers collectively purchase a certain set of items or services. Association rules are related to frequent patterns, i.e. itemsets that appear frequently in a data set [16]. Frequent itemset can be considered as a set of items, that appear frequently together in a transaction data set. Knowledge discovered on the basis of known relationships, patterns is used in decision support processes in various domain and at various levels. One of the first and the most popular application for mining association rules is market basket analysis. It leads to finding associations between different items that customers place in their shopping baskets. On this basis, it is possible to plan marketing strategies and support management in making business decisions regarding the sale of products. Other popular application of association rules is associative classification where such rules are used for the construction of classifiers [17, 18,19].

The concept of association rules was introduced in the framework of General Unary Hypotheses Automa- ton (GUHA) method [20]. It is a general theory of mechanized hypothesis formulation based on mathematical logic and statistic. There exists similarity with the notion of association rules introduced by Agrawal [21]

where association rules in the framework of GUHA method are considered as general relations (not only conjunctions of attribute-value pairs) between general Boolean attributes derived from columns of analysed data matrix.

Association rules can be defined in many ways. LetI={I1, . . . , Im}be an itemset, i.e., set of items and D be a database of transactions where each transaction is a nonempty itemset, T ⊆D. Let A be a set of items, and it can be said that transactionT containAifA⊆T. An association rule is presented in a form

A→B, (1)

where A⊂I, B⊂I,A∅,B∅, and A∩B =.

There are different measures of quality of association rules [22, 23]. The most popular ones are support and confidence. Support suppof the rule (1) in the transaction setD is a percentage of transactions in D that containA∪B. It indicates the probability that a transaction contains the union of setsAandB.

supp(A→B) =P(A∪B). (2)

(5)

4616 Mikhail Moshkov et al. / Procedia Computer Science 207 (2022) 4613–4620

Confidenceconf of the rule (1) in the transaction setD is the percentage of transactions inDcontaining Athat also containB, it indicates conditional probabilityP(B|A),

conf(A→B) =P(B|A). (3)

These two measures are associated in the way:

conf(A→B) =P(B|A) =sup(A∪B)

supp(A) . (4)

Rules that satisfy both minimum support (min_supp) and confidence thresholds (min_conf) are called strong association rules.

There are different kinds of association rules: boolean association rules, which are used in market basket analysis [16], multilevel association rules [24], qualitiative association rules [25] which are induced from buisness data, spatial association rules [26] and others.

In this paper, association rules with one attribute on the right-hand side are considered. Such form of rules was used in Apriori algorithm implemented by Borgelt [27] and also presented in other papers [28,29,30].

In general, mining of association rules can be considered as a process that consists of two main stages:

• Find all frequent itemsets, i.e., they occur at least as frequently as a predetermined minimum support threshold,min_supp,

• Generate strong association rules from the frequent itemsets, i.e., rules that satisfy minimum support and minimum confidence thresholds.

There are different algorithms for the construction and optimization of association rules for single in- formation system, for example different greedy heuristics [29, 31] and dynamic programming approach for optimization of association rules relative to length and coverage [32].

The classical algorithm for the construction of association rules is the Apriori one proposed by Agarwal [25]

and used for data in transaction format. However, it should be noted that such data can be transformed to binary information system. The name Apriori was chosen because the algorithm takes into account a prior knowledge of previous selected frequent itemset and in this case the search space becomes narrowed during process of association rules construction. The Apriori algorithm has many modifications mainly because of its limitation as the number of scans over the entire database of transactions. Some of these variations are described in [16,33] namely: Hashing and Pruning, Partitioning, Sampling, Dynamic Itemset Counting, Pattern-Growth Approach, Vertical Data Format, Closed and Max Patterns and many others [34,35, 23].

The Frequent pattern growth algorithm [16] (Fp-Growth) is similar to the Apriori one and it is somewhat an improved way to deal with the computational cost of this algorithm, by scanning the entire database for frequent itemsets. The Fp-Growth algorithm uses the divide-and-conquer approach for finding frequent itemsets.

Hashing and pruning technique takes into account tackling the reduction of the candidate items as pro- posed in [36]. The support in this method of association rule mining is counted in the manner of mapping each generated group of itemsets into buckets of hash tables where these buckets have unique integer labels based on some hash function and with their bucket counts. The bucket count less than the minimum support is discarded.

The partitioning algorithm [37] follows the notion of partitioning the database into m-parts which is expected to overcome memory issues that may arise as a result of large databases. Here the minimum support for the entire database denoted asmin_supis then accounted for in each partition as the min_sup

×number of transactions in the partition. Each partition then generates a frequent itemset which may not

(6)

Confidenceconf of the rule (1) in the transaction setDis the percentage of transactions inDcontaining A that also containB, it indicates conditional probabilityP(B|A),

conf(A→B) =P(B|A). (3)

These two measures are associated in the way:

conf(A→B) =P(B|A) = sup(A∪B)

supp(A) . (4)

Rules that satisfy both minimum support (min_supp) and confidence thresholds (min_conf) are called strong association rules.

There are different kinds of association rules: boolean association rules, which are used in market basket analysis [16], multilevel association rules [24], qualitiative association rules [25] which are induced from buisness data, spatial association rules [26] and others.

In this paper, association rules with one attribute on the right-hand side are considered. Such form of rules was used in Apriori algorithm implemented by Borgelt [27] and also presented in other papers [28,29,30].

In general, mining of association rules can be considered as a process that consists of two main stages:

• Find all frequent itemsets, i.e., they occur at least as frequently as a predetermined minimum support threshold,min_supp,

• Generate strong association rules from the frequent itemsets, i.e., rules that satisfy minimum support and minimum confidence thresholds.

There are different algorithms for the construction and optimization of association rules for single in- formation system, for example different greedy heuristics [29, 31] and dynamic programming approach for optimization of association rules relative to length and coverage [32].

The classical algorithm for the construction of association rules is the Apriori one proposed by Agarwal [25]

and used for data in transaction format. However, it should be noted that such data can be transformed to binary information system. The name Apriori was chosen because the algorithm takes into account a prior knowledge of previous selected frequent itemset and in this case the search space becomes narrowed during process of association rules construction. The Apriori algorithm has many modifications mainly because of its limitation as the number of scans over the entire database of transactions. Some of these variations are described in [16,33] namely: Hashing and Pruning, Partitioning, Sampling, Dynamic Itemset Counting, Pattern-Growth Approach, Vertical Data Format, Closed and Max Patterns and many others [34,35,23].

The Frequent pattern growth algorithm [16] (Fp-Growth) is similar to the Apriori one and it is somewhat an improved way to deal with the computational cost of this algorithm, by scanning the entire database for frequent itemsets. The Fp-Growth algorithm uses the divide-and-conquer approach for finding frequent itemsets.

Hashing and pruning technique takes into account tackling the reduction of the candidate items as pro- posed in [36]. The support in this method of association rule mining is counted in the manner of mapping each generated group of itemsets into buckets of hash tables where these buckets have unique integer labels based on some hash function and with their bucket counts. The bucket count less than the minimum support is discarded.

The partitioning algorithm [37] follows the notion of partitioning the database into m-parts which is expected to overcome memory issues that may arise as a result of large databases. Here the minimum support for the entire database denoted asmin_supis then accounted for in each partition as the min_sup

×number of transactions in the partition. Each partition then generates a frequent itemset which may not

be necessarily frequent in relation to the entire database. A combination of all frequent itemsets from all partitions forms the general frequent itemset of the entire database. However, a second scan is done for the collection of frequent itemsets from all partitions and pruned with the overall minimum support.

The sampling approach [38] basically takes samples of the transactions of the database that fits the main memory. However, the main drawback is that there may be other itemsets that were not chosen as samples.

In this case, a much lower relevant minimum support compared to the actual one is used as the threshold.

3. Common Association Rules

In this section, main notions for mining so-called common association rules and approach for creation of joint information system based on distributed data are presented.

3.1. Main Notions

An information system I is a table filled with numbers from the set ω = {0,1,2, . . .} of nonnegative integers in which columns are labeled with attributesf1, . . . , fn. Each rowrof the information system I is interpreted as an object, and the number in the intersection of the row r and the columnfi is interpreted as the valuefi(r) of the attributefi for the object r.

Any association rule over the set of attributes {f1, . . . , fn}can be represented in the form

(fi1 =b1)∧ · · · ∧(fim =bm)(fj =b), (5)

where fj ∈ {f1, . . . , fn}, fi1, . . . , fim ∈ {f1, . . . , fn} \ {fj}, and b1, . . . , bm, b ω. We will say that this rule is based on the attribute fj. The rule (5) is called realizable for a row r = (a1, . . . , an) ωn if ai1 = b1, . . . , aim =bm. This rule is called true for the information system I if, for any rowr ofI such that the rule (5) is realizable forr,fj(r) =b– see Fig.1.

I0=

f1 f2 f3 f4

1 0 1 0

0 0 0 1

0 0 1 1

0 1 0 0

(f1= 0)(f2= 0)(f4= 1)

Figure 1. Information systemI0and association rule, which is based on the attributef4, is true for the information systemI0, and is realizable for the row (0,0,0,1)

3.2. Joint Information Systems

Let S = {I1, . . . , Ik} be a finite nonempty set of information systems in which columns are labeled with the same attributes f1, . . . , fn. Let r = (a1, . . . , an) be a row of an information system from S and fj ∈ {f1, . . . , fn}. We denote by Arules(S, r, fj) the set of association rules over the set of attributes {f1, . . . , fn} each of which is based on the attribute fj, is realizable for the row r, and is true for each information system fromS.

Our aim is to construct so called joint information systemJ for which

Arules({J}, r, fj) =Arules(S, r, fj). (6)

In the information systemJ =J(S, r, fj), columns are labeled with the attributesf1, . . . , fn. This infor- mation system contains rowrand all rowsrfrom the information systemsI1, . . . , Iksuch thatfj(r)fj(r)

(7)

4618 Mikhail Moshkov et al. / Procedia Computer Science 207 (2022) 4613–4620

(we keep only one row from any group of equal rows) – see Fig.2. Note that the information systemJ can be constructed in a polynomial time.

It is easy to show that the set of rulesArules({J}, r, fj)∪Arules(S, r, fj) is a subset of the setAof rules of the form

(fi1=ai1)∧ · · · ∧(fim=aim)(fj =aj), (7)

wherefi1, . . . , fim ∈ {f1, . . . , fn} \ {fj}. To show that the equality (6) holds, it is enough to prove that, for any ruleρ∈ A, ρArules({J}, r, fj) if and only ifρ Arules(S, r, fj). It is clear that each rule from A is based on the attribute fj and is realizable for the row r. Let ρArules({J}, r, fj). Then the rule ρis not true forJ and there exists a rowr fromJ such thatρis realizable forr andfj(r)fj(r). It is clear that r is a row from an information systemIi from S. Then ρis not true for Ii and ρArules(S, r, fj).

LetρArules(S, r, fj). Then there exists an information systemIi ∈S for whichρ is not true and there exists a rowr fromIisuch thatρis realizable forr andfj(r)fj(r). It is clear thatr is a row from the information systemJ. Thenρis not true forJ, andρArules({J}, r, fj). Thus, the equality (6) holds.

I1=

f1 f2 f3

1 1 1

0 1 0

1 1 0

I2=

f1 f2 f3

1 1 1

0 1 1

1 0 0

J(S, r, f3) =

f1 f2 f3

1 0 0

1 1 1

0 1 1

Figure 2. Joint information systemJ(S, r, f3) for the set of information systemsS2={I1, I2}, rowr= (1,0,0), and attributef3

4. Conclusions

Association rules plays important role in knowledge discovery domain because they are used for finding patters and interesting relationships hidden in the data. They have many applications especially in the domains where customers collectively purchase a certain set of items or services. Knowledge discovered on this basis is used in decision support systems in various domains. There are plenty of algorithms for mining of association rules which are based on frequent itemsets (Apriori algorithm), vertical data format (Eclat algorithm) and information systems (greedy heuristics, dynamic programming approaches). It should be also noted that transaction data based on which association rules are mined can also be dispersed in different locations which affects the efficiency of induction methods.

In this paper we have shown how to reduce the problem of the study of common association rules for a dispersed set of information systems with equal sets of attributes to the study of association rules for a single information system. The proposed approach allows us to generalize known methods for the study of association rules for single information systems to the case of dispersed information systems with equal sets of attributes.

In the future works, different approaches to mining association rules, for distributed data, will be studied.

Acknowledgments

Research reported in this publication was supported by King Abdullah University of Science and Tech- nology (KAUST).

(8)

(we keep only one row from any group of equal rows) – see Fig.2. Note that the information systemJ can be constructed in a polynomial time.

It is easy to show that the set of rulesArules({J}, r, fj)∪Arules(S, r, fj) is a subset of the setAof rules of the form

(fi1 =ai1)∧ · · · ∧(fim =aim)(fj=aj), (7)

where fi1, . . . , fim ∈ {f1, . . . , fn} \ {fj}. To show that the equality (6) holds, it is enough to prove that, for any rule ρ∈A, ρArules({J}, r, fj) if and only if ρ Arules(S, r, fj). It is clear that each rule from A is based on the attribute fj and is realizable for the row r. Letρ Arules({J}, r, fj). Then the rule ρ is not true for J and there exists a rowr fromJ such thatρis realizable forr and fj(r)fj(r). It is clear that r is a row from an information system Ii from S. Then ρis not true for Ii and ρArules(S, r, fj).

Let ρArules(S, r, fj). Then there exists an information systemIi ∈S for which ρis not true and there exists a rowr from Ii such thatρis realizable forr andfj(r)fj(r). It is clear thatr is a row from the information systemJ. Thenρis not true for J, andρArules({J}, r, fj). Thus, the equality (6) holds.

I1=

f1 f2 f3

1 1 1

0 1 0

1 1 0

I2=

f1 f2 f3

1 1 1

0 1 1

1 0 0

J(S, r, f3) =

f1 f2 f3

1 0 0

1 1 1

0 1 1

Figure 2. Joint information systemJ(S, r, f3) for the set of information systemsS2={I1, I2}, rowr= (1,0,0), and attributef3

4. Conclusions

Association rules plays important role in knowledge discovery domain because they are used for finding patters and interesting relationships hidden in the data. They have many applications especially in the domains where customers collectively purchase a certain set of items or services. Knowledge discovered on this basis is used in decision support systems in various domains. There are plenty of algorithms for mining of association rules which are based on frequent itemsets (Apriori algorithm), vertical data format (Eclat algorithm) and information systems (greedy heuristics, dynamic programming approaches). It should be also noted that transaction data based on which association rules are mined can also be dispersed in different locations which affects the efficiency of induction methods.

In this paper we have shown how to reduce the problem of the study of common association rules for a dispersed set of information systems with equal sets of attributes to the study of association rules for a single information system. The proposed approach allows us to generalize known methods for the study of association rules for single information systems to the case of dispersed information systems with equal sets of attributes.

In the future works, different approaches to mining association rules, for distributed data, will be studied.

Acknowledgments

Research reported in this publication was supported by King Abdullah University of Science and Tech- nology (KAUST).

References

[1] M. Moshkov, Decision trees and reducts for distributed decision tables, in: B. Dunin-Keplicz, A. Jankowski, A. Skowron, M. S. Szczuka (Eds.), Monitoring, Security, and Rescue Techniques in Multiagent Systems, MSRAS 2004, Plock, Poland, June 7-9, 2004, Vol. 28 of Advances in Soft Computing, Springer, 2004, pp. 239–248.

[2] D. Śl¸ezak, Decision value oriented decomposition of data tables, in: Z. W. Ras, A. Skowron (Eds.), Foundations of Intelligent Systems, 10th International Symposium, ISMIS ’97, Charlotte, North Carolina, USA, October 15-18, 1997, Proceedings, Vol. 1325 of Lecture Notes in Computer Science, Springer, 1997, pp. 487–496.

[3] Z. Pawlak, Rough Sets - Theoretical Aspects of Reasoning About Data, Vol. 9 of Theory and Decision Library: Series D, Kluwer, 1991.

[4] Z. Pawlak, A. Skowron, Rudiments of rough sets, Inf. Sci. 177 (1) (2007) 3–27.

[5] K. W. Lin, S.-H. Chung, A fast and resource efficient mining algorithm for discovering frequent patterns in distributed com- puting environments, Future Generation Computer Systems 52 (2015) 49–58, special Section: Cloud Computing: Security, Privacy and Practice.

[6] P. Amuthabala, R. Santhosh, Robust analysis and optimization of a novel efficient quality assurance model in data ware- housing, Computers & Electrical Engineering 74 (2019) 233–244.

[7] M. Mannino, S. N. Hong, I. J. Choi, Efficiency evaluation of data warehouse operations, Decision Support Systems 44 (4) (2008) 883–898.

[8] V. Theodorou, P. Jovanovic, A. Abellò, E. Nakuçi, Data generator for evaluating etl process quality, Information Systems 63 (2017) 80–100.

[9] M. Souibgui, F. Atigui, S. Zammali, S. Cherfi, S. B. Yahia, Data quality in etl process: A preliminary study, Procedia Computer Science 159 (2019) 676–687, knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 23rd International Conference KES2019.

[10] A. Cuzzocrea, Editorial: Models and algorithms for high-performance distributed data mining, J. Parallel Distrib. Comput.

73 (3) (2013) 281–283.

[11] Y. Fu, Distributed data mining: An overview, Newsletter of the IEEE Technical Committee on Distributed Processing 4 (3) (2001) 5–9.

[12] H. Kargupta, C. Kamath, P. Chan, Distributed and parallel data mining: emergence, growth, and future directions, Advances in Distributed and Parallel Knowledge Discovery (2000) 409–416.

[13] S. Urmela, M. Nandhini, A framework for distributed data mining heterogeneous classifier, Computer Communications 147 (2019) 58–75.

[14] R. Vilalta, C. Giraud-Carrier, P. Brazdil, Meta-Learning - Concepts and Techniques, Springer US, 2010, pp. 717–731.

[15] U. Stańczyk, B. Zielosko, L. C. Jain, Advances in feature selection for data and pattern recognition: An introduction, in:

U. Stańczyk, B. Zielosko, L. C. Jain (Eds.), Advances in Feature Selection for Data and Pattern Recognition, Vol. 138 of Intelligent Systems Reference Library, Springer, 2018, pp. 1–9.

[16] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.

[17] D. Bui-Thi, P. Meysman, K. Laukens, Momac: Multi-objective optimization to combine multiple association rules into an interpretable classification, Applied Intelligence 52 (3) (2022) 3090–3102.

[18] J. Mattiev, B. Kavsek, Coverage-based classification using association rule mining, Applied Sciences 10 (20) (2020).

[19] F. Säuberlich, W. Gaul, Decision tree construction by association rules, in: R. Decker, W. Gaul (Eds.), Classification and Information Processing at the Turn of the Millennium, Springer Berlin Heidelberg, Berlin, Heidelberg, 2000, pp. 245–253.

[20] P. Hájek, T. Hávranek, Mechanizing Hypothesis Formation - Mathematical Foundations for a General Theory, Springer- Verlag Berlin Heidelberg, 1978.

[21] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo, Advances in knowledge discovery and data mining, American Association for Artificial Intelligence, 1996, Ch. Fast Discovery of Association Rules, pp. 307–328.

[22] S. Greco, R. Słowiński, I. Szcz¸ech, Properties of rule interestingness measures and alternative approaches to normalization of measures, Information Sciences 216 (0) (2012) 1–16.

[23] A. Wieczorek, R. Słowiński, Generating a set of association and decision rules with statistically representative support and anti-support, Inf. Sci. 277 (2014) 56–70.

[24] J. Han, Y. Fu, Discovery of multiple-level association rules from large databases, in: U. Dayal, P. M. D. Gray, S. Nishio (Eds.), VLDB, Morgan Kaufmann, 1995, pp. 420–431.

[25] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases, in: SIGMOD ’93, ACM, 1993, pp. 207–216.

[26] A. J. Lee, R.-W. Hong, W.-M. Ko, W.-K. Tsao, H.-H. Lin, Mining spatial association rules in image databases, Inf. Sci.

177 (7) (2007) 1593–1608.

[27] C. Borgelt, R. Kruse, Induction of association rules: Apriori implementation, in: 15th Conference on Computational Statis- tics (Compstat 2002, Berlin, Germany), Physica Verlag, Heidelberg, 2002, pp. 395–400.

[28] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: J. B. Bocca, M. Jarke, C. Zaniolo (Eds.), VLDB, Morgan Kaufmann, 1994, pp. 487–499.

[29] F. Alsolami, T. Amin, M. Moshkov, B. Zielosko, K. Żabiński, Comparison of heuristics for optimization of association rules,

(9)

4620 Mikhail Moshkov et al. / Procedia Computer Science 207 (2022) 4613–4620

Fundam. Informaticae 166 (1) (2019) 1–14.

[30] M. J. Zaki, K. Gouda, Fast vertical mining using diffsets, in: L. Getoor, T. E. Senator, P. Domingos, C. Faloutsos (Eds.), KDD, ACM, 2003, pp. 326–335.

[31] M. J. Moshkov, M. Piliszczuk, B. Zielosko, Greedy algorithm for construction of partial association rules, Fundam. Infor- maticae 92 (3) (2009) 259–277.

[32] B. Zielosko, Application of dynamic programming approach to optimization of association rules relative to coverage and length, Fundam. Informaticae 148 (1-2) (2016) 87–105.

[33] G. G. Rupali, Apriori based algorithms and their comparisons, International Journal of Engineering REsearch & Technology 02 (2013).

[34] H. S. Nguyen, D. Śl¸ezak, Approximate reducts and association rules - correspondence and complexity results, in: N. Zhong, A. Skowron, S. Ohsuga (Eds.), RSFDGrC, Vol. 1711 of LNCS, Springer, 1999, pp. 137–145.

[35] J. Rauch, Observational Calculi and Association Rules, Vol. 469 of Studies in Computational Intelligence, Springer Berlin Heidelberg, 2013.

[36] J. S. Park, M.-S. Chen, P. S. Yu, An effective hash based algorithm for mining association rules, in: M. J. Carey, D. A.

Schneider (Eds.), SIGMOD Conference, ACM Press, 1995, pp. 175–186.

[37] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in large databases, in: In Proc.

International Conference on Very Large Data Bases (VLDB), ACM Press, 1995, pp. 432–443.

[38] H. Toivonen, Sampling large databases for association rules, in: T. M. Vijayaraman, A. P. Buchmann, C. Mohan, N. L.

Sarda (Eds.), VLDB, Morgan Kaufmann, 1996, pp. 134–145.

Referensi

Dokumen terkait