Query Decomposition Strategy
for Integration of Semistructured Data
Handoko
School of Computer Sci. and Software Eng. University of Wollongong, Australia
[email protected]
J.R. Getta
School of Computer Sci. and Software Eng. University of Wollongong, Australia
[email protected]
ABSTRACT
Data integration systems provide a unified view of various sources of data distributed over wide-area networks. User requests issued at a central site must be decomposed into a number of sub-requests that are later processed at the remote sites. The results are integrated at a central site and returned to a user.
The decomposition strategy for global user requests and the scheduling of sub-requests at a central site have a significant impact on the performance of the data integration process. This paper proposes an efficient decomposition strategy for systems that integrate semistructured data. We define a new system of operations on XML documents to represent XQuery user requests and the results of decompositions of such requests. A cost-based optimisation is used to find the optimal size of sub-requests and their optimal scheduling at a central site.
Categories and Subject Descriptors
H.2.4 [Information systems]: Systems - Query processing, distributed database
General Terms
Theory, Algorithms, Performance
Keywords
Query decomposition, data integration, semistructured data
1.
INTRODUCTION
The proliferation of IT infrastructures drives the rising number of diverse endpoints. Many different applications, software and hardware systems, data formats and database systems emerge and exist on the Internet. Unfortunately, many of these endpoints are not compatible with other endpoints, and data integration is needed to connect the diverse systems and uncorrelated data formats. Global information systems provide their users with a centralized and transparent view of heterogeneous and distributed sources of data.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. iiWAS ’14, December 05 - 06 2014, Hanoi, Viet Nam
ACM 978-1-4503-3001-5/14/12 ...$15.00. http://dx.doi.org/10.1145/2684200.2684343.
The requests to access data issued at a central site are decomposed and processed at the remote sites. The results returned to the central site are integrated into the final outcomes. A data integration component of the system processes data retrieved and transmitted from the remote sites according to previously prepared data integration plans. The decomposition strategy for user requests issued at a central site and the scheduling of sub-requests for processing at the remote sites have a significant impact on the performance of the entire data integration process.
After XML became a ubiquitous data representation standard, a lot of research effort has been invested into more efficient integration of semistructured data. Unlike in the relational model, an XML document can have the structure of a typical text document or of a semistructured database, and it can be disseminated in the form of XML streams upon request or by broadcasting. Research in integration of XML streams addresses the problems of fragmentation handling and evaluation of XML queries in distributed environments. This paper concentrates on optimal decomposition of XML queries for processing at the remote data sites.
Query decomposition is one of the key components of a data integration system. The decomposition techniques developed earlier for SQL queries cannot be easily adopted for semistructured data, because transforming its structure into relational structures significantly decreases performance. We need a new cost-based strategy for decomposition of XML queries that is based purely on the properties and efficiency indicators of elementary operations on XML documents. In this work we describe a decomposition process that starts from the translation of an XML query into XML algebra expressions. The expressions are then decomposed into a number of the largest sub-expressions such that each one of them can be entirely processed at one remote site without any transfer of data between the remote sites. In the next step, we reduce the sub-expressions obtained earlier into sub-expressions whose processing at the remote sites, transmission of the results to a central site, and final processing at a central site cost less than before. In this step the algorithm finds an optimal balance between the amounts of processing at the remote sites and at a central site. At the end of the process we obtain an optimal integration strategy for processing the results of user requests at the remote sites.
Section 5 presents an algorithm for optimal decomposition of XML queries and generation of a data integration process. Section 6 concludes the paper.
2.
PREVIOUS WORK
A number of XML algebra systems have been proposed in order to get better performance of data integration. The XAL system [2] is based on a model of a rooted, connected, directed cyclic or acyclic graph. Its operators are very similar to the operators of relational algebra. The XAnswer algebra [6] includes operators on relational-like data structures. TAX (Tree Algebra for XML) [4] is a system of operations based on a tree view of XML documents. The XML algebra proposed in [3] is a tree-based algebra generalizing the relational algebra.
Approaches to integration of heterogeneous sources of semistructured data proposed by Buratti [1], Salem [7], and Yan [9] are designed to integrate data into Data Warehousing (DW) or Business Intelligence (BI) systems. Salem [7] proposed near real-time requirements and active data integration to minimize time-consuming tasks.
Thuy [8] and Li [5] present decomposition of XML queries, but they do not address the problem of balancing the amount of processing between the remote sites and the central site, which is addressed in this work.
3.
XML ALGEBRA
Integration of semistructured data requires a system of elementary operations on containers with semistructured data. The system of operations on XML documents presented in this section allows for incremental processing of semistructured data against entire XML documents. The structure of an XML document is described by an Extended Regular Tree Grammar [3].
Definition 1. An Extended Tree Grammar (ETG) is a 5-tuple $G = \langle N, T, A, S, P \rangle$ where: $N$ is a finite set of non-terminals, $T$ is a finite set of terminals, $A$ is a finite set of attributes, $S \in N$ is a start symbol, and $P$ is a finite set of production rules $X \to t[a]\{r\}$ or $X \to [a]$, where $X \in N$, $t \in T$, $a \in A$ is a set of attributes, and $r$ is a regular expression over $N$ with the operators $?$, $+$, $*$, $|$, and brackets $()$ for grouping.
Definition 2. A schema $g$ of an XML document $X$ is defined as an ETG that describes the structure of the document.
Definition 3. Let $G$ be a set of schemas. A data container is defined as $D(G) = \{\langle X(g), id_X \rangle : X(g) \text{ is an XML document with a schema } g \text{ where } g \in G,\ id_X \text{ is a unique and immutable identifier assigned to the document } X(g)\}$.
The operators of XML algebra act on the data containers and each one of them returns a data container, which automatically obtains a new identifier. The system includes the following set of basic operators: $\langle$restructuring ($\pi$), filtering ($\sigma$), join ($\bowtie$), antijoin ($\sim$), and union ($\cup$)$\rangle$. The operators are conceptually consistent with the basic operators of the relational data model.
Definition 4. Let $D(G)$ be a data container with XML documents. Restructuring is a unary operator denoted by $\pi_y(D(G)) = \{\langle X(y), id_X \rangle : (\forall g \in G,\ y \sqsubseteq g)\}$ where $y$ is a schema for the result documents. Each document $\langle X(g), id_X \rangle$ in $D(G)$ is transformed so that its structure satisfies a sentence of grammar $y$. The transformation can be done by traversing schema $y$ from its start symbol and associating the non-terminal symbols with XML nodes. Nodes which are included in the schema $y$ are retained, and the rest of the document is removed.
The filtering operation preserves or removes the documents from a data container according to a given condition.
Definition 5. Let $D(G)$ be a data container with XML documents. Filtering is a unary operator denoted as $\sigma_\varphi(D(G)) = \{\langle X(g), id_X \rangle : \langle X(g), id_X \rangle \in D(G) \text{ and } f(X(g), \varphi)\}$ where $f(X(g), \varphi) \to \{true, false\}$, $\varphi$ is a triple $\langle P, min, max \rangle$, $P$ is a valid path, $min$ is the minimum occurrence of $P$ (default -1), and $max$ is the maximum occurrence of $P$ (default -1).
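The filtering operator of Definition 5 can be illustrated with a minimal Python sketch. It assumes that a data container is a list of (document, identifier) pairs, that the path $P$ is an ElementTree path expression, and that the default value -1 means "no bound"; the paper does not spell out the last point, so it is an assumption of this sketch. The function names `satisfies` and `filtering` are hypothetical.

```python
import xml.etree.ElementTree as ET

def satisfies(doc, path, min_occ=-1, max_occ=-1):
    """Evaluation function f(X, phi) for phi = (P, min, max).

    Assumption: -1 means that the corresponding bound is not checked.
    """
    occurrences = len(doc.findall(path))
    if min_occ != -1 and occurrences < min_occ:
        return False
    if max_occ != -1 and occurrences > max_occ:
        return False
    return True

def filtering(container, path, min_occ=-1, max_occ=-1):
    """sigma_phi(D): keep the (document, id) pairs that satisfy phi."""
    return [(doc, doc_id) for doc, doc_id in container
            if satisfies(doc, path, min_occ, max_occ)]

# A small container with three documents (illustrative data only).
container = [
    (ET.fromstring("<book><author>A</author><author>B</author></book>"), "d1"),
    (ET.fromstring("<book><author>C</author></book>"), "d2"),
    (ET.fromstring("<book><title>T</title></book>"), "d3"),
]

# Keep the documents with exactly one <author> child.
kept = [doc_id for _, doc_id in filtering(container, "author", 1, 1)]
```

With both bounds left at -1 the sketch keeps every document, which matches the reading of -1 as "unbounded".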
In the definitions below Di(Gi) and Dj(Gj) denote data containers with XML documents.
Definition 6. The union operator is defined as $D_i(G_i) \cup D_j(G_j) = \{\langle X(g), id_X \rangle : \langle X(g), id_X \rangle \in D_i \text{ or } \langle X(g), id_X \rangle \in D_j\}$ where a schema $g \in G_i \cup G_j$.
Definition 7. The join operator is defined as $D_i(G_i) \bowtie_{\rho,\varphi} D_j(G_j) = \{\langle X(g), id_X \rangle : X(g) = \rho(X_i(g_i), X_j(g_j)) \text{ and } f_\varphi(X_i(g_i), X_j(g_j))\}$ where $\rho$ is a parent node for the resulting documents, $\varphi$ is a join condition, $g$ is a new schema associated with the resulting documents, and $f$ is an evaluation function such that $f_\varphi(X_i(g_i), X_j(g_j)) \to \{true, false\}$.
Definition 8. The antijoin operator is defined as $D_i(G_i) \sim_\varphi D_j(G_j) = \{\langle X(g), id_X \rangle : \langle X(g), id_X \rangle \in D_i \text{ and } f_\varphi(X_i(g_i), X_j(g_j))\}$ where $g \in G_i$ is a schema associated with the result documents, and $f$ is an evaluation function such that $f_\varphi(X_i(g_i), X_j(g_j)) \to \{true, false\}$.
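The union and join operators of Definitions 6 and 7 can be sketched in the same style. The sketch assumes that a data container is a list of (document, identifier) pairs, that $\rho$ is given as the tag name of the new parent element, and that the join condition is an ordinary Python predicate. The identifier generator `_ids` and the function names are hypothetical, and the nested-loop evaluation is only one possible physical implementation.

```python
import copy
import xml.etree.ElementTree as ET
from itertools import count

_ids = count(1)  # hypothetical source of fresh identifiers for joined documents

def union(d_i, d_j):
    """D_i U D_j: pairs found in either container, one entry per identifier."""
    seen, result = set(), []
    for doc, doc_id in d_i + d_j:
        if doc_id not in seen:
            seen.add(doc_id)
            result.append((doc, doc_id))
    return result

def join(d_i, d_j, rho, condition):
    """Join: place every matching pair of documents under a new parent rho."""
    result = []
    for x_i, _ in d_i:
        for x_j, _ in d_j:
            if condition(x_i, x_j):
                parent = ET.Element(rho)
                parent.append(copy.deepcopy(x_i))
                parent.append(copy.deepcopy(x_j))
                result.append((parent, f"j{next(_ids)}"))
    return result

authors = [
    (ET.fromstring("<author id='1'><name>Ann</name></author>"), "a1"),
    (ET.fromstring("<author id='2'><name>Bob</name></author>"), "a2"),
]
books = [
    (ET.fromstring("<book author='1'><title>T1</title></book>"), "b1"),
    (ET.fromstring("<book author='1'><title>T2</title></book>"), "b2"),
]

def same_author(author, book):
    return author.get("id") == book.get("author")

joined = join(authors, books, "entry", same_author)
```

Here `joined` holds one `entry` document per matching author and book pair, each with a fresh identifier.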
4.
DATA INTEGRATION SYSTEM
In a data integration system, a query issued at a central site is translated into an XML algebra expression and decomposed into sub-queries, which are later sent to the remote sites for processing. The partial results are sent back to the central site to be integrated into the final answer.
Definition 9. Let $\{x_1, \ldots, x_n\}$ be a set of pointers to the data containers with XML documents located at the remote sites. A global query expression $e(x_1, \ldots, x_n)$ is an expression built from the operations of XML algebra defined above and the pointers to remote data containers.
In the first step of query processing at a central site, a query expressed in a high-level query language like XQuery is transformed through normalisation into XQuery Core. Next, XQuery Core is translated into a global query expression and optimized using the standard techniques of syntax-based optimisation, e.g. moving filtering before binary operations. At this point, a global query expression $e(x_1, \ldots, x_n)$ contains: (1) all arguments which point to data containers with XML documents located at the remote sites, (2) sub-expressions which operate over arguments from a single database located at a remote site, (3) sub-expressions which operate over arguments from multiple databases located at the remote sites, and (4) identical sub-expressions.
remote sites. Performing all computations at a central site allows for strict control of the process. Outsourcing the computations to the remote sites creates the risk that some of them may face unexpected delays. On the other hand, performing all computations at a central site may require huge amounts of data to be transmitted from the remote sites. Usually, more processing done at the remote sites means less data transmitted over a network.
Definition 10. Query decomposition is a process that transforms a global query expression $e(x_1, \ldots, x_n)$ into an expression $f(q_1, \ldots, q_k)$ where for all $i = 1, \ldots, k$, $q_i = e_i(x_{i_1}, \ldots, x_{i_j})$, $\{x_{i_1}, \ldots, x_{i_j}\} \subseteq \{x_1, \ldots, x_n\}$, $x_{i_1}, \ldots, x_{i_j}$ point to the same remote site, and the results of processing $f(q_1, \ldots, q_k)$ are the same as the results of processing $e(x_1, \ldots, x_n)$.
In the definition of query decomposition, every sub-expression $q_i$ is considered as an expression such that all its arguments come from the same remote site. Decomposition of a global query expression creates from $e(x_1, \ldots, x_n)$ a set of queries $\{q_1, \ldots, q_k\}$ where $k \le n$. Next, each query $q_i$ is sent to and computed at one of the remote sites. It returns to a central site a data container $D_i$ with the partial results. The central site combines the data containers $D_1, \ldots, D_k$ to produce a final result.
Definition 11. Let $\{q_1, \ldots, q_k\}$ be a set of queries obtained from a decomposition of $e(x_1, \ldots, x_n)$. Let $f(q_1, \ldots, q_k)$ be an expression that combines the results of the queries such that $f(q_1, \ldots, q_k) = e(x_1, \ldots, x_n)$. A data integration expression $f(D_1, \ldots, D_k)$ is an expression obtained from $f(q_1, \ldots, q_k)$ by a systematic replacement of the symbols $q_1, \ldots, q_k$ with the data containers $D_1, \ldots, D_k$ that are the results of processing $q_1, \ldots, q_k$ at the remote sites.
5.
QUERY DECOMPOSITION STRATEGY
An optimal query decomposition is a strategy that finds a set of sub-expressions $q_1, \ldots, q_k$ whose processing at the remote sites, followed by processing of the data integration expression $f(D_1, \ldots, D_k)$ at a central site, requires the lowest costs. Providing an accurate cost estimation is one of the hardest problems in query decomposition. The total cost of query processing is defined as:
$$TCost(\alpha) = PCost + CCost \qquad (1)$$
where $TCost(\alpha)$ represents the total cost of processing a sub-expression $\alpha$, $PCost$ represents the accumulated cost of processing the individual operators in the sub-expression, and $CCost$ represents the communication costs required to transmit the results from the remote sites to a central site.
The cost of processing the individual operators is determined by the IO cost to access data in secondary storage and the CPU cost to execute binary operations. IO cost is significantly larger than CPU cost, therefore CPU cost is typically ignored. At a remote site, $PCost$ depends on how the actual documents are read from a semistructured database, how many disk blocks are accessed, and the characteristics of the secondary storage. At a central site, it is mainly determined by the IO cost to read and write temporary results of computation. Communication cost ($CCost$) is also an important factor, as remote sites do not possess uniform communication characteristics. It is also determined by the size of the documents to be transferred from the remote sites to a central site.
The query decomposition process starts from identification of the largest sub-expressions in a global query expression such that all data containers processed by each largest sub-expression are located at the same remote site. The largest sub-expressions are found by systematic labeling of the operation nodes in a syntax tree of the global query expression with the identifiers of the remote sites of their arguments. Labeling starts from the operations just above the leaf level in the syntax tree and continues towards the root node. At the end of the labeling process, each largest sub-expression consists of all nodes labeled with exactly the same label.
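The bottom-up labeling described above can be sketched as follows. The `Node` class, the site names `S1` and `S2`, and the function names are hypothetical. The sketch assumes that an operation node receives a site label only when all of its arguments carry the same label, and that the largest sub-expressions are the topmost operation nodes that carry a label.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    site: Optional[str] = None                     # set for leaves only
    children: list = field(default_factory=list)   # empty for leaves
    label: Optional[str] = None                    # None means "mixed sites"

def label_sites(node):
    """Label nodes from the leaf level towards the root."""
    if not node.children:
        node.label = node.site
        return node.label
    labels = {label_sites(child) for child in node.children}
    node.label = labels.pop() if len(labels) == 1 else None
    return node.label

def largest_subexpressions(node):
    """Topmost operation nodes whose arguments all live at one site."""
    if node.children and node.label is not None:
        return [node]
    found = []
    for child in node.children:
        found.extend(largest_subexpressions(child))
    return found

# omega(alpha(x1, x2), delta(y1, y2)), with x* at site S1 and y* at site S2.
tree = Node("omega", children=[
    Node("alpha", children=[Node("x1", "S1"), Node("x2", "S1")]),
    Node("delta", children=[Node("y1", "S2"), Node("y2", "S2")]),
])
label_sites(tree)
roots = [n.name for n in largest_subexpressions(tree)]
```

In this example the root `omega` mixes two sites and stays unlabeled, so the largest single-site sub-expressions are `alpha` and `delta`.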
Next, we consider the sub-expressions found in the previous step and find an optimal processing strategy separately for each one of them. For example, let $\alpha(p, q)$ be a sub-expression with an operation $\alpha$ and sub-expressions $p$ and $q$ such that all data containers processed by $p$ and $q$ are located at the same remote site. Then, there exist five possible strategies for processing $\alpha(p, q)$. The subscripts $r$ and $c$ denote costs incurred at a remote site and at a central site, respectively.
(1) Both sub-expressions $p$ and $q$, and operation $\alpha$ are processed at a remote site and the results of $\alpha$ are transmitted to a central site. Then, the total processing cost of sub-expression $\alpha(p, q)$ is $TCost^{(1)}(\alpha) = PCost_r(\alpha) + CCost(\alpha) + TCost_r(p) - CCost(p) + TCost_r(q) - CCost(q)$. This case is possible only if all arguments of sub-expressions $p$ and $q$ are located at the same site.
(2) Both sub-expressions $p$ and $q$ are processed at the remote sites, the results are transmitted to a central site, and operation $\alpha$ is processed at the central site. Then, the total processing cost is $TCost^{(2)}(\alpha) = TCost_r(p) + TCost_r(q) + PCost_c(\alpha)$, where $TCost_r(p)$ and $TCost_r(q)$ include $CCost(p)$ and $CCost(q)$.
(3) Sub-expression $p$ is processed at a remote site, while the other sub-expression and operation $\alpha$ are processed at a central site. Then, the total processing cost is $TCost^{(3)}(\alpha) = TCost_r(p) + TCost_c(q) + PCost_c(\alpha)$, where $TCost_r(p)$ includes $CCost(p)$.
(4) Sub-expression $q$ is processed at a remote site, while the other sub-expression and operation $\alpha$ are processed at a central site. Then, the total processing cost is $TCost^{(4)}(\alpha) = TCost_c(p) + TCost_r(q) + PCost_c(\alpha)$, where $TCost_r(q)$ includes $CCost(q)$.
(5) Both sub-expressions $p$ and $q$ and operation $\alpha$ are processed at a central site. Then, the total processing cost is $TCost^{(5)}(\alpha) = TCost_c(p) + TCost_c(q) + PCost_c(\alpha)$.
The total costs in each one of the cases listed above are compared and a variant with the lowest processing costs is selected for processing of sub-expressionα(p, q).
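The comparison of the five strategies can be sketched as a direct transcription of the formulas above. The dictionary keys and all numeric values are hypothetical and serve only to make the example runnable; in practice the cost estimates would come from the statistics of the remote and central sites.

```python
def strategy_costs(alpha, p, q):
    """Return TCost^(k)(alpha) for the five strategies k = 1..5.

    alpha: {'pcost_r', 'pcost_c', 'ccost'} for the operation itself
    p, q:  {'tcost_r', 'tcost_c', 'ccost'} for the two sub-expressions,
           where tcost_r already includes the corresponding ccost.
    """
    return {
        1: (alpha["pcost_r"] + alpha["ccost"]
            + p["tcost_r"] - p["ccost"]
            + q["tcost_r"] - q["ccost"]),
        2: p["tcost_r"] + q["tcost_r"] + alpha["pcost_c"],
        3: p["tcost_r"] + q["tcost_c"] + alpha["pcost_c"],
        4: p["tcost_c"] + q["tcost_r"] + alpha["pcost_c"],
        5: p["tcost_c"] + q["tcost_c"] + alpha["pcost_c"],
    }

def best_strategy(alpha, p, q):
    """Pick the strategy number with the lowest total cost."""
    costs = strategy_costs(alpha, p, q)
    return min(costs, key=costs.get)

# Hypothetical cost estimates.
alpha = {"pcost_r": 40, "pcost_c": 30, "ccost": 10}
p = {"tcost_r": 50, "tcost_c": 70, "ccost": 20}
q = {"tcost_r": 60, "tcost_c": 80, "ccost": 35}

costs = strategy_costs(alpha, p, q)
choice = best_strategy(alpha, p, q)
```

With these numbers strategy (1) wins because the result of $\alpha$ is much smaller than the intermediate results of $p$ and $q$, so processing everything remotely saves communication cost.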
Enumeration of all possible strategies of processing of a sub-expression is performed in three steps. First, we label the nodes in the sub-expression with positive natural numbers such that if a node is labeled with $n$ then its children nodes are labeled with $(n * 2)$ and $(n * 2 + 1)$.
label $n * 2$ or/and $n * 2 + 1$ in the current labeling. A function FindIndexOf(string label, char '-') is employed to get the leftmost dash character '-' in the label. The character '-' represents the next node that is still not set.
Algorithm 1: Find all combinations
Input: Sub-expression in an array of nodes (N[max])
Result: A set of sub-expressions
strSeed = {'-', ..., '-'}; j = 0; Remote = 'R'; Central = 'C';
string strResult[max];
if (strSeed[1] = '-') then
    stack.push(SetData(N, strSeed, Central, 1));
    stack.push(SetData(N, strSeed, Remote, 1));
end
while (!stack.empty()) do
    string strCombination = stack.pop();
    int thisPos = FindIndexOf(strCombination, '-');
    if (thisPos > 0) then
        stack.push(SetData(N, strCombination, Central, thisPos));
        stack.push(SetData(N, strCombination, Remote, thisPos));
    else
        strResult[j++] = strCombination;
    end
end
return strResult;
Algorithm 2: Create a seed for combination SetData(N[max], str, value, pos)
Input: An array of nodes, and a combination string
Result: A new combination string
const Remote = 'R'; const Central = 'C'; hasSub = false;
if (pos >= max) then
    return str;
if (str[pos] == '-') then
    str[pos] = value;
if (N[pos*2] or N[pos*2+1] exists in N) then
    hasSub = true;
if !(value = Central || (value = Remote & !hasSub)) then
    str = SetData(N, str, Remote, pos*2);
    str = SetData(N, str, Remote, pos*2 + 1);
return str
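The enumeration performed by Algorithms 1 and 2 can also be sketched in Python. The sketch below is a recursive reformulation and not the stack-based procedure of the paper. It assumes the rule that can be read off Algorithm 2: once a node is labeled remote ('R'), its whole subtree is remote, whereas a central ('C') node leaves each of its children free to be remote or central. Positions that do not correspond to an operation node are written as "_". The function name `enumerate_labelings` is hypothetical.

```python
from itertools import product

def enumerate_labelings(nodes):
    """Enumerate remote/central labelings of an operation tree.

    nodes: set of indices of the operation nodes, where the root is 1
    and the children of node n are 2n and 2n + 1.
    """
    def subtree(n):
        if n not in nodes:
            return []
        return [n] + subtree(2 * n) + subtree(2 * n + 1)

    def expand(n):
        if n not in nodes:
            return [{}]                       # nothing to label here
        # Option 1: node n and everything below it is remote.
        options = [{m: "R" for m in subtree(n)}]
        # Option 2: node n is central, children are labeled independently.
        for left, right in product(expand(2 * n), expand(2 * n + 1)):
            options.append({n: "C", **left, **right})
        return options

    size = max(nodes)
    return ["".join(a.get(i, "_") for i in range(1, size + 1))
            for a in expand(1)]

# A tree with operation nodes 1, 2, 3 and 5 (node 4 does not exist).
labelings = enumerate_labelings({1, 2, 3, 5})
```

For this tree the sketch yields seven labelings, from the all-remote `RRR_R` to the all-central `CCC_C`.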
Finally, the labelings found in the previous step are used to create the total cost formulas, which are later evaluated in order to find the labeling with the lowest total cost. The cost formula for every labeling includes costs at a "remote" site, costs at a "central" site, and communication costs. The nodes with a "remote" label have their processing costs calculated in a way determined by the computing resources available at a particular remote site. Processing costs of the "central" nodes are determined by the computing resources available at the central site. The data transmission costs from a "remote" to a "central" node contribute to the total costs as a communication cost component. Algorithm 3 is a sample implementation of this step.
As an illustrative example, consider a user query issued at a central site and transformed into a global query expression as shown in the syntax tree in Figure 1(a). It contains a sub-expression $\alpha$ which has all arguments from a remote site $S$. For the sub-expression $\alpha(x_1, \ldots, x_5)$ we construct a syntax tree as shown in Figure 1(b), such that all nodes are labeled with positive natural numbers and the root node is labeled with the number 1.
Algorithm 3: Find the best combination
Input: A set of combinations Result[r]
Result: The best cost combination
BestIndex = 0; s = ""; Central = 'C'; Remote = 'R'; BestCost = ∞;
for (x = 0 to r-1) do
    s = Result[x];
    TotalCost = TCost(s, 1, 2, 3);
    if (TotalCost < BestCost) then
        BestCost = TotalCost; BestIndex = x;
    end
end
return Result[BestIndex];
Algorithm 4: Find total cost TCost(s, α, p, q)
Input: A decomposition labeling s, a node α, and its child nodes p and q
Result: Total cost for sub-expression α
TLeft = 0; TRight = 0;
if (s[p] != "_") then
    TLeft = TCost(s, p, p∗2, p∗2 + 1);
if (s[q] != "_") then
    TRight = TCost(s, q, q∗2, q∗2 + 1);
if (s[α] = "R") then
    Total = PCostr(α) + CCost(α) + TLeft − CCost(p) + TRight − CCost(q);
else
    Total = PCostc(α) + TLeft + TRight;
return Total
In Algorithms 1 and 2, we construct an array of characters to represent all possible decomposition labelings of the sub-expression. The numeric labels attached to the nodes simplify the process of decomposition labeling, because a labeling can be represented as an array that starts from index 1. An array {R,R,R,_,R} represents that all nodes are processed at a remote site. Index 4 contains the "_" (underscore) character to represent that the left child of the node labeled 2 does not exist in the syntax tree. At the end of these algorithms, we get all possible labelings as results: {R,R,R,_,R}, {C,R,R,_,R}, {C,R,C,_,R}, {C,C,R,_,R}, {C,C,R,_,C}, {C,C,C,_,R}, and {C,C,C,_,C}.
In the next step, finding the best decomposition strategy requires an evaluation of all possible strategies constructed in the previous steps. For a strategy represented as {C,R,C,_,R} in Figure 1(b), a central site sends a sub-expression $\beta(x_1, x_2, x_3) = \beta(x_1, \theta(x_2, x_3))$ to a remote site $S$ and waits for the results for further processing. The central site receives the results of computing from the remote site $S$ in a data container $D_1$. The total cost of sub-expression $\alpha(\beta(x_1, \theta(x_2, x_3)), \gamma(x_4, x_5))$ with the strategy labeled {C,R,C,_,R} is $TCost(\alpha) = TCost_r(\beta) + TCost_c(\gamma) + PCost_c(\alpha)$, where:
$$TCost_r(\beta) = PCost_r(\beta) + TCost_r(\theta) + CCost(\beta) - CCost(\theta)$$
$$TCost_c(\gamma) = TCost_r(x_4) + TCost_r(x_5) + PCost_c(\gamma)$$
$$TCost_r(\theta) = PCost_r(\theta) + CCost(\theta)$$
$$TCost_r(x_4) = PCost_r(x_4) + CCost(x_4)$$
$$TCost_r(x_5) = PCost_r(x_5) + CCost(x_5)$$
and $x_4$, $x_5$ in a sub-expression $\gamma(x_4, x_5)$ represent
Figure 1: (a) A syntax tree of a global query expression (b) An example of a decomposition strategy to balance query processing between a remote and a central site (c) Data integration with balancing strategy
and communication costs are needed to transfer the documents to a central site. The costs of these filtering operations can be ignored, because their communication costs are typically larger, therefore $TCost_r(x_4) = CCost(x_4)$ and $TCost_r(x_5) = CCost(x_5)$. Then, the total cost of sub-expression $\alpha$ is $TCost(\alpha) = PCost_r(\beta) + PCost_r(\theta) + CCost(\beta) + PCost_c(\gamma) + CCost(x_4) + CCost(x_5) + PCost_c(\alpha)$.
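The derivation above can be checked numerically with a small recursive cost function in the spirit of Algorithm 4. The sketch assumes that the total cost of a leaf (a remote data container) is its communication cost, as in the simplification $TCost_r(x_4) = CCost(x_4)$, and that a remote node subtracts the communication cost of each argument because the argument never leaves the remote site. The tree encoding, the function name `total_cost`, and all cost values are hypothetical.

```python
def total_cost(node, tree, labeling, pcost_r, pcost_c, ccost):
    """Total cost of the sub-expression rooted at node.

    tree:     operation node -> (left argument, right argument)
    labeling: operation node -> 'R' (remote) or 'C' (central)
    """
    if node not in tree:                  # leaf: a remote data container
        return ccost[node]
    left, right = tree[node]
    t_left = total_cost(left, tree, labeling, pcost_r, pcost_c, ccost)
    t_right = total_cost(right, tree, labeling, pcost_r, pcost_c, ccost)
    if labeling[node] == "R":
        # Arguments stay at the remote site, only the result is shipped.
        return (pcost_r[node] + ccost[node]
                + t_left - ccost[left] + t_right - ccost[right])
    return pcost_c[node] + t_left + t_right

# alpha(beta(x1, theta(x2, x3)), gamma(x4, x5)) with labeling {C,R,C,_,R}.
tree = {
    "alpha": ("beta", "gamma"),
    "beta": ("x1", "theta"),
    "theta": ("x2", "x3"),
    "gamma": ("x4", "x5"),
}
labeling = {"alpha": "C", "beta": "R", "gamma": "C", "theta": "R"}

# Hypothetical cost estimates.
pcost_r = {"beta": 12, "theta": 7}
pcost_c = {"alpha": 9, "gamma": 5}
ccost = {"x1": 4, "x2": 6, "x3": 3, "x4": 8, "x5": 2, "theta": 5, "beta": 10}

cost = total_cost("alpha", tree, labeling, pcost_r, pcost_c, ccost)
closed_form = (pcost_r["beta"] + pcost_r["theta"] + ccost["beta"]
               + pcost_c["gamma"] + ccost["x4"] + ccost["x5"]
               + pcost_c["alpha"])
```

For these values the recursive evaluation and the closed-form expression for $TCost(\alpha)$ give the same number.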
Since CPU cost is typically ignored, the processing cost of a simple sub-expression $a(r, s)$ at a remote site ($PCost_r(a)$) is determined by the IO cost to read documents from the data containers pointed to by $r$ and $s$ in secondary storage at the remote site, and by the algorithm used to implement the physical XML algebra operations. For a simple sub-expression $a(r, s)$ which represents a join operation ($r \bowtie s$), its processing cost is $PCost_r(a) = |n_r| * |n_s|$, where $|n_r|$ and $|n_s|$ are the estimated numbers of documents in $r$ and $s$ respectively. In a more complex situation, IO cost might include the number of blocks accessed, the estimated average size of documents, and the main memory size of a remote site.
Results from the remote sites are received at a central site in the form of data containers, which are expected to fit in main memory. In some cases, results of computation at a central site might be stored in secondary storage to minimize computation. Operations which operate on data stored in secondary storage need to include the IO cost to read the temporary result. They might also incur an IO cost to write the result of computation to secondary storage, when temporary results are updated.
Communication cost ($CCost$) is determined by the amount of data to be transferred to a central site. $CCost(x_4)$ and $CCost(x_5)$ in the example above are the communication costs to transfer the documents in the data containers pointed to by $x_4$ and $x_5$ at a remote site $S$ to a central site. Meanwhile, $CCost(\beta)$ is the communication cost to bring the computation result of sub-expression $\beta$ to a central site. If $|n|$ is the estimated number of result documents and $s$ is the estimated average size of a document to be transferred, then $CCost(\beta) = |n| * s$.
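The two estimates used in this section, $PCost_r(a) = |n_r| * |n_s|$ for a join and $CCost = |n| * s$ for a transfer, can be put side by side in a short sketch. The document counts and sizes are hypothetical; the example only illustrates why a selective join computed remotely can reduce the amount of data sent to the central site.

```python
def join_pcost(n_r, n_s):
    """Estimated processing cost of a join: |n_r| * |n_s|."""
    return n_r * n_s

def ccost(n_docs, avg_doc_size):
    """Estimated communication cost: number of documents * average size."""
    return n_docs * avg_doc_size

# Hypothetical statistics: containers r and s hold 1000 and 200 documents.
remote_join_work = join_pcost(1_000, 200)

# Join at the remote site: only 50 result documents of about 2000 bytes move.
transfer_after_remote_join = ccost(50, 2_000)

# Join at the central site: both input containers have to be transferred.
transfer_for_central_join = ccost(1_000, 1_200) + ccost(200, 800)
```

The processing and communication estimates are expressed in different units, so a real optimizer would need to weight them before adding them into a single total cost; the paper does not specify such weights.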
Using the decomposition strategy in Figure 1(b), the global query expression in Figure 1(a) is transformed into the data integration expression in Figure 1(c).
6.
SUMMARY AND FUTURE WORK
The query decomposition strategy proposed in this paper optimizes the process of semistructured data integration through careful balancing of the computations between a central site and the remote sites. A data integration process is formally represented as a global data integration expression whose arguments are the containers with XML documents located at the remote sites and whose operations belong to a set of XML algebra operations. The objective of the optimisation is to partition the expression into remotely computed sub-expressions and a centrally computed final data integration expression in a way that minimizes the remote and central data processing costs and the data transmission costs. An online algorithm for optimal processing of the final data integration expression remains an open problem.
7.
ACKNOWLEDGMENTS
This work is supported by the Directorate General of Higher Education (Dikti), Indonesian Ministry of National Education.
8.
REFERENCES
[1] G. Buratti. A Model and an Algebra for Semi-Structured and Full-Text Queries. PhD thesis, Informatica, Università di Bologna, Padova, 2007.
[2] F. Frasincar, G.-J. Houben, and C. Pau. XAL: an algebra for XML query optimization. Australia Computer Science Communication, 24(2):49–56, January 2002.
[3] Handoko and J. R. Getta. An XML algebra for online processing of XML documents. In The 15th Int. Conference on Information Integration and Web-based Applications & Services, IIWAS'13, Vienna, Dec 2013.
[4] H. V. Jagadish, L. V. S. Lakshmanan, D. Srivastava, and K. Thompson. TAX: A tree algebra for XML. In Proc. DBPL Conf, pages 149–164, 2001.
[5] L. Li, M. Lee, and W. Hsu. Rewriting queries for XML integration systems. In Database and Expert Systems Applications, volume 4080 of Lecture Notes in Computer Science, pages 138–148. Springer Berlin Heidelberg, 2006.
[6] M. Lukichev, B. Novikov, and P. Mehra. An XML-algebra for efficient set-at-a-time execution. ComSIS, 9(1):64–80, January 2012.
[7] R. Salem, O. Boussaïd, and J. Darmont. Active XML-based Web data integration. Information Systems Frontiers, 15(3):371–398, 2013.
[8] L. Thuy and D. Duong. Query decomposition using the XML declarative description language. In Computational Science and Its Applications - ICCSA 2005, volume 3481 of Lecture Notes in Computer Science, pages 1066–1075. Springer Berlin Heidelberg, 2005.
[9] X.-Q. Yan and Y. Liu. XQuery optimization in