An XML Algebra For Online Processing of XML Documents
Handoko
University of Wollongong, Wollongong, Northfields Ave., NSW 2522
[email protected]
J.R. Getta
University of Wollongong, Wollongong, Northfields Ave., NSW 2522
[email protected]
ABSTRACT
In the last decade XML has become a ubiquitous standard for the representation of data. Despite the significant research efforts invested in efficient processing techniques for XML documents, we still need XML algebra operators that are specifically optimized for online processing of XML data. Online processing means that operators act on theoretically infinite sequences of XML documents and never “see” an entire set of data. Most of the XML algebras proposed so far are too resource expensive for online processing of XML documents. This paper proposes a new system of XML operators based on a formal model of extended tree grammars. We define a minimal set of basic operators and show how the other operators can be derived from the basic ones. Our system of operators allows XML documents to be processed to any depth. The system eliminates the limitations of previous approaches to online processing of XML documents by allowing each operator to be computed in an incremental and/or decremental way. The paper compares the functionality of the new system of operators with a number of XML algebras defined earlier.
Categories and Subject Descriptors
H.m [Information systems]: Miscellaneous—XML query languages
General Terms
Theory, Algorithms, Performance, Languages
Keywords
XML algebra, online processing, semi-structured data, extended tree grammar
1. INTRODUCTION
After XML became a ubiquitous data representation standard, a lot of research effort has been invested into more efficient processing techniques for XML data. New algebraic systems of operators on XML documents have been proposed and many new algorithms have been invented to optimize the processing of XML algebra expressions. Unfortunately, the existing XML algebras are not well suited for online processing of XML documents. Online processing means that theoretically infinite sequences of input data are processed in a piece-by-piece mode without having the entire set of data available from the very beginning. Online algorithms that implement online processing of data are common in applications that continuously process ever-expanding streams of data, for example, the streams of data collected from the sensors in real-time monitoring systems. Efficient implementation of online algorithms is based on the principle of incremental and/or decremental processing of data, where the current state of processing is combined with the increments and/or decrements of incoming data in order to obtain a new complete state of processing. This idea requires the operators on XML documents to be defined in a way that enables their processing in incremental and/or decremental ways.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
iiWAS, 2-4 December, 2013, Vienna, Austria
Copyright 2013 ACM 978-1-4503-2113-6/13/12 ...$15.00.
Unlike in the relational model, an XML document can have the structure of a typical text document or of a semi-structured database, and it can be disseminated in the form of XML streams upon request (pull-based) or by broadcasting (push-based). Processing the streams of XML documents then includes two stages: fragmentation and reconstruction handling, and evaluation of XML queries. This paper is focused on the evaluation of XML queries. Most of the algorithms used for the evaluation of XML streams apply a tuple-based approach, which adopts the relational database view of data. In reality, however, XML documents have hierarchical structures that are more complex than just columns and rows. Therefore unnest and nest operations are needed to process XML documents in a way similar to the relational database model, and these operations require more CPU and memory resources.
Figure 1: A TAX tree pattern (a), input tree (b), and resulting witness trees (c)
2. RELATED WORK
In XAL (XML Algebra) [4], an XML document is modeled as a rooted connected directed cyclic or acyclic graph. XAL groups the operators into the following three groups: extraction operators, which are used to retrieve data from XML documents; meta-operators, which provide a mechanism to express collections; and construction operators, which can be used to build an output document from the extracted data. XAL operators are very similar to the operators of relational algebra [3].
The XAnswer algebra [6] includes operators on relational-like data structures. It is based on some elements of XAT and Galax, whose operators are defined over ordered sets of tuples. It uses a data structure called an envelope, which is a triple <h|b|r> where h represents a header, b represents a body, and r is a result. XAnswer consists of unary operators like function execution, selection, projection, sort, index, nest, unnest, and duplicate, and binary operators such as union, cross product, and left outer join. The main advantage of XAnswer is that all operators are of the set-at-a-time type.
TAX (Tree Algebra for XML) [5] is an XML algebra that represents a document as an ordered labeled tree. An XML element is represented as a node with three attributes: tag, a single-valued attribute that indicates the type of the element; content, which represents an atomic value of any atomic type; and pedigree, which carries the information about the element's predecessor and is very useful for manipulation and comparison. TAX has the important feature of a tree pattern, which identifies the subset of nodes of interest in any tree in a set of trees and directly manipulates trees. TAX includes the operations of selection, projection, product, grouping, aggregation, renaming, reordering, copy and paste, value updates, node deletion, and node insertion, as well as set operators. Buratti [3] gives an example of a tree pattern in Fig. 1(a) which, if applied to the input tree in Fig. 1(b), gives the witness trees shown in Fig. 1(c).
Bose [2] proposed an XML algebra for fragmented XML data and its optimization framework. The system uses the operators of unnesting and nesting in a hierarchical structure.
3. XML STRUCTURES
The structure of an XML document can be formally defined by a Regular Tree Grammar (RTG) [7].
Definition 1. A Regular Tree Grammar is a context-free grammar defined as a tuple G = (N,T,S,P), where N is a finite set of non-terminal symbols, T is a finite set of terminal symbols, S∈N is a start symbol, and P is a finite set of production rules of the form X→t{r} or X→t, where X∈N, t∈T, and r is a regular expression over N and a set of operators {?,+,*,|} and brackets () for grouping.
Example 1. The following grammar is an RTG of an XML structure.
N={LIBRARY,BOOK,TITLE,AUTHORS,AUTHOR,NAME,EMAIL}
T={library,book,title,authors,author,name,email}
S={LIBRARY}
P={LIBRARY→library{BOOK*}, BOOK→book{TITLE AUTHORS}, TITLE→title, AUTHORS→authors{AUTHOR+}, NAME→name, AUTHOR→author{NAME EMAIL}, EMAIL→email}
Using the grammar above, a set of sentences can be derived as follows.
library{book{title authors{author{name email}}} book{title authors{author{name email}}
authors{author{name email}}}}
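As a rough illustration of how Definition 1 can be used, the following Python sketch checks whether a tree is derivable from the grammar of Example 1. The tuple encoding of trees and the names `PRODUCTIONS`, `model_to_regex`, and `derivable` are assumptions made for this sketch and are not part of the formal model; the terminal `authors` is included because the production for AUTHORS uses it.

```python
import re

# Productions of Example 1: non-terminal -> (terminal, content model or None).
# A content model is a regular expression over non-terminal names.
PRODUCTIONS = {
    "LIBRARY": ("library", "BOOK*"),
    "BOOK": ("book", "TITLE AUTHORS"),
    "TITLE": ("title", None),
    "AUTHORS": ("authors", "AUTHOR+"),
    "AUTHOR": ("author", "NAME EMAIL"),
    "NAME": ("name", None),
    "EMAIL": ("email", None),
}
TERMINAL_TO_NT = {t: nt for nt, (t, _) in PRODUCTIONS.items()}


def model_to_regex(model):
    """Translate a content model such as 'TITLE AUTHORS' into a Python regex
    over a sequence of space-terminated non-terminal names."""
    tokens = re.findall(r"[A-Z_]+|[?+*|()]", model)
    parts = [f"(?:{tok} )" if tok[0].isalpha() else tok for tok in tokens]
    return re.compile("".join(parts))


def derivable(node, nonterminal):
    """Check whether node = (terminal, [children]) derives from nonterminal."""
    terminal, children = node
    expected, model = PRODUCTIONS[nonterminal]
    if terminal != expected:
        return False
    if model is None:                      # rule of the form X -> t
        return not children
    child_nts = [TERMINAL_TO_NT.get(child[0]) for child in children]
    if None in child_nts:                  # unknown terminal symbol
        return False
    sequence = "".join(nt + " " for nt in child_nts)
    if not model_to_regex(model).fullmatch(sequence):
        return False
    return all(derivable(c, nt) for c, nt in zip(children, child_nts))


author = ("author", [("name", []), ("email", [])])
book = ("book", [("title", []), ("authors", [author, author])])
print(derivable(("library", [book, book]), "LIBRARY"))
```

Encoding the child sequence as a string lets the standard `re` module evaluate the content model, which keeps the sketch short at the cost of rebuilding the regex on every call.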
An Instance Grammar (IG) is a context-sensitive grammar that transforms the sentences of an RTG into instances of XML documents.
Definition 2. Let G=(N,T,S,P) be an RTG, and let a subsentence of G be any sentence that can be derived from the start symbol S or from any non-terminal symbol in N. An Instance Grammar (IG) of an RTG is a pair (T,P), where T is a finite set of terminal symbols from G and P is a finite set of production rules of the form t{x}→<t>x</t> or t→<t>#PCDATA</t>|<t/>, where t∈T and x is a subsentence of G.
Example 2 shows a sample IG for RTG from the previous example.
Example 2. An IG example
T={library,book,title,authors,author,name,email}
P={library{x}→<library>x</library>, book{x}→<book>x</book>, authors{x}→<authors>x</authors>, author{x}→<author>x</author>, name→<name>#PCDATA</name>, title→<title>#PCDATA</title>, email→<email>#PCDATA</email>}
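The effect of the IG rules can be sketched in Python with the standard `xml.etree.ElementTree` module. The function name `instantiate`, the tuple encoding of sentences, and the sample text values are hypothetical and serve only to show how t{x}→<t>x</t> and t→<t>#PCDATA</t>|<t/> turn a sentence into an instance document.

```python
import xml.etree.ElementTree as ET


def instantiate(node, values):
    """Apply IG-style rules to a sentence encoded as (terminal, [children]).

    Inner nodes follow t{x} -> <t>x</t>; leaves follow t -> <t>#PCDATA</t>,
    taking the next text value from the iterator `values`. A leaf with no
    remaining value keeps no text, which corresponds to the empty form <t/>.
    """
    terminal, children = node
    element = ET.Element(terminal)
    if children:
        for child in children:
            element.append(instantiate(child, values))
    else:
        element.text = next(values, None)
    return element


sentence = ("book", [
    ("title", []),
    ("authors", [("author", [("name", []), ("email", [])])]),
])
# The text values below are made-up sample data.
doc = instantiate(sentence, iter(["XML Algebra", "A. Author", "a@example.org"]))
print(ET.tostring(doc, encoding="unicode"))
```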
If the associations between the classes book and author are many-to-many, then an Extended Tree Grammar uses the attributes ID and IDREF to eliminate the redundancies.
Definition 3. An Extended Tree Grammar (ETG) is a 5-tuple G=(N,T,A,S,P), where N is a finite set of non-terminals, T is a finite set of terminals, A is a finite set of attribute terminals, S∈N is a start symbol, and P is a finite set of production rules of the form X → t[a]{r} or X → t[a], where X∈N, t∈T, a is a set of attributes, and r is a regular expression over N and operators {?,+,*,|} and brackets () for grouping.
ETG can be used to add attributes to some productions from Example 1, for example TITLE→title[language].
An Extended Instance Grammar (EIG) is a context-sensitive grammar that transforms the sentences of an ETG into instances of XML documents.
Definition 4. Let G=(N,T,A,S,P) be an extended tree grammar. An Extended Instance Grammar of G is a 3-tuple I=(T,A,P), where T is a finite set of element terminals derived from G, A is a finite set of attribute terminals derived from G, and P is a finite set of production rules of the form t[a]{x}→<t @a=’#PCDATA’>x</t> or t[a]→<t @a=’#PCDATA’>#PCDATA</t>|<t @a=’#PCDATA’/>, where t∈T and x is a subsentence of G.
Figure 2: Transformation rules (a) Removal of a non-terminal symbol or a production rule (b) Removal of a sub-tree (c) Extraction of a sub-tree
Figure 3: A recursive structure
book_author[autref]→<book_author @autref=#PCDATA/>
A complete XML document is a pair <ETG,EIG>, where ETG represents its structure and EIG provides the rules for creating instance XML documents.
Definition 5. Let $G_i$ and $G_j$ be ETGs. Let $G_j$ include the production rules $Y \to t_y\{r_y(X)\}$, $X \to t_x\{r_x(Z)\}$, and $Z \to t_z\{r_z\}$, where $r_y(X)$ is a regular expression that includes a non-terminal symbol $X$ and $r_x(Z)$ is a regular expression that includes a non-terminal symbol $Z$. We say that $G_i$ is a sub-grammar of $G_j$, denoted by $G_i \sqsubseteq G_j$, when $G_i$ can be obtained from $G_j$ by the application of the following transformation rules:
1. Removal of a production rule replaces a non-terminal symbol on the right-hand side of a production rule with the non-terminal symbols related to it. For example, removal of the production rule $X \to t_x\{r_x(Z)\}$ changes the rule that has the non-terminal symbol $X$ on its right-hand side into $Y \to t_y\{r_y(r_x(Z))\}$.
2. Removal of a sub-tree removes a given production rule and all rules dependent on that production rule. For example, removal of a sub-tree with a root represented by the non-terminal symbol $X$ removes the productions $X \to t_x\{r_x(Z)\}$ and $Z \to t_z\{r_z\}$ and changes the production rule for $Y$ to $Y \to t_y\{r_y\}$.
3. Extraction of a sub-tree with a root represented by a non-terminal $X$ replaces $X$ with the starting symbol $S$ and removes all non-terminal and terminal symbols that are no longer needed. For example, extraction of a sub-tree with a root at $X$ provides the productions $S \to t_x\{r_x(Z)\}$ and $Z \to t_z\{r_z\}$.
Fig. 2 illustrates transformation rules in a tree.
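The three transformation rules can be sketched in Python over a dictionary encoding of the grammar from Example 1. The encoding, the helper names (`rewrite`, `reachable`, `remove_rule`, `remove_subtree`, `extract_subtree`), and the restriction to content models without alternation are assumptions of this sketch; extraction keeps the name of the extracted non-terminal as the new start symbol instead of renaming it to S.

```python
import re

# Example 1 grammar: non-terminal -> (terminal, content model or None for a leaf).
GRAMMAR = {
    "LIBRARY": ("library", "BOOK*"),
    "BOOK": ("book", "TITLE AUTHORS"),
    "TITLE": ("title", None),
    "AUTHORS": ("authors", "AUTHOR+"),
    "AUTHOR": ("author", "NAME EMAIL"),
    "NAME": ("name", None),
    "EMAIL": ("email", None),
}


def rewrite(model, name, replacement):
    """Replace whole-word occurrences of `name` in a content model.
    With an empty replacement, a trailing ?, + or * is dropped as well."""
    if model is None:
        return None
    pattern = rf"\b{name}\b" + ("" if replacement else r"[?+*]?")
    rewritten = " ".join(re.sub(pattern, lambda _: replacement, model).split())
    return rewritten or None


def reachable(grammar, start):
    """Non-terminals reachable from `start` through content models."""
    seen, stack = set(), [start]
    while stack:
        nt = stack.pop()
        if nt in grammar and nt not in seen:
            seen.add(nt)
            stack.extend(re.findall(r"[A-Z_]+", grammar[nt][1] or ""))
    return seen


def remove_rule(grammar, x):
    """Rule 1: delete the production for x and inline its content model."""
    rx = grammar[x][1]
    inline = f"({rx})" if rx else ""
    return {nt: (t, rewrite(r, x, inline))
            for nt, (t, r) in grammar.items() if nt != x}


def remove_subtree(grammar, start, x):
    """Rule 2: delete x, its occurrences, and every rule reachable only via x."""
    pruned = {nt: (t, rewrite(r, x, ""))
              for nt, (t, r) in grammar.items() if nt != x}
    keep = reachable(pruned, start)
    return {nt: rule for nt, rule in pruned.items() if nt in keep}


def extract_subtree(grammar, x):
    """Rule 3: x becomes the start symbol; only rules it reaches are kept."""
    return x, {nt: grammar[nt] for nt in reachable(grammar, x)}


print(remove_rule(GRAMMAR, "AUTHORS")["BOOK"])
```

Each function returns a new grammar, so any sequence of the three rules applied to the original grammar yields a candidate sub-grammar in the sense of Definition 5.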
Elimination of production rules may lead to nested regular expressions. In the following cases the nesting can be eliminated with the simplification rules $(X?)^* \to X^*$, $(X?)^+ \to X^*$, $(X^+)? \to X^*$, and $(X^+)^* \to X^*$. To enable references to a particular level in a recursively defined nested structure like the one in Fig. 3, we add indexing to ETG.
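The four simplification rules can be applied mechanically. The sketch below is one possible Python rendering that rewrites content models given as strings; the function name `simplify` and the string representation are assumptions of this example.

```python
import re

# Each pair is (pattern for a nested expression, its simplified form).
RULES = [
    (re.compile(r"\((\w+)\?\)\*"), r"\1*"),   # (X?)* -> X*
    (re.compile(r"\((\w+)\?\)\+"), r"\1*"),   # (X?)+ -> X*
    (re.compile(r"\((\w+)\+\)\?"), r"\1*"),   # (X+)? -> X*
    (re.compile(r"\((\w+)\+\)\*"), r"\1*"),   # (X+)* -> X*
]


def simplify(model):
    """Apply the simplification rules until the content model stops changing."""
    previous = None
    while previous != model:
        previous = model
        for pattern, replacement in RULES:
            model = pattern.sub(replacement, model)
    return model


print(simplify("TITLE (AUTHOR+)?"))
```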
Definition 6. An Indexed Extended Tree Grammar (IETG) is a 5-tuple G = (N,T,A,S,P), where N is a non-finite set of non-terminals, T is a finite set of terminals, A is a finite set of attribute terminals, S∈N is a start symbol, and P is a finite set of production rules of the form $X_i \to t_i[a]\{r\}$ or $X_i \to t_i[a]$, where $X_i \in N$, $t \in T$, $a$ is a set of attributes, and $r$ is a regular expression over N including $X_{i+1}$ and a set of operators {?,+,*,|} and brackets () for grouping.
IETG can be used to describe the structure of documents in which some elements with the same name occur repeatedly. In contrast to ETG, IETG applies an index to every level of a repeated name, so that we get precise information about the location of each element.
4. XML ALGEBRA
The XML algebra defined in this work consists of three basic operators (restructuring, filtering, and cross product) and the operators of set algebra. The basic operators can be used to derive other operators such as join, antijoin, and semijoin.
Definition 7. Restructuring is a unary operator denoted by $\pi_y(D) = \{R_1, R_2, \ldots, R_D : \forall g_r \in G_R,\ g_r = y\}$, where $D$ is a set of documents $\{A_1, A_2, \ldots, A_D\}$, $y$ is an ETG such that $\forall g_d \in G_D,\ y \sqsubseteq g_d$, $R_1, R_2, \ldots, R_D$ are the result documents, and $G_D$, $G_R$ are the grammars of the sets of documents.
A sample implementation of the restructuring operator is presented in Algorithm 1.
Algorithm 1. Restructuring operator
procedure Restructuring(y, D)
    s ← y    ⊲ s is a sentence derived from y
    for all <d in D> do    ⊲ d is an XML document
        e ← t1    ⊲ e is set to the first terminal symbol of s
        for all <t in s> do    ⊲ t is a terminal symbol of s
            if match(d, t) then
                remove the nodes between e and t; e ← t; <Next t>
            else
                <Next d>
            end if
        end for
        Append(R, d)    ⊲ add the restructured document to R
    end for
    return R    ⊲ R is the returned set of documents
end procedure
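One way to make the restructuring operator concrete is the Python sketch below, which works on `xml.etree.ElementTree` documents. It is not a transcription of Algorithm 1: it assumes that the target grammar y is summarized by the set of terminals (tags) it keeps, that the root tag is kept, and it splices out every other element while promoting its children, which corresponds to removal of a production rule. The names `restructure` and `restructuring` are hypothetical.

```python
import xml.etree.ElementTree as ET


def restructure(element, keep):
    """Return a copy of `element` in which every descendant whose tag is not
    in `keep` is spliced out and its children are promoted to its parent."""
    copy = ET.Element(element.tag, dict(element.attrib))
    copy.text = element.text
    for child in element:
        rebuilt = restructure(child, keep)
        if child.tag in keep:
            copy.append(rebuilt)
        else:
            copy.extend(list(rebuilt))   # promote the grandchildren
    return copy


def restructuring(keep, documents):
    """pi_y(D): restructure every document of D; the inputs stay unchanged."""
    return [restructure(d, keep) for d in documents]


doc = ET.fromstring(
    "<book><title>T</title><authors><author><name>N</name></author></authors></book>"
)
result = restructuring({"book", "title", "author", "name"}, [doc])
print(ET.tostring(result[0], encoding="unicode"))
```

Returning copies instead of editing documents in place keeps earlier results valid, which matters when previous states are reused for incremental processing.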
The filtering operation preserves or removes documents from a set of documents according to a given condition.
Definition 8. Filtering is a unary operator denoted as $\sigma_\varphi(D) = \{R_1, R_2, \ldots, R_D : R_i \in X\}$, where $D$ is a set of documents, $\varphi$ is a triple <P, min, max>, P is a valid path, min is the minimum occurrence of P (default 1), and max is the maximum occurrence of P (default 1).
To differentiate between the preserve and remove semantics of the operator, we introduce the symbols $\sigma^+$ and $\sigma^-$. $\sigma^+$ denotes the preserve operator, whereas $\sigma^-$ denotes the remove operator. The filtering process can be implemented according to Algorithm 2:
Algorithm 2. Filtering operator
procedure Filtering(<P, min, max>, D)
    for all <d in D> do    ⊲ d is an XML document
        total ← countConditionMatch(P)
        if (σ+ opr and (min ≤ total ≤ max)) then
            Append(R, d)
        else if (σ− opr and not (min ≤ total ≤ max)) then
            Append(R, d)
        end if
    end for
    return R    ⊲ return the set of documents R
end procedure
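A minimal Python sketch of the filtering operator follows. It assumes that the path P is written in the limited XPath subset supported by `xml.etree.ElementTree.findall` and is evaluated relative to the document root; the function name `filtering` and its `preserve` flag (true for σ+, false for σ−) are hypothetical.

```python
import xml.etree.ElementTree as ET


def filtering(path, documents, minimum=1, maximum=1, preserve=True):
    """Keep (preserve=True) or drop (preserve=False) the documents in which
    the number of matches of `path` lies between minimum and maximum."""
    result = []
    for d in documents:
        total = len(d.findall(path))
        in_range = minimum <= total <= maximum
        if in_range == preserve:
            result.append(d)
    return result


books = [
    ET.fromstring("<book><title>A</title><author>x</author></book>"),
    ET.fromstring("<book><title>B</title><author>x</author><author>y</author></book>"),
    ET.fromstring("<book><title>C</title></book>"),
]
# Books with at most one author: path "author", min 0, max 1.
at_most_one = filtering("author", books, minimum=0, maximum=1)
print([b.find("title").text for b in at_most_one])
```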
Cross product is a binary operator which creates all possible pairs of the documents from two sets.
Definition 9. Cross product is defined as $R \times_\rho S = \{\rho\{r\ s\} \mid r \in R \wedge s \in S\}$, where $R$ and $S$ are sets of documents and $\rho$ is a valid element name that becomes the parent node of every combination of documents from $R$ and $S$.
The cross product operator can be computed according to Algorithm 3:
Algorithm 3. Cross product operator
procedure CrossProduct(R, S)
    for all <r in R> do    ⊲ r is an XML document
        for all <s in S> do    ⊲ s is an XML document
            d ← Merge(r, s)
            Append(N, d)    ⊲ append d to set N
        end for
    end for
    return N    ⊲ return the set of documents N
end procedure
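The cross product can be sketched in Python as follows. The Merge step is interpreted here as wrapping copies of r and s under a new parent element named ρ, following Definition 9; the function name `cross_product` and the parameter `rho` are assumptions of this sketch.

```python
import copy
import xml.etree.ElementTree as ET


def cross_product(R, S, rho):
    """Pair every document of R with every document of S under a parent
    element named rho. The inputs are copied so they remain unchanged."""
    result = []
    for r in R:
        for s in S:
            parent = ET.Element(rho)
            parent.append(copy.deepcopy(r))
            parent.append(copy.deepcopy(s))
            result.append(parent)
    return result


R = [ET.fromstring("<book>b1</book>"), ET.fromstring("<book>b2</book>")]
S = [ET.fromstring("<author>a1</author>")]
pairs = cross_product(R, S, "pair")
print([ET.tostring(p, encoding="unicode") for p in pairs])
```

Because the left operand always becomes the first child, swapping the arguments produces differently ordered documents.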
The algebra includes three set operators: union, intersection, and difference. Union returns all XML documents from the first input set of documents followed by those from the second set of documents. The intersection operator returns all XML documents that exist in both sets of documents. The difference operator returns all documents in the first set that do not exist in the second set.
The basic operators can be used to derive other operators like join (⊲⊳), semijoin (⋉), and antijoin (∼). Join is a binary operator that returns combined trees from two sets that have common attributes. Join can be defined by applying the filtering (σ) operator to the results of the cross product (×).
The semijoin operator returns all trees in the left-hand side set that have a common attribute with the other set. It can be defined through application of the restructuring (π) operator to the results of the join (⊲⊳) operator. The antijoin operator produces all XML documents from the left-hand side argument that do not share common attributes with the documents from the right-hand side argument.
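The derivation of join, semijoin, and antijoin from the basic operators can be sketched in Python. The sketch assumes that "common attribute" means equal values of one named XML attribute on the document roots, and it expresses antijoin as the left-hand set minus the semijoin result, which is an assumption of this example. All function names and the sample attribute `autref` values are hypothetical.

```python
import xml.etree.ElementTree as ET


def cross_product(R, S, rho):
    """All pairs (r, s) wrapped under a parent named rho (elements shared)."""
    pairs = []
    for r in R:
        for s in S:
            parent = ET.Element(rho)
            parent.extend([r, s])
            pairs.append(parent)
    return pairs


def join(R, S, rho, attr):
    """Filtering over the cross product: keep pairs that agree on `attr`."""
    return [p for p in cross_product(R, S, rho)
            if p[0].get(attr) is not None and p[0].get(attr) == p[1].get(attr)]


def semijoin(R, S, attr):
    """Restructuring over the join: keep only the left component, once."""
    matched = []
    for p in join(R, S, "pair", attr):
        if not any(p[0] is m for m in matched):
            matched.append(p[0])
    return matched


def antijoin(R, S, attr):
    """Left-hand documents that have no partner in S (assumed definition)."""
    matched = semijoin(R, S, attr)
    return [r for r in R if not any(r is m for m in matched)]


R = [ET.fromstring('<book autref="a1"/>'), ET.fromstring('<book autref="a2"/>')]
S = [ET.fromstring('<author autref="a1"/>')]
print(len(join(R, S, "pair", "autref")))
```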
5. DISCUSSION
A convenient way to model the structure of an XML document is to use an RTG. Our data model of an XML document is a pair <ETG, EIG>, where ETG represents the structure or schema of the document and EIG can be used to create an instance XML document. ETG is a version of RTG extended with a representation of attributes (including ID and IDREF attributes). Using ETG we can define XML structures and also manipulate them, whereas EIG provides a definition for creating instances of XML documents. The main difference between ETG and EIG is that production rules in ETG employ non-terminal symbols (represented by capital letters) on the left-hand side, and a terminal symbol followed by a regular expression over non-terminal symbols on the other side. In contrast, production rules in EIG use terminal symbols on the left-hand side, and tags with values on the other side.
TAX includes the attributes in its tree pattern when processing queries, while SAL [1] treats attributes in the same way as elements. Our algebra enables separate definitions of attributes to provide a better and more formal definition when we need to deal with the attributes ID and IDREF later on.
When processing XML documents, the document structure may be modified by pruning the tree, by cutting the top of the tree, or by extracting or removing a subtree from the document. The result can therefore be a document whose structure is a sub-grammar of the original document.
Our XML algebra provides three basic operators with orthogonal semantics. These operators can be used to define other operators needed to process XML document queries. With the restructuring operator we are able to get a portion of the data by removing a non-terminal symbol or a production rule, or by keeping or removing a sub-tree of an XML document. This is very useful when only a portion of the data needs to be processed. The restructuring operator uses an ETG to determine the structure of its output. Removing a part of an XML document can be defined by deleting one or more non-terminal symbols and/or production rules from its ETG, and deleting non-terminal symbols may lead to the need for re-assignment of their related production rule(s). For example, the tree pattern in Fig. 1(a) can be replaced by the following ETG:
N={BOOK,YEAR,AUTHOR}
T={book,year,author}
A={}
S={BOOK}
P={BOOK→book{YEAR AUTHOR+}, YEAR→year, AUTHOR→author}
Deletion of the top level of an XML document removes the production rule(s) that include the starting symbol. If the deleted production rules have more than one non-terminal symbol on the right-hand side, we get the same number of starting symbols, which means that we also get the same number of new ETGs.
The filtering operator operates on a set of documents. In TAX, the operation $\sigma_{P,SL}(C)$ using $SL$ as in Fig. 1 results in two copies of the input tree. In our system, the filtering operation $\sigma_{\texttt{/book/author+ AND /book/year/text()<1998}}(N)$ returns one copy of the input tree as the output, which is a more appropriate result.
The advantage of TAX is that it has a mechanism to combine the projection and selection operations in one operation. On the other hand, the tree pattern is limited in processing huge queries and it fails to express queries in a recursive way. It is impossible to express a query like "retrieve all books which have at most 1 author" in TAX. Our XML algebra expresses this query using ETG in the following way.
N={BOOK,TITLE,AUTHOR}
T={book,title,author}
A={}
S={BOOK}
P={BOOK→book{TITLE AUTHOR?}, TITLE→title, AUTHOR→author}
The cross product operation and the other operators that can be derived from it are much the same as in TAX. In both algebras the cross product operation is not commutative, so that R×S ≠ S×R.
In batch processing, indexing and materialization make ad-hoc queries efficient. In contrast, online processing of continuous queries requires incremental and/or decremental computation. Such computations find the new results based on a prior state and new inputs without recomputing the prior results, regardless of whether they work in time-based mode (for example using timestamps) or tuple-based mode (sliding windows).
The arguments of our XML algebra are sets of documents, and we assume that every increment or decrement of an argument is an XML document as well. Suppose we have an expression $e(A_1, A_2, \ldots, A_D)$, and $\delta A_i$ is an increment/decrement of a set of documents $A_i$, i.e. $A_i \oplus \delta A_i$. We should be able to compute the incremental and/or decremental data $f(A_1, \ldots, \delta A_i, \ldots, A_D)$ and then integrate the result with the previous result, so that $e(A_1, \ldots, A_i \oplus \delta A_i, \ldots, A_D) = e(A_1, \ldots, A_D) + f(A_1, \ldots, \delta A_i, \ldots, A_D)$.
For example, the union operation ($\cup$) over incremental and/or decremental data in the expression $e = (r_1 \oplus \delta r_1) \cup r_2$ can be computed by distributing $\cup$ over $\oplus$. Then $(r_1 \oplus \delta r_1) \cup r_2 = (r_1 \cup r_2) \oplus \delta r_1$, where $(r_1 \cup r_2)$ is the previous result and $\oplus\, \delta r_1$ is the function that computes the incremental and/or decremental data. $f$ is a function that needs to be defined so that all operations over incremental and/or decremental data follow the form $e(r) + f(\delta r)$.
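The union case can be illustrated with a short Python sketch. It assumes set semantics, represents documents by hashable identifiers, and assumes that a decrement contains only documents currently in $r_1$. The function names are hypothetical. Under set semantics the decremental case also needs $r_2$, because a removed document stays in the union if $r_2$ still contains it.

```python
def union_increment(previous, delta):
    """(r1 + delta) U r2 from the stored union: add the new documents."""
    return previous | delta


def union_decrement(previous, delta, r2):
    """(r1 - delta) U r2 from the stored union: drop the removed documents
    unless r2 still holds them."""
    return previous - (delta - r2)


r1, r2 = {"a", "b"}, {"b", "c"}
previous = r1 | r2                      # result computed before the update
print(sorted(union_increment(previous, {"d"})))
```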
Although the cross product (×) and join (⊲⊳) operators for two infinite streams might need to keep data from the beginning of the XML stream, processing of increment/decrement documents (δr) by those operators does not need recomputing of previous results. The proposed XML algebra uses prior results and only computes the incremental and/or decremental data, which is usually much smaller than the previous set (δr ≪ r).
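A small Python sketch shows why the cross product needs the other operand but not a recomputation. Documents are represented by identifiers and pairs by tuples, both of which are assumptions of this example, as are the names `cross` and `cross_update`.

```python
from itertools import product


def cross(r, s):
    """Full cross product of two sets of document identifiers."""
    return set(product(r, s))


def cross_update(previous, delta_r, s, increment=True):
    """Update a stored r x s after r changes by delta_r.

    Only delta_r x s is computed; the retained operand s is still needed.
    """
    f = cross(delta_r, s)
    return previous | f if increment else previous - f


r, s = {"r1", "r2"}, {"s1", "s2"}
previous = cross(r, s)
print(len(cross_update(previous, {"r3"}, s)))
```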
In contrast with other operators, an increment of documents on the right-hand side of the antijoin (∼) or minus operator decreases the result, and vice versa. Although computing the difference (−) and antijoin (∼) of two infinite streams is a challenging problem, the ability to find a function (f) helps us obtain the correct result.
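The inverse behaviour of the right-hand operand of the difference can be sketched in Python under set semantics. The sketch assumes that a decrement of s contains only documents currently in s; the function names are hypothetical.

```python
def difference_right_increment(previous, delta_s):
    """r - (s + delta_s): new right-hand documents shrink the stored result."""
    return previous - delta_s


def difference_right_decrement(previous, delta_s, r):
    """r - (s - delta_s): documents removed from s reappear if r holds them."""
    return previous | (delta_s & r)


r, s = {"a", "b", "c"}, {"b"}
previous = r - s                        # stored result {"a", "c"}
print(sorted(difference_right_increment(previous, {"c"})))
```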
It is possible to show that the function (f) is always either a function of ⊕ or a function of (−/∼). We can then conclude that a function can always be found to compute incremental and/or decremental data and therefore avoid recomputing previous results.
For online processing, our XML algebra, which uses the indexed extended tree grammar, has some strong advantages. First, the indexed extended tree grammar by nature has the same structure as an XML document, so our XML algebra stays consistent with the relational model without transforming the XML structure into relational tables (nest and unnest), which we believe requires extra computation. Second, the concept of sub-grammar can be used to work with fragmented hierarchical data structures. Using a sub-grammar we can verify whether a chunk of XML data in online mode follows the defined patterns or not. Last, the indexed extended tree grammar makes it possible to have a theoretically infinite tree pattern and to translate complex tree patterns such as those in recursive structures.
6. CONCLUSIONS
The XML algebra proposed in this work uses a pair <ETG,EIG> to define the structure of an XML document. ETG extends RTG in the representation of its model and operations. ETG appends a component to the tuple to provide the definition of attributes, including ID and IDREF. EIG (Extended Instance Grammar) gives rules for creating instance XML documents.
Our XML algebra consists of three operators (restructuring, filtering, and cross product), three set operators (union, intersection, and difference), and some derived operators (join, semijoin, and antijoin). It is consistent with the relational data model algebra and can currently express XQuery without aggregation and ordering features, with room to include these features in the future.
The operators of the XML algebra act on sets of documents and return sets of documents as output. The restructuring operator can express the operation of projection in general. The filtering operator performs selection and deletion in the same way. This algebra has the ability to collect the documents that satisfy a condition for a certain number of occurrences in the documents. The proposed XML algebra meets the needs of online processing of XML documents for several reasons. First, it works on the tree structure and so avoids unnest and nest operations. In addition, the concept of sub-grammar helps to process fragmented data, whereas the indexed extended tree grammar gives better performance in representing tree patterns in recursive structures. By obtaining a function (f) for every operator, we can show that our algebra works well for incremental and/or decremental sets of documents.
7. ACKNOWLEDGMENTS
This work is supported by Satya Wacana Christian Univ. and a scholarship from the Directorate General of Higher Education (Dikti), Indonesian Ministry of National Education.
8. REFERENCES
[1] C. Beeri and Y. Tzaban. SAL: An algebra for semistructured data and XML. In Informal Proc. of Workshop on The Web and Databases, ACM SIGMOD, pages 37–42. ACM Press, 1999.
[2] S. Bose, L. Fegaras, D. Levine, and V. Chaluvadi. A query algebra for fragmented XML stream data. In Proceeding of 9th International Conference on Data Base Programming Languages (DBPL), pages 275–277, Potsdam, Germany, September 6-8 2003.
[3] G. Buratti. A Model and an Algebra for Semi-Structured and Full-Text Queries. PhD thesis, Informatica, Università di Bologna, Padova, 2007.
[4] F. Frasincar, G.-J. Houben, and C. Pau. XAL: an algebra for XML query optimization. Aust. Comput. Sci. Commun., 24(2):49–56, January 2002.
[5] H. V. Jagadish, L. V. S. Lakshmanan, D. Srivastava, and K. Thompson. TAX: A tree algebra for XML. In Proc. DBPL Conf, pages 149–164, 2001.
[6] M. Lukichev, B. Novikov, and P. Mehra. An XML-algebra for efficient set-at-a-time execution. ComSIS, 9(1):64–80, January 2012.