AN XML ALGEBRA FOR
ONLINE PROCESSING OF
XML DOCUMENTS
By: Handoko and Janusz R. Getta
Outline
❖ Introduction
❖ Related Work
§ Existing XML Data Models & Algebras
❖ XML Structures
❖ XML Algebra
❖ Conclusion & Future Work
Introduction - Motivation
❖ Since XML documents have been accepted as a standard in information systems, online processing of semistructured data has become more important.
❖ Online processing applies online algorithms, which process data piece by piece.
❖ The performance of semistructured data processing is affected by several aspects:
§ Semistructured data has a more complex structure than columns and rows.
§ Existing data models require the data to be complete before it can be processed, while online processing needs to process data piece by piece.
§ In some existing data models, semistructured data is treated as relational-type data:
• XML streams are evaluated with a tuple-based approach.
• Unnest and nest operations have to be employed.
• More CPU and memory resources are required.
An Example of Online Processing - Online Integration
[Figure: online integration; labels: Applications, Data Sources, Previous Result, Recent Data]
Includes:
1. Fragmentation Handling
2. XML Algebra
3. Execution Plan Algorithms
4. Scheduling Algorithms
Related Work - XML Data Model & Algebra
❖ XML Algebra (XAL) [4]
§ XML is modeled as a rooted, connected, directed graph (cyclic or acyclic)
§ Vertices represent elements, edges represent simple values
§ Has three groups of operators:
• Extraction operators
– Projection, selection, distinct, join, sort, product
• Meta-operators
– Map, Kleene star
• Construction operators
– Create vertex, create edge, copy
Literature Review - XML Data Model & Algebra
❖ XAnswer [6]
§ Algebra over a relational-like data structure
§ Uses a data structure called Envelope <he|be|re>
§ The header he is an unordered set of attributes (A), the body be is a set of pairs (A, v) where v is a value, and re represents the result
§ XAnswer provides unary operators (function execution, selection, projection, sort, index, nest, unnest, and duplicate) and binary operators such as union, cross product, and left outer join.
§ The union operator in XAnswer does not remove duplicates.
§ XAnswer also provides a left outer join operation directly, instead of expressing it using selection, cross product, and union operators.
Literature Review - XML Data Model & Algebra
❖ TAX (Tree Algebra for XML) [5]
§ Represents the document as an ordered labeled tree. Every XML element is represented as a node which has:
• tag attribute: a single-valued attribute which indicates the type of the element;
• content attribute: represents an atomic value, which can be of any atomic type;
• pedigree attributes: carry information about the element's predecessor, which is very useful for manipulation and comparison.
§ TAX proposes the idea of a tree pattern (TP), which is a very different concept from classical relational algebra.
§ TAX provides Selection, Projection, Product, Grouping, Aggregation, Renaming, Reordering, Copy and Paste, Value Updates, Node Deletion, and Node Insertion operations, and some set operators.
XML Data Model – Extended Tree Grammar
❖ The structure of an XML document is a pair <ETG, EIG>, where ETG is an Extended Tree Grammar and EIG is an Extended Instance Grammar.
Example of ETG and its sentence
XML Data Model – Extended Instance Grammar
Applying EIG to an ETG sentence
❖ When we get a terminal symbol library(...), we apply the production rule which matches that terminal symbol:
library(x)→<library>x</library>
❖ The x in step 1 consists of the nodes books and authors at the same level, so we need to apply the corresponding production rules
books(x)→<books>x</books> and authors(x)→<authors>x</authors>
Our instance XML document will become:
<library>
<books>x</books><authors>x</authors>
</library>
❖ On reaching a terminal symbol which has no inner structure (a terminal which is not followed by an opening curly bracket), we translate it using the related production rule. For example,
title→<title>#PCDATA</title> will be translated into: <title>Basic XML</title>
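The expansion steps above can be sketched in Python. This is only an illustration: the nested-tuple encoding of an ETG sentence, the helper name `render`, and the placement of `title` directly under `books` are assumptions, not part of the original formalism.

```python
from xml.sax.saxutils import escape

# Hypothetical encoding of an ETG sentence: (terminal, content), where content
# is either a list of child terms (inner structure) or a string (#PCDATA).
sentence = ("library", [
    ("books", [("title", "Basic XML")]),
    ("authors", []),
])

def render(term):
    """Apply the rule t(x) -> <t>x</t> recursively to one term."""
    tag, content = term
    if isinstance(content, str):
        # Terminal without inner structure: #PCDATA is replaced by the text.
        inner = escape(content)
    else:
        # Terminal with inner structure: expand every child at the same level.
        inner = "".join(render(child) for child in content)
    return f"<{tag}>{inner}</{tag}>"

xml_text = render(sentence)
print(xml_text)
```

Each recursive call plays the role of one production rule application, so the outermost call corresponds to library(x)→<library>x</library>.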
❖ A terminal symbol with no inner structure that is followed by a square bracket ([) will be translated using the production rules defined. For example:
XML Data Model – Indexed ETG
❖ Then we extend ETG to accommodate recursive structures as:
XML Algebra
❖ The XML algebra in this system consists of:
§ Basic operations
• Restructuring operation (π)
• Filtering operation (σ)
• Cross product operation (×)
§ Set operations (∪, ∩, and -)
§ Derived operations (join, semijoin, and antijoin)
XML Algebra - Sub-Grammar
❖ Let G_i and G_j be ETGs. Let G_j include the following production rules:
Y → t_y{r_y(X)}, X → t_x{r_x(Z)}, Z → t_z{r_z}
where r_y(X) is a regular expression that includes a non-terminal symbol X and r_x(Z) is a regular expression that includes a non-terminal symbol Z.
❖ We say that G_i is a sub-grammar of G_j when G_i can be obtained from G_j by the application of the following transformation rules:
XML Algebra – Sub-Grammar
❖ Transformation of document structure:
XML Algebra - Restructuring
❖ Restructuring operation can be defined as:
Algorithm for Restructuring
XML Algebra - Filtering
❖ Filtering operation can be defined as:
Definition 8. Filtering is a unary operator denoted as
σ_ϕ(D) = {R1, R2, …, RD : Ri ∈ X},
where D is a set of documents, ϕ is a triple <P, min, max>, P is a valid path, min is the minimum occurrence of P (default -1), and max is the maximum occurrence of P (default -1).
❖ To differentiate between the preserve and remove semantics of the operator, we introduce the symbols σ+ and σ-.
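A minimal Python sketch of the filtering operator may help make the triple <P, min, max> concrete. It rests on several assumptions that the slides do not state: a bound of -1 is read as "not set", P is given as an ElementTree path relative to the document root, σ+ keeps the documents that satisfy the condition, and σ- keeps the ones that do not. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def matches(doc, path, min_occ=-1, max_occ=-1):
    """True when the number of nodes reached by `path` lies within the bounds.
    Assumption: a bound of -1 means the bound is not set."""
    count = len(doc.findall(path))
    if min_occ != -1 and count < min_occ:
        return False
    if max_occ != -1 and count > max_occ:
        return False
    return True

def sigma_plus(docs, path, min_occ=-1, max_occ=-1):
    """Preserve semantics (assumed): keep the documents that match."""
    return [d for d in docs if matches(d, path, min_occ, max_occ)]

def sigma_minus(docs, path, min_occ=-1, max_occ=-1):
    """Remove semantics (assumed): drop the documents that match."""
    return [d for d in docs if not matches(d, path, min_occ, max_occ)]

docs = [ET.fromstring(s) for s in (
    "<book><year>2001</year><author>A</author></book>",
    "<book><year>2005</year><author>B</author><author>C</author></book>",
)]

kept = sigma_plus(docs, "author", max_occ=1)      # books with at most one author
dropped = sigma_minus(docs, "author", max_occ=1)  # the remaining books
```

Under these assumptions, an occurrence bound such as "at most one author" becomes a single filter call with max set to 1 and min left at its default.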
XML Algebra – Cross Product
❖ Cross product is a binary operator which creates all possible pairs of documents from two sets.
Definition 9. Cross product is defined as
R ×_ρ S = {ρ{r s} : r ∈ R ∧ s ∈ S},
where R and S are sets of documents and ρ is a valid element name which will be the parent node of every combination of documents from R and S.
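Definition 9 can be illustrated with a short Python sketch. The function name `cross_product`, the use of ElementTree elements as documents, and the sample data are assumptions for illustration; this is not the algorithm from the original slides.

```python
import copy
import xml.etree.ElementTree as ET

def cross_product(R, S, rho):
    """Pair every document of R with every document of S under a new parent rho."""
    result = []
    for r in R:
        for s in S:
            parent = ET.Element(rho)
            # Deep copies keep the input documents unchanged.
            parent.append(copy.deepcopy(r))
            parent.append(copy.deepcopy(s))
            result.append(parent)
    return result

R = [ET.fromstring("<book>Basic XML</book>"),
     ET.fromstring("<book>Advanced XML</book>")]
S = [ET.fromstring("<author>A</author>")]

pairs = [ET.tostring(p, encoding="unicode") for p in cross_product(R, S, "pair")]
```

The result has |R| × |S| documents, each rooted at an element named ρ (here the hypothetical name `pair`).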
Algorithm for Cross Product
Tree Pattern Representation
N={BOOK,YEAR,AUTHOR}
T={book,year,author}
A={}
S={BOOK}
P={BOOK→book{YEAR AUTHOR+},
YEAR→year,
AUTHOR→author}
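As a rough illustration, the grammar above can be written down as Python data and used to check whether a document conforms to it. The encoding is an assumption: each tag is taken to belong to exactly one non-terminal (true for this grammar, not for tree grammars in general), content models are limited to sequences with the quantifiers +, * and ?, and the empty attribute set A is ignored. The names `model_to_regex` and `conforms` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Production rules P: non-terminal -> (terminal tag, content model over non-terminals).
# An empty content model means the element has no child elements.
P = {
    "BOOK": ("book", "YEAR AUTHOR+"),
    "YEAR": ("year", ""),
    "AUTHOR": ("author", ""),
}
S = "BOOK"  # start symbol
TAG_TO_NT = {tag: nt for nt, (tag, _) in P.items()}

def model_to_regex(model):
    """Turn e.g. 'YEAR AUTHOR+' into the regex '(?:YEAR )(?:AUTHOR )+'."""
    parts = []
    for token in model.split():
        name, quantifier = re.fullmatch(r"([A-Z]+)([+*?]?)", token).groups()
        parts.append(f"(?:{name} ){quantifier}")
    return "".join(parts)

def conforms(elem, nt=S):
    """Check an element and its subtree against the production for `nt`."""
    tag, model = P[nt]
    if elem.tag != tag:
        return False
    # Spell the children as non-terminals, e.g. 'YEAR AUTHOR AUTHOR '.
    spelled = "".join(TAG_TO_NT.get(child.tag, "?") + " " for child in elem)
    if re.fullmatch(model_to_regex(model), spelled) is None:
        return False
    return all(conforms(child, TAG_TO_NT[child.tag]) for child in elem)

ok = ET.fromstring("<book><year>2001</year><author>A</author><author>B</author></book>")
bad = ET.fromstring("<book><year>2001</year></book>")
ok_result = conforms(ok)
bad_result = conforms(bad)
```

The second document is rejected because AUTHOR+ requires at least one author child.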
Query addition
❖ It is rather difficult to express queries such as:
§ retrieve all books which have at most 1 author
XML Algebra – Online Processing
❖ The arguments of the XML algebra are sets of documents, and we assume that every increment/decrement of an argument is an XML document.
❖ For online integration, consider an integration as a UNION operation. We should be able to compute over the increment/decrement data (δAi) and integrate the result with the previous one:
e(A1, …, Ai ⊕ δAi, …, AD) = e(A1, …, Ai, …, AD) ⊕ f(A1, …, δAi, …, AD)
❖ Example, applying ∪ over ⊕:
(R1 ⊕ δ1) ∪ R2 = (R1 ∪ R2) ⊕ δ1
❖ f is a function that needs to be defined so that all operators over increment/decrement data follow this form.
❖ We found that f is a function of either ⊕ or -.
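The union example can be checked with a small Python sketch. Documents are represented here as serialized strings in Python sets, and the helper names `e` and `apply_delta` are hypothetical; both are assumptions made for illustration only.

```python
def e(r1, r2):
    """The expression under study: a plain union of two document sets."""
    return r1 | r2

def apply_delta(current, delta, increment=True):
    """⊕: add the delta documents (increment) or take them away (decrement)."""
    return current | delta if increment else current - delta

R1 = {"<book>A</book>", "<book>B</book>"}
R2 = {"<book>C</book>"}
delta1 = {"<book>D</book>"}

previous = e(R1, R2)                          # result computed earlier
recomputed = e(apply_delta(R1, delta1), R2)   # full recomputation on R1 ⊕ δ1
incremental = apply_delta(previous, delta1)   # reuse the previous result; here f yields δ1

print(recomputed == incremental)
```

For an increment, the incremental result equals the full recomputation. For a decrement, the same shortcut is only safe in this sketch when no document of δ1 also occurs in R2, which is one reason f has to be defined per operator.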
Conclusion
❖ The proposed XML algebra is consistent with relational algebra.
❖ It meets the needs of online processing:
§ It works on a tree structure to avoid nest and unnest operations.
§ It is possible to find a function to process increment data.
References
[1] C. Beeri and Y. Tzaban. SAL: An algebra for semistructured data and XML. In Informal Proc. of Workshop on The Web and Databases, ACM SIGMOD, pages 37-42. ACM Press, 1999.
[2] S. Bose, L. Fegaras, D. Levine, and V. Chaluvadi. A query algebra for fragmented XML stream data. In Proceedings of the 9th International Conference on Data Base Programming Languages (DBPL), pages 275-277, Potsdam, Germany, September 6-8, 2003.
[3] G. Buratti. A Model and an Algebra for Semi-Structured and Full-Text Queries. PhD thesis, Informatica, Università di Bologna, Padova, 2007.
[4] F. Frasincar, G.-J. Houben, and C. Pau. XAL: an algebra for XML query optimization. Aust. Comput. Sci. Commun., 24(2):49-56, January 2002.
[5] H. V. Jagadish, L. V. S. Lakshmanan, D. Srivastava, and K. Thompson. TAX: A tree algebra for XML. In Proc. DBPL Conf., pages 149-164, 2001.
[6] M. Lukichev, B. Novikov, and P. Mehra. An XML-algebra for efficient set-at-a-time execution. ComSIS, 9(1):64-80, January 2012.
[7] M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol., 5(4):660-704, Nov. 2005.
Research Progress
❖ The structure of an XML document can be formally defined by a Regular Tree Grammar.
Research Progress
❖ We introduce a grammar for creating instance XML documents, called the Instance Grammar (IG). IG is a context-sensitive grammar which transforms regular tree grammar sentences into instances of XML documents.
XML Data Model – Regular Tree Grammar
❖ The structure of an XML document can be formally defined by an RTG (Regular Tree Grammar).
XML Data Model – Instance Grammar