An XML document is commonly viewed as a tree whose internal nodes are element or attribute names, whose edges represent the membership of a child node in its parent node, and whose leaves are element contents or attribute values. An XML database is a forest of XML document trees. Fig. 2 shows the tree representation of the XML document given in Fig. 1.
Fig. 1. XML document.
Fig. 2. Tree representation of an XML document.
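The tree view described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample document and the helper name `tree_edges` are hypothetical (they are not the content of Fig. 1), and attribute nodes are written with an `@` prefix by convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are illustrative only.
XML = '<person id="p1"><name>Ann</name><skill level="4">Java</skill></person>'


def tree_edges(elem):
    """Return (parent, child) pairs of the tree view of an element.

    Element and attribute names are internal nodes; element contents and
    attribute values are leaves.
    """
    edges = []
    for attr, value in elem.attrib.items():
        edges.append((elem.tag, "@" + attr))   # element -> attribute name
        edges.append(("@" + attr, value))      # attribute name -> value (leaf)
    text = (elem.text or "").strip()
    if text:
        edges.append((elem.tag, text))         # element -> content (leaf)
    for child in elem:
        edges.append((elem.tag, child.tag))    # element -> child element
        edges.extend(tree_edges(child))
    return edges


root = ET.fromstring(XML)
edges = tree_edges(root)
for parent, child in edges:
    print(parent, "->", child)
```

Each printed pair is one edge of the tree; a forest of such trees would correspond to an XML database in the sense used here.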
The use of XML documents requires the ability to extract information and reformulate it for applications. Many languages therefore exist for retrieving information from semi-structured documents. Here we present the two most popular: XPath and XQuery.
2.1 XPath
XPath designates one or more nodes in an XML document by means of path expressions. An XPath expression is a sequence of steps: [/]step1/step2/.../stepn. Each step consists of an axis, a filter and an optional predicate: axis::filter[predicate]. The axis indicates the search direction. The most commonly used axes are the parent-child axis (written A/B) and the descendant axis (written A//B). The filter selects a node type. For example, the expression A/B returns all B elements that are children of an A element. Predicates select nodes according to their content.
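The two axes and a predicate can be tried out with Python's standard `xml.etree.ElementTree` module, which implements only a limited subset of XPath. The document below is a hypothetical example, and paths are evaluated relative to the root element `A`, so `./B` plays the role of A/B and `.//B` the role of A//B.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only to illustrate the axes.
XML = """
<A>
  <B id="1"><C>x</C></B>
  <D><B id="2"><C>y</C></B></D>
</A>
"""

root = ET.fromstring(XML)  # root is the <A> element

# Parent-child axis (A/B): B elements that are direct children of A.
children = [b.get("id") for b in root.findall("./B")]

# Descendant axis (A//B): B elements at any depth below A.
descendants = [b.get("id") for b in root.findall(".//B")]

# Predicate on content: B descendants that have a C child whose text is 'y'.
filtered = [b.get("id") for b in root.findall(".//B[C='y']")]

print(children, descendants, filtered)
```

A full XPath processor supports many more axes and predicate forms than this subset.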
2.2 XQuery
XQuery [7] is the query language recommended by the W3C for extracting information from many types of XML data sources. XQuery is to XML data what SQL is to relational databases, and it inherits properties from several other languages. From XPath it takes the path expression syntax for addressing elements in XML documents. From SQL it takes the idea of a series of keyword-based clauses that provide a model for restructuring data (the SQL SELECT-FROM-WHERE model). XQuery offers several forms of expression, the best known being the FLWOR form. The acronym FLWOR comes from the reserved words that introduce the main clauses of this type of expression: For, Let, Where, Order By, Return.
Each clause in a FLWOR expression plays a particular role in the query and some of these clauses are optional. Thus, a FLWOR instruction consists of the following parts:
– For: iteration on a list of XML document parts
– Let: assignment of values to a variable
– Where: restriction clause (constraints)
– Order by: sorting of results
– Return: form of the expression to be returned
An example of this type of query is given in Fig. 3. This query selects the email addresses and skills of people who have a Java skill level above 3 and more than two years of experience.
Fig. 3. Example of an XQuery query.
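To make the role of each clause concrete, the following Python sketch mimics a FLWOR expression over a small document. It is not the query of Fig. 3: the document structure (`people`, `person`, `email`, `experience`, `skill`), the function name `flwor` and the ordering by email are assumptions made for illustration, and the filter is modelled on the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; the schema is an assumption, not taken from the paper.
XML = """
<people>
  <person><email>dee@example.org</email><experience>4</experience>
    <skill name="Java" level="5"/></person>
  <person><email>bob@example.org</email><experience>1</experience>
    <skill name="Java" level="5"/></person>
  <person><email>cy@example.org</email><experience>3</experience>
    <skill name="Java" level="2"/></person>
  <person><email>ann@example.org</email><experience>5</experience>
    <skill name="Java" level="4"/><skill name="SQL" level="2"/></person>
</people>
"""


def flwor(root):
    selected = []
    for person in root.findall("./person"):                  # For
        java = person.find("./skill[@name='Java']")          # Let
        years = int(person.findtext("experience"))           # Let
        if (java is not None                                 # Where
                and int(java.get("level")) > 3
                and years > 2):
            selected.append(person)
    selected.sort(key=lambda p: p.findtext("email"))         # Order by
    return [                                                 # Return
        (p.findtext("email"), [s.get("name") for s in p.findall("skill")])
        for p in selected
    ]


result = flwor(ET.fromstring(XML))
print(result)
```

In XQuery itself these steps are declared as clauses of a single expression, and the processor is free to choose how to evaluate them.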
2.3 Open Source Implementations of XQuery
There are many open source implementations of XQuery. We present here a non-exhaustive list of them:
– BaseX is a very light-weight, high-performance and scalable XML Database engine and XPath/XQuery 3.0 Processor, including full support for the W3C Update and Full Text extensions, all developed in Java. It comes with interactive user interfaces (desktop, web-based) that give users great insight into their data [8].
– eXist-db is a high-performance open source native XML database: a NoSQL document database and application platform built entirely around XML technologies. A browser-based IDE allows managing and editing all artifacts belonging to an application. Syntax coloring, code completion and error checking help to get it right. Being a complete solution, eXist-db tightly integrates with XForms for complex form development [12].
– Galax is an open-source implementation of XQuery, the W3C XML Query Language. It includes several advanced extensions for XML updates, scripting, and distributed programming. Implemented in O’Caml, Galax comes with a state-of-the-art compiler and optimizer. Most of Galax’s architecture is formally documented, making it ideal for users interested in teaching XQuery, in building new language extensions, or developing new optimizations [11].
– Oracle Berkeley DB XML is an open source, embeddable XML database with XQuery-based access to documents stored in containers and indexed based on their content. Implemented in C, Oracle Berkeley DB XML is built on top of Oracle Berkeley DB and inherits its rich features and attributes.
XML queries (XPath or XQuery) that are to be evaluated on XML documents (trees) need to be represented in a model (a tree representation) in order to facilitate their evaluation. Evaluating the query is then equivalent to applying the corresponding model to the XML tree through a tree pattern matching process.
2.4 Generalized Tree Pattern (GTP)
The concept of the generalized tree pattern (GTP) was introduced in [4] and makes it possible to express the semantics of XQuery more precisely. An arc of a GTP may be parent-child (PC), ancestor-descendant (AD) or optional; these are drawn as solid edges, double solid edges and dotted edges, respectively. A mandatory arc links a subexpression corresponding to the FOR and WHERE clauses to the rest of the query. An optional arc links a subexpression corresponding to the LET and RETURN clauses to the rest of the query.
Definition 1. A generalized tree pattern is a pair G = (T, F), where T is a tree and F is a Boolean formula, such that:
– Each node in the tree T is labeled with a distinct variable and has a group number.
– Each arc of T is associated with a pair of labels <x, m>, where x ∈ {PC, AD} specifies the axis (parent-child or ancestor-descendant, respectively) and m ∈ {mandatory, optional} specifies the arc’s status.
– F is a Boolean combination of predicates applicable to nodes.
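Definition 1 maps naturally onto a small data structure. The sketch below is one possible encoding, not the representation used in [4]: the class names `Node`, `Arc` and `GTP`, the modelling of F as a Python callable over variable bindings, and the example pattern with its group numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

AXES = {"PC", "AD"}
STATUSES = {"mandatory", "optional"}


@dataclass(frozen=True)
class Node:
    var: str    # distinct variable labelling the node, e.g. "$p"
    tag: str    # element name the node is meant to match
    group: int  # group number


@dataclass(frozen=True)
class Arc:
    parent: str  # variable of the parent node
    child: str   # variable of the child node
    axis: str    # x in {PC, AD}
    status: str  # m in {mandatory, optional}

    def __post_init__(self):
        if self.axis not in AXES:
            raise ValueError(f"unknown axis: {self.axis}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")


@dataclass
class GTP:
    nodes: dict = field(default_factory=dict)  # var -> Node (the tree T)
    arcs: list = field(default_factory=list)   # labelled arcs of T
    formula: object = None                     # F, modelled as a callable

    def add_node(self, var, tag, group):
        if var in self.nodes:
            raise ValueError(f"variable {var} is already used")
        self.nodes[var] = Node(var, tag, group)

    def add_arc(self, parent, child, axis, status):
        if parent not in self.nodes or child not in self.nodes:
            raise KeyError("both endpoints must be nodes of the pattern")
        self.arcs.append(Arc(parent, child, axis, status))

    def group(self, number):
        return sorted(v for v, n in self.nodes.items() if n.group == number)


# Hypothetical pattern: persons with an age child, optionally returning name.
gtp = GTP(formula=lambda binding: True)
gtp.add_node("$p", "person", 0)
gtp.add_node("$a", "age", 0)
gtp.add_node("$n", "name", 1)
gtp.add_arc("$p", "$a", "PC", "mandatory")
gtp.add_arc("$p", "$n", "PC", "optional")
print(gtp.group(0))
```

Validating the axis and status when an arc is created keeps every stored arc within the label sets given in the definition.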
Zhimin Chen et al. [4] also propose an algorithm for translating an XQuery expression¹ into a GTP. The query is put in a canonical form and is then parsed clause by clause, while the GTP is built up progressively until the last clause is reached; see [4] for more details. The GTP is intended to be mapped (pattern matching) onto the XML tree.
¹ The GTP model deals with a very significant subset of the XQuery language and supports nesting, quantifiers, grouping and aggregation.
Definition 2. A pattern match of a GTP G = (T, F) in a tree collection C is a partial mapping h : G → C such that:
– h is defined on at least all the nodes of group 0 of G.
– h preserves the relational structure of G. This means that whenever h is defined on two nodes u, v and there is a PC (respectively AD) arc (u, v) in G, then h(v) is a child (respectively a descendant) of h(u).
– h satisfies the Boolean formula F of G (Fig. 4).
Fig. 4. Example of an XQuery expression and corresponding GTP query.
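The three conditions of Definition 2 can be checked mechanically for a candidate mapping h. The following sketch is a simplified illustration and not the matching algorithm of [4]: the data tree is encoded as a hypothetical `parent_of` map over node identifiers, the pattern as plain dictionaries and tuples, and F as a Python callable, all of which are assumptions.

```python
def is_descendant(parent_of, ancestor, node):
    """True if `ancestor` lies on the path from `node` up to the root."""
    current = parent_of.get(node)
    while current is not None:
        if current == ancestor:
            return True
        current = parent_of.get(current)
    return False


def is_pattern_match(h, groups, arcs, formula, parent_of):
    """Check a partial mapping h (pattern variable -> data node id)."""
    # Condition 1: h is defined on every node of group 0.
    if any(g == 0 and var not in h for var, g in groups.items()):
        return False
    # Condition 2: h preserves PC and AD arcs wherever both ends are mapped.
    for u, v, axis in arcs:
        if u in h and v in h:
            if axis == "PC" and parent_of.get(h[v]) != h[u]:
                return False
            if axis == "AD" and not is_descendant(parent_of, h[u], h[v]):
                return False
    # Condition 3: h satisfies the Boolean formula F.
    return bool(formula(h))


# Hypothetical data tree: 1=people, 2=person, 3=name, 4=age.
labels = {1: "people", 2: "person", 3: "name", 4: "age"}
parent_of = {2: 1, 3: 2, 4: 2}

# Hypothetical pattern: $p and $a in group 0, optional $n in group 1.
groups = {"$p": 0, "$a": 0, "$n": 1}
arcs = [("$p", "$a", "PC"), ("$p", "$n", "PC")]


def formula(h):
    return (labels[h["$p"]] == "person"
            and labels[h["$a"]] == "age"
            and ("$n" not in h or labels[h["$n"]] == "name"))


full = is_pattern_match({"$p": 2, "$a": 4, "$n": 3}, groups, arcs, formula, parent_of)
partial = is_pattern_match({"$p": 2, "$a": 4}, groups, arcs, formula, parent_of)
missing = is_pattern_match({"$p": 2, "$n": 3}, groups, arcs, formula, parent_of)
wrong = is_pattern_match({"$p": 1, "$a": 4}, groups, arcs, formula, parent_of)
print(full, partial, missing, wrong)
```

The second mapping leaves the optional node `$n` undefined and is still accepted, which is what distinguishes a partial mapping from a total one.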