Wrox Professional XML 2nd Edition May 2001 ISBN 1861005059 pdf


Professional XML 2nd Edition

Mark Birbeck

Jason Diamond

Jon Duckett

Oli Gauti Gudmundsson

Pete Kobak

Evan Lenz

Steven Livingstone

Daniel Marcus

Stephen Mohr

Nikola Ozu

Jon Pinnock

Keith Visco

Andrew Watt

Kevin Williams

Zoran Zaev

Wrox Press Ltd.




except in the case of brief quotations embodied in critical articles or reviews.

The author and publisher have made every effort in the preparation of this book to ensure the accuracy of the information. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, Wrox Press, nor its dealers or distributors will be held liable for any damages caused or alleged to be caused either directly or indirectly by this book.

Published by Wrox Press Ltd,

Arden House, 1102 Warwick Road, Acocks Green, Birmingham, B27 6BH, UK


Mark Birbeck

Mark Birbeck is Technical Director of Parliamentary Communications Ltd. where he has been responsible for the design and build of their political portal, ePolitix.com. He is also managing director of XML consultancy x-port.net Ltd., responsible for the publishing system behind spiked-online.com. Although involved in XML for a number of years, his special interests lie in metadata, and in particular the use of RDF. He particularly welcomes Wrox's initiative in trying to move these topics out of the shadows and into the mainstream.

Mark would particularly like to thank his long-suffering partner Jan for putting up with the constant smell of midnight oil being burned. He offers the consolation that at least he will already be up when their first child Louis demands attention during the small hours.

Jon Duckett

Jon has been using and writing about XML since 1998, when he co-authored and edited Wrox's first XML publication. Having spent the past 3 years working for Wrox in the Birmingham UK offices, Jon is currently working from Sydney, so that he can get a different view out of the window while he is working and supping on a nice cup of tea...

Oli Gauti Gudmundsson

Oli is working for SALT, acting as one of two Chief System Architects of the SALT Systems, and Development Director in New York. He is currently working on incorporating XML and XSL into SALT’s web authoring and content management systems. He has acted as an instructor in the Computer Science course (Java) at the University of Iceland, and Java is one of his greatest strengths (and pleasures!). As a hobby he is trying to finish his BS degree in Computer Engineering.

His nationality is Icelandic, but he is currently situated in New York with his girlfriend Edda. He can be reached at [email protected].

Pete Kobak

Pete Kobak built and programmed his first computer from a kit in 1978, which featured 256 bytes of RAM and a single LED output. After a fling as an electrical engineer for IBM, Pete gradually moved into software development to support mainframe manufacturing. He earned geek programmer status in the late '80s when he helped to improve Burroughs' Fortran compiler by introducing vectorization of DO loops. Justified by his desire to continue to pay his mortgage, Pete left Burroughs in 1991 to put lives in jeopardy by developing medical laboratory software in OS/2. In 1997, Pete somehow convinced The Vanguard Group to hire him to do Solaris web development, even though he could barely spell “Unix”. He has helped to add new features to their web site since then, specializing in secure web communication.

Pete's current interest is in web application security, trying to find the right techniques to enforce the strong security needed by a serious financial institution while meeting their need to rapidly extend business relationships. Pete is thankful to be able to introduce interesting web


Evan Lenz

Evan Lenz currently works as a software engineer for XYZFind Corp. in Seattle, WA. His primary area of expertise is in XSLT, and he enjoys exploring new ways to utilize this technology for various projects. His work at XYZFind includes everything from XSLT and Java development to writing user's manuals, to designing the XML query language used in XYZFind's XML database software. Wielding a professional music degree and a philosophy major, he hopes to someday bring his varying interests together into one grand, masterful scheme.

Thanks to my precious wife, Lisa, and my baby son, Samuel, for putting up with Daddy's long nights. And praise to my Lord and Savior, Jesus Christ, without whom none of this would be possible or meaningful.

Steven Livingstone

Steven Livingstone is an IT Architect with IBM Global Services in Winnipeg, Canada. He has contributed to numerous Wrox books and magazine articles, on subjects ranging from XML to E-Commerce. Steven’s current interests include E-Commerce, ebXML, .NET, and Enterprise Application Architectures.

Steven would like to thank everyone at Wrox, especially for the understanding as he emigrated from Scotland to Canada (and that could be another book itself ;-) Most importantly he wants to thank Loretito for putting up with him whilst writing – gracias mi tesoro.

Congratulations Celtic on winning the Treble :)

Daniel Marcus

Dr. Marcus has twenty years of experience in software architecture and design. He is co-founder, President, and Chief Operating Officer at Speechwise Technologies, an applications software company at the intersection of speech, wireless, and Internet technologies. Prior to starting Speechwise, he was Director of E-Business Consulting at Xpedior, leading the strategy, architecture, and deployment of e-business applications for Global 2000 and dot-com clients. Dr. Marcus has been a Visiting Scholar at Princeton's Institute for Advanced Study, a research scientist at the Lawrence Livermore National Laboratory, and is the author of over twenty papers in computational science. He is a Sun-Certified Java Technology Architect and holds a Ph.D. in Mechanical Engineering from the University of California, Berkeley.

Stephen Mohr

Stephen Mohr is a software systems architect with Omicron Consulting, Philadelphia, USA. He has more than ten years' experience working with a variety of platforms and component technologies. His research interests include distributed computing and artificial intelligence. Stephen holds BS and MS degrees in computer science from Rensselaer Polytechnic Institute.


of dirt road – out where the virtual community is closer than town, but only flows at 24kb/s, and still does not deliver pizza.

His current project involves bringing semantic databases, text searching, and multimedia components together with XML – on the road to Xanadu. Other recent work has included the usual web design consulting, some XML vocabularies, and an XML-based production and full-text indexing system for a publisher of medical reference books and databases.

In the early 90s, Nik designed and developed a hypertext database called Health Reference Center; followed by advanced versions of InfoTrac. Both of these were bibliographic and full-text databases, delivered as monthly multi-disc CD-ROM subscriptions. Given the large text databases involved, some involvement with SGML was unavoidable. His previous work has ranged from library systems on mainframes to embedded micro systems (telecom equipment, industrial robots, toys, arcade games, and videogame cartridges). In the early 70s, he was thrilled to learn programming using patch boards, punch cards, paper tape, and printouts (and Teletypes, too).

When not surfing the 'net, he surfs crowds, the Tetons, and the Pacific; climbs wherever there is rock; and tries to get more than a day's walk from the nearest road now and then. He enjoys these even more when accompanied by his teenage son, who's old enough now to appreciate the joy of mosh pits and sk8ing the Mission District after midnight.

To Noah: May we always think of the next (2^3 - 1) generations instead of just our own 2^0.

My thanks to the editors and illustrators at Wrox and my friend Deanna Bauder for their help with this project. Also, thanks and apologies to my family and friends who endured my disappearances into the WriterZone for days on end.

Jonathan Pinnock

Jonathan Pinnock started programming in Pal III assembler on his school's PDP 8/e, with a massive 4K of memory, back in the days before Moore's Law reached the statute books. These days he spends most of his time developing and extending the increasingly successful PlatformOne product set that his company, JPA, markets to the financial services community. He seems to spend the rest of his time writing for Wrox, although he occasionally surfaces to say hello to his long-suffering wife and two children. JPA’s home page is www.jpassoc.co.uk.

Keith Visco

Keith Visco currently works for Intalio, Inc., the leader in Business Process Management, as a manager and project leader for XML based technologies. Keith is the project leader for the open source data-binding framework, Castor. He has been actively working on open source projects since 1998, including the Mozilla project where he is the original author of Mozilla's XSLT processor (donated by his previous employer, The MITRE Corporation) and is the current XSLT module owner.


invaluable feedback on this chapter. I would also like to thank my family for their unconditional support and incessant input into all phases of my life. A special thanks to Cindy Iturbe, whose encouragement means so much to me and for teaching me that with a little patience and hard work all things are possible, no matter how distant things may seem.

Andrew Watt

Andrew Watt is an independent consultant who enjoys few things more than exploring the technologies others have yet to sample. Since he wrote his first programs in 6502 Assembler and BBC Basic in the mid 1980s, he has sampled Pascal, Prolog, and C++, among others. More recently he has focused on the power of web-relevant technologies, including Lotus Domino, Java and HTML. His current interest is in the various applications of the Extensible Markup Meta Language, XMML, sometimes imprecisely and misleadingly called XML. The present glimpse he has of the future of SVG, XSL-FO, XSLT, CSS, XLink, XPointer, etc when they actually work properly together is an exciting, if daunting, prospect. He has just begun to dabble with XQuery. Such serial dabbling, so he is told, is called “life-long learning”.

In his spare time he sometimes ponders the impact of Web technologies on real people. What will be the impact of a Semantic Web? How will those other than the knowledge-privileged fare?

To the God of Heaven who gives human beings the capacity to see, think and feel. To my father who taught me much about life.

My heartfelt thanks go to Gail, who first suggested getting into writing, and now suffers the consequences on a fairly regular basis, and to Mark and Rachel, who just suffer the consequences.

Kevin Williams

Kevin’s first experience with computers was at the age of 10 (in 1980) when he took a BASIC class at a local community college on their PDP-9, and by the time he was 12, he stayed up for four days straight hand-assembling 6502 code on his Atari 400. His professional career has been focused on Windows development – first client-server, then onto Internet work. He’s done a little bit of everything, from VB to Powerbuilder to Delphi to C/C++ to MASM to ISAPI, CGI, ASP, HTML, XML, and any other acronym you might care to name; but these days, he’s focusing on XML work. Kevin is a Senior System Architect for Equient, an information management company located in Northern Virginia. He may be reached for comment at [email protected].

Zoran Zaev

Zoran is a Sr. Web Solutions Architect with Hitachi Innovative Solutions, Corp. in the


to thank Ellen Manetti for her strong project management example, and Pete Johnson, founder of Virtualogic, Inc., for his vision inspiring influence. Finally, thanks to Beckie and Marsha from Wrox for their always-timely assistance and to Jan from "Images by Jan".


Introduction

eXtensible Markup Language (XML) has emerged as nothing less than a phenomenon in computing. It is a concept elegant in its simplicity, and it is driving dramatic changes in the way Internet applications are written. This second edition revises the first to keep pace with this fast-changing field, in which many technologies have been superseded and new ones have emerged.

What Does This Book Cover?

This book explains and demonstrates both the essential techniques for designing and using XML documents, and many of the related technologies that are important today. Almost everything in this book will be based around a specification provided by the World Wide Web Consortium (W3C). These specifications are at various levels of completion and some of the technologies are nascent, but we expect them to become very popular when their specifications are finalized because they are useful or essential. The wider XML community is increasingly jumping in and offering new XML-related ideas outside the control of the W3C, although the W3C is still central and important to the development of XML.

The focus of this book is on learning how to use XML as an enabling technology in real-world applications. It presents good design techniques, and shows how to interface XML-enabled applications with web applications. Whether your requirements are oriented toward data exchange or presentation, this book will cover all the relevant techniques in the XML community.

Most chapters contain a practical example (unless the technology is so new that there were no working implementations at the time of writing). As XML is a platform-neutral technology, the examples cover a variety of languages, parsers, and servers. All the techniques are relevant across all the platforms, so you can get valuable insight from the examples even if they are not implemented using your favorite platform.


Who Is This Book For?

This book is for experienced developers who already have some basic knowledge of XML and want to learn how to build effective applications using this exciting but simple technology. Web site developers can learn techniques, using XSLT stylesheets and other technologies, to take their sites to the next level of sophistication. Other developers can learn where and how XML fits into their existing systems and how they can use it to solve their application integration problems.

XML applications can be distributed and are usually web-oriented. This book focuses on this kind of application, so we expect the reader to have some awareness of multi-tier architecture, preferably from a web perspective. We do revisit the fundamentals of XML, in case some of them have been missed in your experience, and we cover the full specification thoroughly but fairly quickly.

A variety of programming languages are used in this book, and we do not expect you to be proficient in them all. The techniques taught here can be transferred to other programming languages. Because XML is cross-platform, Java features prominently, especially as it has a wealth of tools for manipulating XML. Other languages covered include JavaScript, VBScript, VB, C#, and Perl. We expect the reader to be proficient in a programming language, but it does not matter which one.

How is this Book Structured?

Although many authors have contributed towards this book, we have tied the chapters together under unifying themes. As you will read below, the book has effectively been split into six sections. A standard example using a toy company has been used in chapters where possible, so you can see how different technologies can explain, describe, or transform the same data in different ways.

A small number of the chapters, e.g. Chapter 23, rely heavily on a previous chapter, but this will be made clear. Most of the chapters will be relatively self-contained.

Learning Threads

XML is evolving into a large, wide-ranging field of related markup technologies. This growth is powering XML applications. With growth comes divergence. Different readers will come to this book with different expectations. XML is different things to different people.

Foundation

Chapter 1 introduces the XML world in general, discussing the technologies that are relevant today and may be relevant tomorrow, but with very little code. Chapters 2 (Basic XML Syntax) and 3 (Advanced XML Syntax) cover the fundamentals of XML 1.0. Chapter 2 gives you the basic syntax of an XML document, while Chapter 3 covers slightly more advanced issues like namespaces. These chapters form the irreducible minimum you need to understand XML and, depending on your experience, you may want to skip these introductory chapters. Chapter 4 teaches you about the Infoset, a standard way of describing XML, which provides an abstract representation for XML data.


In Chapter 8, we explain the XPath specification – a method of referring to specific fragments of XML that is relevant to and used by other XML technologies. These include XSLT, described in Chapter 9. Here we teach you how to transform your XML documents into anything else, based on certain stylesheet declarations. In Chapter 10, we show various linking technologies, such as XLink and XPointer and describe the XML Fragment Interchange specification.
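To give a flavor of what "referring to specific fragments of XML" means, here is a minimal sketch in Python. It relies on the standard library's ElementTree module, which supports only a limited subset of XPath, and the catalog document and its element names are hypothetical, loosely modeled on the toy company example used later in the book.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog document; the element and attribute names are
# assumptions made for this illustration only.
CATALOG = """
<Catalog>
  <Product id="mega"><Name>Mega Wonder Widget</Name><Price>33.99</Price></Product>
  <Product id="giga"><Name>Giga Wonder Widget</Name><Price>49.99</Price></Product>
</Catalog>
"""

root = ET.fromstring(CATALOG)

# A location path: every Name element that is a child of a Product element.
names = [name.text for name in root.findall("./Product/Name")]

# A predicate: the single Product whose id attribute equals "giga".
giga = root.find("./Product[@id='giga']")
giga_price = giga.find("Price").text

print(names)
print(giga_price)
```

The same path expressions are what an XSLT stylesheet would use to select the nodes it transforms, which is why XPath is covered before XSLT.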

These ten chapters are enough for you to learn about all of the immediately useful XML technologies – for those who just use XML. You may already have a lot of experience of XML and so some of these chapters will be re-treading over well-walked ground, but everybody should be able to learn something new, especially because XML Schema acquired Proposed Recommendation status, the penultimate stage of the W3C specifications, just two months before this book was printed. Although a wealth of XML techniques lie ahead, you will have a firm foundation upon which to build.

So the Foundation thread includes:

❑ Chapter 1: Introducing XML

❑ Chapter 2: Basic XML Syntax

❑ Chapter 3: Advanced XML Syntax

❑ Chapter 4: The XML Information Set

❑ Chapter 5: Validating XML: Schemas

❑ Chapter 6: Introducing XML Schema

❑ Chapter 7: XML Schema Alternatives

❑ Chapter 8: Navigating XML – XPath

❑ Chapter 9: Transforming XML

❑ Chapter 10: Fragments, XLink, and XPointer

XML Programming

XML is both machine and human readable and, not surprisingly, some standard APIs have been created to manipulate XML data. These APIs are implemented in JavaScript, Java, Visual Basic, C++, Perl, and many other languages. These provide a standard way of manipulating, and developing for, XML documents.

In Chapter 11, we consider the first of these APIs, the DOM, which emerged from the HTML world. It has been released as a specification by the W3C, and Level 2 of this specification has recently been released. XML data can be thought of as hierarchical and object-oriented, and the DOM provides methods and properties for retrieving and manipulating XML nodes. Chapter 12 discusses SAX, a lightweight alternative to the DOM. To work with the DOM, the entire document has to be read into memory; a SAX parser, by contrast, reads through the document and reports its contents to the application as a series of events, so the application keeps only the data it needs.
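The contrast between the two APIs can be sketched in a few lines of Python, using the standard library's xml.dom.minidom and xml.sax modules. This is only an illustration under our own assumptions: the document and the ProductHandler class are hypothetical, and the chapters themselves work with other parsers and languages.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document used only for this comparison.
DOC = ("<Catalog><Product>Mega Wonder Widget</Product>"
       "<Product>Giga Wonder Widget</Product></Catalog>")

# DOM: the whole document is parsed into an in-memory tree first,
# and nodes are then reached through methods and properties.
dom = minidom.parseString(DOC)
dom_names = [p.firstChild.data for p in dom.getElementsByTagName("Product")]


# SAX: the parser streams through the document and calls back on each
# event; the handler decides what, if anything, to keep.
class ProductHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []
        self._in_product = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "Product":
            self._in_product = True
            self._buffer = []

    def characters(self, content):
        # Character data may arrive in several chunks, so collect them.
        if self._in_product:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Product":
            self.names.append("".join(self._buffer))
            self._in_product = False


handler = ProductHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_names = handler.names

print(dom_names)
print(sax_names)
```

Both approaches recover the same product names. The DOM version is shorter, while the SAX version never holds more than one product's text at a time, which matters for very large documents.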


The Programming thread therefore includes:

❑ Chapter 11: The Document Object Model

❑ Chapter 12: SAX 2

❑ Chapter 13: Schema Based Programming

XML as Data

There are four chapters in this section, all targeted specifically at the storage, retrieval, and manipulation of data as it relates to XML. Chapter 14, Data Modeling, explains how to plan your project 'properly', modeling your XML on your data and building better applications because of it. Chapter 15 extends this concept by covering the binding of the data to XML (and vice versa). Chapter 16, Querying XML, covers a nascent technology known as XML Query, which aims to provide the power of SQL in an XML format. This short chapter teaches you how to use the technology as it stands at the time of writing.

The final chapter in this thread, Chapter 17, is a case study that describes how to relate your databases to your XML data, and so integrate your XML and RDBMS in the best way possible.

This means that the Data thread contains:

❑ Chapter 14: Data Modeling

❑ Chapter 15: Data Binding

❑ Chapter 16: Querying XML

❑ Chapter 17: Case Study: XML and Databases

Presentation of XML

Chapter 18 covers an XML technology called SVG – Scalable Vector Graphics. This XML technology, when coupled with an appropriate viewer (for example, Adobe SVG Viewer), allows quite detailed graphics files to be displayed and manipulated. In Chapter 19, we describe VoiceXML, an XML technology to allow voice recognition and processing on the Web. XML data can be converted to VoiceXML and using the appropriate technology, can be spoken and interacted with over a telephone.

Chapter 20 covers the final technology in this section, XSL-FO. This is an emerging technology that allows the layout of pages to be specified exactly, much in the same way as PDF does now. The main difference is, this is XML too and so can be manipulated using the same XML tools you may be used to. Also, XSL-FO can be converted to PDF if necessary for users without XSL-FO viewers.

In the Presentation thread, therefore, we cover:

❑ Chapter 18: Presenting XML Graphically

❑ Chapter 19: VoiceXML


XML as Metadata

In this thread, we discuss how XML can be used to represent metadata – that is, the meaning or semantics of data, rather than the data itself. In Chapter 21, we cover the setting up of an index of XML data. This chapter uses a Java indexing application, but the techniques are applicable to any indexing tool. Chapter 22 is where we really get to the meat of the topic, where we talk about RDF – a language to describe metadata. We cover the elements and syntax of this technology. In Chapter 23, we go over some practical examples of RDF technology, before describing RDDL – a method of bundling resources at the URL of a namespace, so that a RDDL-enabled application can learn what technology the namespace actually refers to, and can access its schemas and standard transforms.

In the Metadata thread, we cover:

❑ Chapter 21: Case Study: Generating a Site Index

❑ Chapter 22: RDF

❑ Chapter 23: RDF Code Samples and RDDL

XML used for B2B

The final section of this book describes what is quite possibly the most important use of XML – B2B and Web Services. In the past, the communication protocols for B2B (e.g. EDI) have been proprietary, and expensive – both in terms of cost, and processor power. Using XML vocabularies, an open and programmable model can be used for B2B transactions.

In Chapter 24, we describe Simple Object Access Protocol. SOAP was a mostly Microsoft initiative (although the W3C are developing the XML Protocol specification, which should be very similar to SOAP), which allows two applications to specify services using XML. We cover the intricacies of this protocol, so that you can use it to web-enable any service you would care to mention.

Chapter 25 covers Microsoft's BizTalk Server. This server can control all B2B transactions, using the open BizTalk framework. BizTalk is just one method of using SOAP to conduct business transactions, but it is Microsoft's and is very popular. In Chapter 26, we have a case study discussing E-Business integration using XML. There are a number of business standards for commerce, and this chapter explains how you can integrate all of the standards, without having to write code for every possible B2B transaction between competing standards.

We end in Chapter 27, with a discussion of the Web Services Description Language, which allows us to formalize other XML vocabularies by defining services that a SOAP or other client can connect to. WSDL describes each service and what it does. In addition, in this chapter, we cover UDDI (Universal Description, Discovery, and Integration), which is a way of automating the discovery of, and transactions with, various services. In many cases, it should not be necessary for a human to find a service, and using public registration services, UDDI makes this possible. Both of these technologies are nascent but their importance will grow as more and more businesses make use of them.

In summary, the B2B thread covers:

❑ Chapter 24: SOAP

❑ Chapter 25: B2B with Microsoft BizTalk Server

❑ Chapter 26: E-Business Integration


What You Need to Use this Book

The book assumes that you have some knowledge of HTML, of some procedural or object-oriented programming languages (e.g. Java, VB, C++), and some minimal XML knowledge. For some of the examples in this book, a Java Runtime Environment (http://java.sun.com/j2se/1.3/) will need to be installed on your system, and some other chapters require applications such as MS SQL Server, MS Index Server, and BizTalk.

The complete source for larger portions of code from the book is available for download from: http://www.wrox.com/. More details are given in the section of this Introduction called, "Support, Errata, and P2P".

Conventions

To help you get the most from the text and keep track of what's happening, we've used a number of conventions throughout the book.

For instance:

These boxes hold important, not-to-be forgotten information, which is directly relevant to the surrounding text.

While this style is used for asides to the current discussion.

As for styles in the text:

When we introduce them, we highlight important words

We show keyboard strokes like this: Ctrl-A

We show filenames, and code within the text like so: doGet()

Text on user interfaces is shown as: File | Save

URLs are shown in a similar font, as so: http://www.w3c.org/

We present code in two different ways. Code that is important, and testable is shown as so:

In our code examples, the code foreground style shows new, important, pertinent code

Code that is an aside, shows examples of what not to do, or has been seen before is shown as so:

Code background shows code that's less important in the present context, or has been seen before.


> java com.ibm.wsdl.Main -in Arithmetic.WSDL

>> Transforming WSDL to NASSL ..
>> Generating Schema to Java bindings ..
>> Generating serializers / deserializers ..
Interface 'wsdlns:ArithmeticSoapPort' not found.

Support, Errata, and P2P

The printing and selling of this book was just the start of our contact with you. If there are any problems whatsoever with the code or the explanation in this book, we welcome input from you. A mail to [email protected] should elicit a response within two to three days (depending on how busy the support team are).

In addition to this, we also publish any errata online, so that if you have a problem, you can check on the Wrox web site first to see if we have updated the text at all. First, pay a visit to www.wrox.com, then, click on the Books | By Title(Z-A), or Books | By ISBN link on the left hand side of the page. See below:


All of the code for this book can be downloaded from our site. It is included in a zip file, and all of the code samples in this book can be found within, referenced by chapter number.

In addition, at p2p.wrox.com, we have our free "Programmer to Programmer" discussion lists. There are a few relevant to this book, and any questions you post will be answered by either someone at Wrox, or someone else in the developer community. Navigate to http://p2p.wrox.com/xml, and subscribe to a discussion list from there. All lists are moderated and so no fluff or spam should be received in your Inbox.

Tell Us What You Think

We've worked hard to make this book as useful to you as possible, so we'd like to know what you think. We're always keen to know what it is you want and need to know.

We appreciate feedback on our efforts and take both criticism and praise on board in our future editorial efforts. If you've anything to say, let us know on:

[email protected]

Or via the feedback links on:


Introducing XML

In this chapter, we'll look at the origins of XML, the core technologies and specifications that are related to XML, and an overview of some current, and future applications of XML. The later sections of this introduction should also serve as something of a road map to the rest of the book.

Origins and Goals of XML

"XML", as we all know, is an acronym for Extensible Markup Language – but what is a markup language? What is the history of markup languages, what are the goals of XML, and how does it improve upon earlier markup?

Markup Languages

Ever since the invention of the printing press, writers have made notes on manuscripts to instruct the printers on matters such as typesetting and other production issues. These notes were called "markup". A collection of such notes that conform to a defined syntax and grammar can certainly be called a "language". Proofreaders use a hand-written symbolic markup language to communicate corrections to editors and printers. Even the modern use of punctuation is actually a form of markup that remains with the text to advise the reader how to interpret that text.

These early markup languages use a distinct appearance to differentiate markup from the text to which it refers. For example, proofreaders' marks consist of a combination of cursive handwriting and special symbols to distinguish markup from the typeset text. Punctuation consists of special symbols that cannot be confused with the alphabet and numbers that represent the textual content. These symbols are so necessary to understanding printed English that they were included in the ASCII character set, and so have become the foundation of modern programming language syntax.


The ASCII character set standard was the early basis for widespread data exchange between various hardware and software systems. Whatever the internal representation of characters, conversion to ASCII allowed these disparate systems to communicate with each other. In addition to text, ASCII also defined a set of symbols, the C0 control characters (using the hexadecimal values 00 to 1F), which were intended to be used to mark up the structure of data transmissions.
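As a small illustration of control characters acting as structural markup, the following Python sketch uses two of the C0 characters, the record separator (hex 1E) and the unit separator (hex 1F), to delimit fields and records. The product data is hypothetical and the layout is our own assumption, not a format described elsewhere in this book.

```python
# The C0 control characters occupy code points 0x00 through 0x1F.
c0_controls = [chr(code_point) for code_point in range(0x00, 0x20)]

# Two C0 characters intended for structuring data:
RECORD_SEPARATOR = "\x1e"  # RS
UNIT_SEPARATOR = "\x1f"    # US

# Hypothetical transmission: two records, each with a name and a price.
transmission = (
    "Mega Wonder Widget" + UNIT_SEPARATOR + "33.99"
    + RECORD_SEPARATOR
    + "Giga Wonder Widget" + UNIT_SEPARATOR + "49.99"
)

# The "markup" is invisible when printed, but it is enough to recover
# the structure of the data.
rows = [record.split(UNIT_SEPARATOR)
        for record in transmission.split(RECORD_SEPARATOR)]

print(len(c0_controls))
print(rows)
```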

Only a few of these symbols found widespread acceptance, and their use was often inconsistent. The most common example is the character(s) used to delimit the end of a line of text in a document. Teletype machines used the physical motion-based character pair CR-LF (carriage-return, line-feed). This was later used by both MS-DOS and MS-Windows; UNIX uses a single LF character; and the MacOS uses a single CR character. Due to conflicting and non-standard uses of C0 control characters, document interchange between different systems still often requires a translation step, since even a simple text file cannot be shared without conversion.
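The translation step for line endings is simple enough to sketch. The Python function below, whose name is our own invention, converts the three conventions just described to a single LF. XML 1.0 parsers perform an equivalent normalization on their input, so an application sees line feeds regardless of where the document was written.

```python
def normalize_newlines(text: str) -> str:
    """Convert CR-LF pairs and lone CR characters to a single LF."""
    # Replace the two-character pair first; otherwise each CR-LF
    # would turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


dos_text = "line one\r\nline two\r\n"   # MS-DOS and MS-Windows
mac_text = "line one\rline two\r"       # MacOS
unix_text = "line one\nline two\n"      # UNIX

print(normalize_newlines(dos_text) == unix_text)
print(normalize_newlines(mac_text) == unix_text)
```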

Various forms of delimiters have been used to define the boundaries of containers for content, special symbol glyphs, presentation style of the text, or other special features of a document. For example, the C and C++ programming languages use the braces {} to delimit units of data or code. A typesetting language, intended for manual human editing, might use strings that are more readable, like ".begin" and ".end".

Markup is a method of conveying metadata (information about another dataset).

XML is a relatively new markup language, but it is a subset of, and is based upon, a mature markup language called Standard Generalized Markup Language (SGML). The WWW's Hypertext Markup Language (HTML) is also based upon SGML; indeed, it is an application of SGML. There is a new version of HTML 4 that is called Extensible Hypertext Markup Language (XHTML), which is similarly an application of XML. All of these markup languages are for metadata, but SGML and XML may be further considered meta-languages, since they can be used to create other metadata languages. Just as HTML was expressed in SGML, XHTML and others will use XML.

SGML-based markup languages all use literal strings of characters, called tags, to delimit the major components of the metadata, called elements.

Tags represent object delimiters and other such markup, as opposed to its content (no matter whether it's simple text or text that is program code). Of course, there has often been conflict between different sets of tags and their interpretation. Without common delimiter vocabularies, or even common internal data formats, it has been very difficult to convert data from one format to another, or otherwise share data between applications and organizations.

For example, the following two markup excerpts (Chapter_01_01.html and Chapter_01_01.xml) show familiar HTML and similar XML elements with their delimiting tags:

<TITLE>Product Catalog (Toysco-only)</TITLE> </HEAD>

<BODY>

<H2>Product Descriptions</H2> <HR WIDTH=33% ALIGN=LEFT> <H3>Mega Wonder Widget</H3>

<P>The <EM>Mega Wonder Widget</EM> is a popular toy with a 20 oz. capacity. It costs only $12.95 to make, whilst selling for $33.99 (plus $3.95 S&H).<BR> <H3>Giga Wonder Widget</H3>

<P>The <EM>Giga Wonder Widget</EM> is even more popular, because of its larger 55 oz. capacity. It has a similar profit margin (costs $19.95, sells for $49.99).

... <HR>

<P><I>Updated:</I> 2001-04-01 <I>by Webmaster Will</I> </BODY>

</HTML>

This rather simplistic document uses the few structural tags that exist in HTML, such as <TITLE>, <H1>, <H2>, and <H3> for headers, and <P> for paragraphs. This structure is limited to a very basic presentation model of a document as a printed page. Other tags, such as <HR> and <EM>, are purely about the appearance of the data. Indeed, most HTML tags are now used to describe the presentation of data, interactive logic for user input and control, and external multimedia objects. These tags give us no idea what structured data (names, prices, etc.) might appear within the text, or where it might be in that text.

On the other hand, XML allows us to create a structural model of the data within the text. Presentation cues can be embedded as with HTML tags, but the best XML practice is to separate the data structure from presentation. An external style sheet can be used with XML to describe the data's presentation model. So, we might convert – and extend – the above HTML example into the following XML data file (Chapter_01_01.xml):

<?xml version="1.0" ?> <!DOCTYPE ProductCatalog [

<!ELEMENT ProductCatalog (HEAD?, BODY?) >

<!ELEMENT HEAD (TITLE, Updated, Author+, Security*) > <!ELEMENT BODY (H1, H2, (H3, Products)+ ) >

<!ELEMENT Products (Product+) >

]>

<TITLE>Product Catalog</TITLE> <Updated>2001-04-01</Updated> <Author>Webmaster Will</Author>

<Security>Toysco-only (TRADE SECRET)</Security> </HEAD>

<BODY>

<H1>Product Catalog</H1> <H2>Product Descriptions</H2> <Products>

The <Prodname>&MWW;</Prodname> is a popular toy with a <Capacity unit="oz.">20</Capacity> capacity. It costs only <Cost currency="USD">12.95</Cost> to make, whilst selling for <Price currency="USD">33.99</Price> (plus

The <Prodname>&GWW;</Prodname> is even more popular, because of its larger <Capacity unit="oz.">55</Capacity> capacity. It has a similar profit margin (costs <Cost currency="USD">19.95</Cost>, sells for <Price currency="USD">49.99</Price>).<BR/>

</Product> ...

</Products> </BODY> </ProductCatalog>

The XML document looks very similar to the HTML version, with comparable text content, and some equivalent tags (as XHTML). XML goes far beyond HTML by allowing the use of custom tags (like <Prodname> or <Capacity>) that preserve some structured data that is embedded within the text of the description. We can't do this in HTML, since its set of tags is more or less fixed, changing slowly as browser vendors embrace new features and markup. In contrast, anyone can add tags to their own XML data. The use of tags to describe data structure allows easy conversion of XML to an arbitrary DBMS format, or alternative presentations of the XML data such as in tabular form or via a voice synthesizer connected to a telephone.

We have also assumed that we will use a stylesheet to format the XML data for presentation. Therefore, we are able to omit certain labels from our text (such as the $ sign in prices, and the "oz." after the capacity value). We will then rely upon the formatting process to insert them in the output, as appropriate. In a similar fashion, we have put the document update information in the header (where it can be argued that it logically belongs). When we transform the data for output, this data can be displayed as a footer with various string literals interspersed. In this way, it can appear to be identical to the HTML version.
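
To make the idea of re-inserting omitted labels concrete, here is a small Python sketch that plays the role of the formatting process. It is not an XSL stylesheet; it simply assumes a hypothetical fragment modelled on our catalog (with the entity reference replaced by literal text), a made-up describe function, and an assumed table mapping currency codes to display symbols.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the chapter's catalog; the &MWW; entity
# reference is written out as literal text so that no DTD is needed.
PRODUCT_XML = """
<Product>
  <Prodname>Mega Wonder Widget</Prodname>
  <Capacity unit="oz.">20</Capacity>
  <Price currency="USD">33.99</Price>
</Product>
"""

# Assumed mapping from currency codes to display symbols.
CURRENCY_SYMBOLS = {"USD": "$"}


def describe(product: ET.Element) -> str:
    """Build a display string, re-inserting the labels that the data omits."""
    name = product.findtext("Prodname")
    capacity = product.find("Capacity")
    price = product.find("Price")
    # The "$" sign and the "oz." label come from attributes, not from the text.
    symbol = CURRENCY_SYMBOLS.get(price.get("currency"), "")
    return (f"{name}: {capacity.text} {capacity.get('unit')} capacity, "
            f"{symbol}{price.text}")


print(describe(ET.fromstring(PRODUCT_XML)))
```

Because the unit and currency live in attributes, a different formatter could present the same data with other labels, or in another language, without touching the data itself.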

SGML and Document Markup Languages

SGML is an acronym for Standard Generalized Markup Language, an older and much more complex markup language than XML. It has been codified as an international standard by the ISO (International Organization for Standardization) as ISO 8879 and WebSGML.

The ISO doesn't put very much of its standards information online, but they do maintain a website at http://www.iso.ch, and offer the paper version of ISO 8879 for sale at http://www.iso.ch/cate/d16387.html. General SGML information and links can be found at http://www.w3.org/MarkUp/SGML and http://xml.coverpages.org. WebSGML (ISO 8879:1986 TC2. Information technology – Document Description and Processing Languages) is described online at http://www.sgmlsource.com/8879rev/n0029.htm.

SGML has been widely used by the U.S. government and its contractors, large manufacturing companies, and publishers of technical information. Publishers often construct paper documents, such as books, reports, and reference manuals from SGML. Often, these SGML documents are then transformed into a presentation format such as PostScript, and sent to the typesetter and printer for output to paper. Technical specifications for manufacturing can also be exchanged via SGML documents. However, SGML's complexities and the high cost of its implementation have meant that most businesses and individuals have not been able to afford to embrace this powerful technology.

SGML History

In 1969, a person walked on the Moon for the first time. In the same year, Ed Mosher, Ray Lorie, and Charles F. Goldfarb of IBM Research invented the first modern markup language, Generalized Markup Language (GML). GML was a self-referential language for marking the structure of an arbitrary set of data, and was intended to be a meta-language – a language that could be used to describe other languages, their grammars and vocabularies. GML later became SGML. In 1986, SGML was adopted as an international data storage and exchange standard by the ISO. When Tim Berners-Lee developed HTML in the early 1990s, he made a point of maintaining HTML as an application of SGML.

With the major impact of the World Wide Web (WWW) upon commerce and communications, it could be argued that the quiet invention of GML was a more significant event in the history of technology than the high adventure of that first trip to another celestial body. GML led to SGML, the parent of both HTML and XML. The complexity of SGML and lack of content tagging in HTML led to the need for a new markup language for the WWW and beyond – XML.

Goals of XML

In 1996, the principal design organization for technologies related to the WWW, the World Wide Web Consortium (W3C), began the process of designing an extensible markup language that would combine the flexibility of SGML and the widespread acceptance of HTML. That language is XML.

The W3C home page is at http://www.w3.org, and its XML pages begin with an overview at http://www.w3.org/XML. Most technical documents can be found at http://www.w3.org/TR...

The current XML 1.0 Recommendation (which we'll abbreviate as XML 1.0 REC) can be found at http://www.w3.org/TR/REC-xml.

The W3C developed ten design goals for XML, to quote from the Recommendation:

The design goals for XML are:

1. XML shall be straightforwardly usable over the Internet.

2. XML shall support a wide variety of applications.

3. XML shall be compatible with SGML.

4. It shall be easy to write programs that process XML documents.

5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.

6. XML documents should be human-legible and reasonably clear.

7. The XML design should be prepared quickly.

8. The design of XML shall be formal and concise.

9. XML documents shall be easy to create.

10. Terseness in XML markup is of minimal importance.

Like all specifications intended to be standards, XML has been defined in a formal and concise manner, using a formal notation, Extended Backus-Naur Form (EBNF), which satisfies design goal 8. The other design goals have been met by several characteristics of XML 1.0 and its "normative" (pre-requisite) references to existing Internet standards. We can categorize these as:

❑ Extensibility and separation of semantics and presentation (an implicit goal)

❑ Simplicity (design goals 4, 5, 6, 7, and 10)

❑ Internationalization (1, 2, 6, and 9)

❑ Usable over the Internet (1 and 2)

❑ Interoperability with SGML (3)

We'll look at a few of these in slightly greater depth, and then show some additional resources for XML information, vocabularies, and software tools.

Extensibility

Separation of Semantics and Presentation

HTML is a markup language that describes data and its presentation. Despite the advent of external Cascading Style Sheets (CSS) to format HTML data, most web pages still use numerous presentation tags embedded within the data.

XML is all about the description of data, with nothing said about its presentation. HTML combines some rudimentary descriptive markup, plus a great deal of markup that describes the presentation of the data.

The XML specification not only describes the XML data format and grammar, it also specifies a two-tier client architecture for handling XML data. The first tier is the XML Processor (also known as the XML parser, which is the term we'll use in this book). The parser ensures that the presumed XML data is well-formed (has the correct structure and syntax), and may be used to check the validity of the user's data structure. The parser must comply with the XML specification, and pass the content and structure of the XML data to a second tier application (the XML Application) in a prescribed manner.

The XML parser can use a separate document, generically called a schema, to describe and validate that instance of XML data. One type of schema, called a Document Type Definition (DTD), is specified in the XML 1.0 REC. There are other forms of schemas in use and under development. We will look at several of these, including the W3C's XML Schema and XML-Data Reduced (XDR). XDR is currently used by Microsoft as its schema language; Microsoft intends to use XML Schema in the future, but had not yet done so at the time of writing.

The initial layer of XML processing is the XML parser, which can optionally use a DTD or schema to describe and validate the XML data.

As we mentioned earlier, the presentation of XML data is also defined in a separate document, the style sheet, which uses the Extensible Stylesheet Language (XSL). XSL is to XML, as CSS is to HTML.

XML can be transformed for presentation or just simple data conversion using an XSL Transformations (XSLT) tool, such as MSXML, Saxon, or XT. These tools can be used on server side XML data, transforming it to a combination of HTML and CSS, for display in existing web browsers. An XSL processor can also be implemented on the client side (such as MSXML in the IE5 browser), so that XML data is sent directly to the client, without requiring server side transformation.

The application layer of XML processing, such as a browser or editor, can use an XSL style sheet to describe the presentation of the XML data.

One stylesheet can ensure a uniform presentation style for multiple XML documents. Contrariwise, a single XML document might be presented in a variety of ways, using multiple style sheets. The application layer can choose to present the XML data as synthetic speech over a telephone or radio using VoiceXML, or reformat it to fit a PDA display screen (using WML, for example).

XML data can be used for computer-to-computer messages, as well as for human-readable documents.

This non-document use of XML is one of the most exciting applications of XML and its supporting tools and specifications. Just as XML might be used to present more sophisticated web pages to users and tagged data to search engines, XML may also serve in the underlying technical infrastructure. E-commerce applications may use XML to describe business rules and XML (as SOAP or XML-RPC) for distributed processing calls and messages. Financial transactions may be encoded in signed XML packets, and so on, right on down to the configuration and administration of the very computers that implement the world-wide e-commerce system.

Internationalization (I18N)

Although the WWW is already an international phenomenon, XML was designed for much better support of non-European languages and internationalization (also known as "i18n" or "I18N"). This is yet another shorthand notation, obviously from the minds of programmers, that is derived from the first and last letters, and the count of 18 letters between. XML is based upon several ISO standards, including the Universal Character Set (UCS), defined in the ISO/IEC 10646 character set standard (which is currently congruent with the somewhat better-known Unicode standard).
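
The arithmetic behind this shorthand is easy to check. The following Python sketch uses a hypothetical helper, numeronym, which is our own name and not part of any standard:

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of the letters between + last letter."""
    if len(word) < 3:
        return word  # nothing in between to count
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))  # 20 letters in total, so 18 in between
```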

The current Unicode 3.0 specification can be found at: http://www.unicode.org. ISO/IEC 10646 documentation can be ordered at http://www.iso.ch.

Like most aspects of XML, names have been extended beyond the old-fashioned Anglo-centric ASCII-only limitation to allow the use of almost any of the world's languages.

XML text and names may use any of the world's different alphabets, scripts, and writing systems as defined in the ISO/IEC 10646 and Unicode 3.0 standards.

The value of this design goal extends far beyond merely presenting text in different human languages. The XML metadata can also be described in the local vernacular, and style. XML is the basis of a truly international Internet, accessible to people all over the world, in their native language.

XML Works with the Internet

XML is based upon a simple text format. Even though this means Unicode text, not just simple ASCII text, it may be converted to the UTF-8 or ASCII encoding for reliable transmission over the oldest of Internet connections and hardware (Teletype, anyone?). This also eliminates some considerable issues related to the interpretation of binary data formats on different computer hardware and operating systems.
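
As a rough illustration of this point, the following Python sketch serializes the same element as UTF-8 and as plain ASCII using the standard xml.etree.ElementTree module. The Note element and its French text are invented for the example. In the ASCII form, the serializer writes the euro sign as a numeric character reference, so no information is lost.

```python
import xml.etree.ElementTree as ET

# Hypothetical element; U+20AC is the euro sign, which lies outside ASCII.
note = ET.Element("Note")
note.text = "Prix: 12,95 \u20ac"

utf8_bytes = ET.tostring(note, encoding="utf-8")      # euro sign stored as three bytes
ascii_bytes = ET.tostring(note, encoding="us-ascii")  # euro sign written as &#8364;

print(utf8_bytes)
print(ascii_bytes)

# Either form parses back to exactly the same Unicode text.
same = ET.fromstring(utf8_bytes).text == ET.fromstring(ascii_bytes).text == note.text
print(same)
```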

XML also uses existing Internet protocols, software, and specifications wherever possible, for easier data processing and transmission. These range from basic syntax, like Uniform Resource Identifiers (URIs), to directories of code numbers, like ISO Country Codes. We will look at the more important of these Internet specifications in some detail, in Chapter 3.

XML is a text format that is easily transmitted over the Internet and other communications links. XML works with basic WWW protocols, including HTTP or HTTPS.

Like HTML, XML is often transmitted using the WWW's Hypertext Transfer Protocol (HTTP). This means that XML can be handled easily by existing web server software, and pass through corporate network firewalls.

XHTML 1.0 is described at http://www.w3.org/MarkUp. The current HTTP specification can be found at http://www.w3.org/Protocols.

Although XML is not a direct replacement for HTML, future versions of HTML will be expressed in XML syntax as XHTML. XML enables enhanced web architecture by moving more of the burden of presentation from the server to the client's browser or other application. XML provides a syntax that can be used for almost any data, its descriptive metadata, and even the message protocols used to move the XML data between server and clients.

XML will enable an enhanced WWW architecture. XML can also be used as a universal data exchange and long-term storage format, with or without the Internet.

Improved searching is another benefit – instead of attempting to find a price buried within a lump of text, enclosed in HTML <P> tags, the price information can be found easily and reliably using explicitly tagged XML data. This same tagging will provide for vastly improved data exchange between a website and its users, between co-operating websites, and/or between software applications. XML will enable a much more powerful Web, and it will also empower most other computing applications.
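
A search over explicitly tagged prices might look like the following Python sketch. The catalog fragment, the function name products_under, and the 40.00 limit are illustrative assumptions rather than part of the chapter's sample files.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog fragment modelled on the chapter's example data.
CATALOG_XML = """
<Products>
  <Product>
    <Prodname>Mega Wonder Widget</Prodname>
    <Price currency="USD">33.99</Price>
  </Product>
  <Product>
    <Prodname>Giga Wonder Widget</Prodname>
    <Price currency="USD">49.99</Price>
  </Product>
</Products>
"""


def products_under(catalog: ET.Element, limit: float) -> list:
    """Return the names of products whose tagged price is below the limit."""
    names = []
    for product in catalog.iter("Product"):
        # The price is found by its tag, not by guessing where it sits in a sentence.
        price = float(product.findtext("Price"))
        if price < limit:
            names.append(product.findtext("Prodname"))
    return names


print(products_under(ET.fromstring(CATALOG_XML), 40.00))
```

Doing the same against the HTML version would mean scanning free text for something that looks like a selling price, and hoping that it is not the cost or the shipping fee.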

XML is Simplified SGML

A major design goal for XML was ease-of-use, so the XML design team was able to use SGML as an already working starting point, and focus upon simplifying SGML. Due to its many optional features, SGML was so complex that it was difficult to write generic parsers, whereas XML parsers are much simpler to write. XML is also intended to be easy to read and write by developers using simple and commonly available tools.

XML is constrained by design to be interoperable with SGML.

This design constraint allowed early adopters of XML to use SGML tools. However, it also means that there are some quirky constraints on XML data, declarations, and syntax necessary to maintain SGML compatibility. This is the downside of XML being a subset of SGML. At some point in the future, there may be a break between XML and SGML, but for some years to come, SGML-based XML 1.0 syntax is likely to be the norm.

Resources

The current XML 1.0 REC is at http://www.w3.org/TR/REC-xml. This is the Second Edition, and so there is a very useful color-coded version showing changes from the 1998 edition at http://www.w3.org/TR/2000/REC-xml-20001006-review.html.

The first edition is at http://www.w3.org/TR/1998/REC-xml-19980210, with Bray's annotated version available at http://www.xml.com/axml/axml.html.

There is an XML 1.0 FAQ (Frequently Asked Questions) website, maintained by Peter Flynn, et al. on behalf of the W3C's XML Special Interest Group.

The XML 1.0 FAQ can be found at http://www.ucc.ie/xml.

There are some other non-commercial resources that are very useful XML information sources, and serve as depositories for communally developed XML vocabularies, namespaces, DTDs, and schemas. There are also numerous e-mail lists devoted to various XML-related issues.

The Organization for the Advancement of Structured Information Standards (OASIS) is a non-profit, international consortium that is devoted to accelerating the adoption of product-independent vocabularies based upon public standards, including SGML, HTML, and XML. This organization is working with the United Nations to specify a modular electronic business framework based on XML (ebXML), and with various other organizations to produce other XML vocabularies.

OASIS hosts The XML Industry Portal for news and information of XML at XML.org, and The XML Cover Pages, one of the best websites for researching all aspects of XML, including current tools and vocabularies.

OASIS has also become the host for XML-DEV, the XML developers mailing list. This list is primarily for the developers of XML parsers and other tools, so the technical level is quite high and focused upon some of the more esoteric issues of XML syntax, grammar, schemas, and specifications. Any questions concerning XML may be posted to this list, but browse the archives first for pertinent threads (and a sense of the list's scope) before posting any questions – this is not the list for simple XML questions.

OASIS is at http://www.oasis-open.org. The XML Industry Portal is at http://www.xml.org, The XML Cover Pages are at http://www.oasis-open.org/cover/, and the XML-DEV home page and archives are at http://www.xml.org/xml-dev/index.shtml.

The XML-L and dev-xml e-mail lists are much better choices than XML-DEV for basic questions, and for developers of XML applications. Questions about XSL should be posted to the xsl-list e-mail list, rather than posting to any of the more generic XML lists. In addition, cross-posting between these lists is strongly discouraged.

The XML-L home page is at http://listserv.heanet.ie/xml-l.html, dev-xml is available at http://groups.yahoo.com/group/dev-xml, and xsl-list is at http://www.biglist.com/lists/xsl-list/. All of these sites provide subscription information and list archives.

The Various Stages of W3C Specifications

Before we delve deeper into XML and all of the specifications of its related technologies, it would be a good idea to explain what each level of the specifications actually means. More detail than is given here can be found at the W3C site, at http://www.w3.org/Consortium/Process-20010208/tr.html, but we give a quick overview to help you understand how near completion the various standards are.

Once the W3C wants to publish a standard, it moves through five stages before reaching its final form. They are detailed below, from the first appearance at Working Draft, until it reaches the final Recommendation status.

Every specification enters the W3C through a Note; it is then considered by a working group who will want to move it through the various stages so it can become a Recommendation. There are various processes that have to be performed and conditions to be satisfied before it can be moved up. A specification can be returned to an earlier stage at any time before it becomes a Recommendation, so its position in the different stages is no guarantee that it is any nearer completion.

Working Draft

At this stage, there is no guarantee as to the quality of the specification; it just means that a working group is working with the specification, redrafting it in association with external parties.

Last Call Working Draft

After a number of conditions have been met, the specification is put through to Last Call Working Draft. It generally remains at this stage for three weeks only. It can last longer if the "...technical report is complex or has significant external dependencies" and the length of this review period has to be specified at the start of this stage. During this stage, the working group must address all comments on the specification from all parties, including external agencies. If the Director is satisfied that all objections have been noted and all comments addressed, it may move up to Candidate or Proposed Recommendation status. Once it moves up from this stage, the technical report or specification, will change very little, unless it is rejected further up the process and sent back to Working Draft status.

Candidate Recommendation

At this stage, the comments made during the Last Call have to have been addressed, and the Working Group have to attempt to implement all features of the technical report. The technical report can be updated during this stage for clarity, and the period lasts as long as it takes to implement all the details.

Proposed Recommendation

For the specification to reach this level, a working implementation of the technical report has to exist. All issues raised during the previous implementation period have to be addressed, possibly by amending the technical report. During this stage, the working group should address all "...informed and relevant issues..." raised by the public or other Working Groups. The specification must remain at this stage for at least four weeks before moving on. It can either move up to Recommendation status or move back down to Candidate Recommendation, or Working Draft, status.

Recommendation

This is the final stage of the process. The Director must be satisfied that there is significant support for the technical report before progressing to this stage. The W3C should make every effort to maintain the Recommendation, updating any errata and assisting in the creation of test bed software.

We will now move on to summarize the rest of XML, starting with the XML core.

The XML Core

The core of XML and its key components and extensions are:

❑ XML 1.0 syntax, including Document Type Definitions (DTDs)

❑ Namespaces in XML

❑ XML Schema (or one of its alternatives or supplemental validation tools: XDR, SOX, RELAX, TREX, and The Schematron)

These basic specifications describe the syntax of XML 1.0 and provide a standard validation grammar (DTDs). The extensions support multiple and shared vocabularies (Namespaces), and more rigorous and powerful validation (XML Schema et al.). In conjunction with the XML parser (which is also defined in the XML 1.0 REC), these comprise the first tier of XML processing.

Technology without application is useless, so several important (second tier) applications of XML are also becoming part of XML-based systems. These related specifications provide some of the key features that are commonly required in XML applications.

❑ Describing XML data structure: The XML Information Set (XML Infoset) and XML Path Language (XPath)

❑ Navigating & Linking: XML Linking (XLink), XML Pointer Language (XPointer), XML Inclusions (XInclude), XML Fragment Interchange (XFI), and XML Query Language (XQuery).

❑ Transforming & Presenting: XSLT and XSL-FO (XSL Formatting Objects)

We will look at the core syntax first, and then we'll look at the practical applications of these technologies and some widely shared XML vocabularies.

XML 1.0 Syntax

As we've seen before, the basic syntax of XML is described in a W3C recommendation called Extensible Markup Language (XML) 1.0 (Second Edition). This recent revision (2000-10-06) is strictly a documentation update, including some clarifications and minor code changes. There are no fundamental changes to XML as described in the original XML 1.0 recommendation (1998-02-10).

Several facets of basic XML should be understood to fully appreciate and effectively use XML for markup.

Self-Describing Data

<?xml version="1.0" ?> <!DOCTYPE ProductCatalog [

]>

<H1>Product Catalog</H1> <H2>Product Descriptions</H2> <Products>

A stylesheet might be used to present some of this data in a Products table:

Name                 Cost    Price   Capacity   Description

Mega Wonder Widget   12.95   33.99   20         The Mega Wonder Widget is a popular toy with a 20 oz. capacity. It costs...

Giga Wonder Widget   19.95   49.99   55         The Giga Wonder Widget is...

...                                             ...

More significantly, a search engine or a user shopping agent (a "shopbot") can build a dynamic custom product list by extracting pertinent data from XML web pages and other XML data sources.

Use of Existing Standards

XML is built on top of existing Internet and other international standards. We've already mentioned the ISO and Unicode in conjunction with the definition of legal XML characters. Several other ISO standards are part of XML, including code numbers for countries and languages of the world. Numerous Internet Engineering Task Force (IETF) specifications, in the form of RFCs, also serve to define certain components of XML, such as URI/URL/URN references, and country or language codes.

Extensions to basic XML 1.0, such as XML Schemas, also use existing technical standards. For example, XML data types can be defined in terms of IEEE floating-point numbers or ISO date-time codes. The WWW is woven of myriad standards ranging from character encoding to resource references to protocols, and XML is a new thread in this web.

Well-Formed XML

All data objects (documents) that conform to the basic XML 1.0 syntax specification are known as well-formed XML data. Such documents can be used without a DTD or schema to describe their structure, and are known as DTD-less XML documents. These documents cannot rely upon external declarations, and attribute values will receive no special processing or default values.

A well-formed XML document contains one or more elements (delimited by start- and end-tags) that nest properly within each other. There is one element, the document element, which contains any and all other elements within the document. All elements form a simple hierarchical tree, and so the only direct element-to-element relationship is that of parent-child. Sibling relationships can often be inferred using data structures internal to the XML application, but these are neither direct, nor reliable (due to the possibility of elements being inserted between the common parent and one or more of its children). Document content can include other markup and/or character data.

Well-formed XML data conforms to the XML syntax specification, and includes no references to external resources (unless a DTD is provided). It is comprised of elements that form a hierarchical tree, with a single root node (the document element).
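
The hierarchical tree can be made visible with a short Python sketch that walks from the document element down through each parent-child link. The abbreviated document and the generator name element_paths are assumptions made for this illustration; the standard xml.etree.ElementTree module does the parsing.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical version of the catalog document.
DOC = (
    "<ProductCatalog>"
    "<HEAD><TITLE>Product Catalog</TITLE></HEAD>"
    "<BODY><H1>Product Catalog</H1></BODY>"
    "</ProductCatalog>"
)


def element_paths(element, parent_path=""):
    """Yield a slash-separated path for every element, parents before children."""
    path = f"{parent_path}/{element.tag}"
    yield path
    for child in element:  # iterating over an element visits its direct children
        yield from element_paths(child, path)


root = ET.fromstring(DOC)  # the single document element
for path in element_paths(root):
    print(path)
```

Every path starts at the same document element, which is what the single root node guarantees.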

The XML text example earlier in this chapter is also an example of well-formed XML data.

The existence of well-formed documents allows use of XML data without the burden of constructing and referencing an external description of that data. The term "well-formed" has a similar meaning in formal mathematical logic – an assertion is well-formed if it meets grammatical rules, without any concern as to whether the assertion is "true" or not.

Valid XML

Any XML data object is considered a valid XML document if it is well-formed, meets certain further validity constraints, and matches a grammar describing the document's content (the DTD). Like SGML, XML can provide such a description of document structure and its grammar in the form of a Document Type Definition (DTD). See Chapter 5 for more details about these.

Valid XML data is well-formed, and it conforms to additional validity constraints of the XML syntax specification and those defined in a DTD or schema.

Validation using a DTD ensures that element parent-child relationships are respected, that attributes have valid values, that all referenced entities have been properly defined, and that numerous other specific validity constraints are obeyed.

SGML Note: The SGML equivalent of a well-formed document is known as tag-valid. The SGML

Given its SGML origins, XML will continue to be used for complex documents and websites. However, much current interest in XML is as a basis for electronic commerce. This application of XML requires more complex and robust methods of organizing XML vocabularies, and a way to maintain strong data typing when moving data to and from modern object and RDBMS systems. Since the restrictions imposed upon DTDs make them inadequate for this task, the W3C is in the process of defining a more robust validation method using XML Schema. When an XML application uses these extensions, validation of the XML data using a DTD becomes unnecessary. It is likely that there will be something of a divide in the use of XML: the "traditional" XML documents will be validated using DTDs, and "non-document" applications of XML like SOAP and ebXML will be based upon XML Schema validation.

Parsers

In addition to specifying the syntax of XML, the W3C described some of the behavior (but not the specific implementation) of the lower tier of XML's client architecture (the XML parser). There are two types of parsers:

❑ Non-validating – the parser merely ensures that a data object is well-formed XML

❑ Validating – the parser uses a DTD (or other type of schema) to ensure the validity of a well-formed data object's form and content

Some parsers work as both types, with configuration switches that determine whether or not the document will be validated. We will discuss parsers in Chapters 2 and 5, "Basic XML Syntax" and "Validating XML: DTDs".

The behavior of XML parsers has been defined with the intent of easing the burden upon an application's handling of XML data. For example, the character sequences used to delimit the end of text records are often OS-specific. However, an XML application needn't be concerned with this, because the XML parser will normalize all standard text line delimiters to a single line-feed (hexadecimal 0A) character. Text macro strings (known as general entities) are expanded by the parser for the use of the application.
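The line-end normalization described above can be observed with a short experiment. This is only a sketch, assuming Python's standard xml.etree.ElementTree module as the parser; the <Note> element and the labels in the samples dictionary are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The same two-line note written with three different line-end conventions.
samples = {
    "LF (Unix)": "<Note>first\nsecond</Note>",
    "CR LF (DOS/Windows)": "<Note>first\r\nsecond</Note>",
    "CR (classic Mac OS)": "<Note>first\rsecond</Note>",
}

# Parse each sample and keep the character data the application would see.
normalized = {label: ET.fromstring(source).text for label, source in samples.items()}

for label, text in normalized.items():
    print(label, repr(text))
```

In each case the application receives the two words separated by a single line-feed character, regardless of the convention used in the source.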

Any XML parser that encounters a construct within the XML data that is not well-formed must report this error to the application as a "fatal" error. Fatal errors need not cause the parser to terminate – it may continue processing in an attempt to find other errors, but it must not continue to pass character data and/or XML structures to the application in a normal fashion. Similarly, a validating parser must report any validity errors encountered when processing the XML data.
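To illustrate how a fatal error might surface in an application, here is a small sketch using Python's standard xml.etree.ElementTree module. The helper name first_fatal_error and the broken sample are our own assumptions. This particular parser stops at the first error rather than searching for further ones, which is one of the behaviors permitted above.

```python
import xml.etree.ElementTree as ET

# Product is never closed, so the data is not well-formed.
broken = "<Products>\n  <Product>\n</Products>"

def first_fatal_error(text):
    """Return (line, column, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser detected the problem
        return line, column, str(err)
    return None

print(first_fatal_error(broken))
print(first_fatal_error("<Products/>"))
```

Note that no element tree is handed back for the broken data: the application receives the error report instead of partially parsed content.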

This approach to error-handling is the result of XML's design goal of compactness, and the intention that XML be used for much more than just document display. This rather brutal requirement for error handling was designed to keep XML parsers simple, by avoiding the proliferation of non-well-formed XML documents. Hopefully this will avoid the problems associated with HTML, where parsers have to do their best to handle the large quantities of badly written HTML that are available on the Web.

Let's continue looking at some of the other technologies that use and extend XML.

Description and Validation

DTDs

As we mentioned in the "Valid XML" section, and we will describe in Chapter 5, DTDs are an integral part of the XML 1.0 Recommendation, although they do not use XML syntax. Any validating parser will be able to validate XML data using a DTD. The DTD can also supply definitions for XML entities, which are a form of text macros that are expanded by the parser. We will discuss the various declarations that can be used in DTDs to define and validate XML data.
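Entity expansion can be seen even with a non-validating parser, provided the entity is declared in the document's internal DTD subset. The sketch below assumes Python's standard xml.etree.ElementTree module; the entity name publisher and its replacement text are hypothetical and chosen only for the illustration.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares one general entity, which is then
# referenced from the character data of the Author element.
document = """<?xml version="1.0"?>
<!DOCTYPE ProductCatalog [
  <!ENTITY publisher "Wrox Press Ltd">
]>
<ProductCatalog>
  <Author>&publisher;</Author>
</ProductCatalog>"""

root = ET.fromstring(document)
author_text = root.find("Author").text

print(author_text)
```

The application never sees the &publisher; reference itself, only the replacement text that the parser substituted for it.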

Our earlier XML data example uses the following simplistic data model:

ProductCatalog
    HEAD
        TITLE
        Updated
        Author (one or more)
        Security
    BODY
        H1
        H2
        Products (one or more)
            H3
            Product (any of)
                (text)
                Prodname
                Capacity
                Cost
                Price
                Shipfee

A partial set of the DTD declarations for an excerpt from this example might look something like this:

<!DOCTYPE ProductCatalog [
  <!ELEMENT ProductCatalog (HEAD?, BODY?) >
  <!ELEMENT BODY (H1, H2, (H3, Products)+ ) >
  <!ELEMENT Products (Product+) >

The above example describes some of the structure of our example <ProductCatalog> elements (using <!ELEMENT> declarations). Don't worry about the syntax (or missing declarations) in the above example; we will deal with these in detail in Chapter 5.
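One way to build intuition for these content models is to notice that they behave like regular expressions over the sequence of child element names. The following Python sketch is our own illustration, not a real validating parser: it handles only element content (no #PCDATA, EMPTY, ANY, or attribute declarations), and the function names content_model_to_regex and children_match are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def content_model_to_regex(model):
    """Translate a DTD element-content model into a compiled regular expression.

    Each element name becomes a token of the form "Name;" and the DTD
    operators ( ) | ? * + keep their regex meaning. The sequence operator ","
    is dropped, because regex concatenation already means "followed by".
    """
    compact = "".join(model.split())  # remove all whitespace
    tokenised = re.sub(
        r"[A-Za-z_][\w.\-]*",
        lambda match: "(?:" + re.escape(match.group(0)) + ";)",
        compact,
    )
    return re.compile(tokenised.replace(",", ""))

def children_match(element, model):
    """Check the child element names of element against a content model."""
    names = "".join(child.tag + ";" for child in element)
    return content_model_to_regex(model).fullmatch(names) is not None

BODY_MODEL = "(H1, H2, (H3, Products)+ )"

valid_body = ET.fromstring(
    "<BODY><H1/><H2/><H3/><Products/><H3/><Products/></BODY>"
)
invalid_body = ET.fromstring("<BODY><H1/><H3/><Products/></BODY>")  # H2 is missing

print(children_match(valid_body, BODY_MODEL))
print(children_match(invalid_body, BODY_MODEL))
```

The first <BODY> satisfies the model because (H3, Products)+ allows the pair to repeat; the second fails because the required H2 is absent. A real validating parser performs this kind of check for every element, along with many others.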

We can convert many existing SGML DTDs to XML DTDs. Though there are certain features of SGML that are not supported in XML, there are workarounds for most of these differences.

Schemas

We've touched on some limitations of DTDs as a validation schema. The W3C has created another form of schema called XML Schema. Unlike DTDs, this new form uses XML 1.0 syntax and can therefore be edited and processed with generic XML tools. A fragment of an XML Schema that is almost equivalent to the above DTD excerpt would be:

<?xml version="1.0" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" >
  <xsd:element name="ProductCatalog" >
    <xsd:complexType content="eltOnly" >
      <xsd:sequence>
        <xsd:element name="Products" >
          <xsd:complexType content="eltOnly" >
            <xsd:sequence>
              <xsd:element ref="H3" />
              <xsd:element ref="Product" maxOccurs="unbounded" />
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="Product" >
          <xsd:complexType content="mixed" >
            <xsd:choice>
              <xsd:element name="Prodname" type="text" />
              <xsd:element name="Capacity" type="decimal" minOccurs="0" maxOccurs="1" />
              <xsd:element name="Cost" type="decimal" />
              <xsd:element name="Price" type="decimal" maxOccurs="5" />
              <xsd:element name="Shipfee" type="decimal" minOccurs="0"