Elliotte Rusty Harold
IDG Books Worldwide, Inc. An International Data Group Company
Foster City, CA 94404
www.idgbooks.com (IDG Books Worldwide Web site)
Copyright © 1999 IDG Books Worldwide, Inc. All rights reserved. No part of this book, including interior design, cover design, and icons, may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording, or otherwise) without the prior written permission of the publisher.
ISBN: 0-7645-3236-7
Printed in the United States of America 10 9 8 7 6 5 4 3 2 1
1O/ QV/ QY/ ZZ/ FC
Distributed in the United States by IDG Books Worldwide, Inc.
Distributed by CDG Books Canada Inc. for Canada; by Transworld Publishers Limited in the United Kingdom; by IDG Norge Books for Norway; by IDG Sweden Books for Sweden; by IDG Books Australia Publishing Corporation Pty. Ltd. for Australia and New Zealand; by TransQuest Publishers Pte Ltd. for Singapore, Malaysia, Thailand, Indonesia, and Hong Kong; by Gotop Information Inc. for Taiwan; by ICG Muse, Inc. for Japan; by Norma Comunicaciones S.A. for Colombia; by Intersoft for South Africa; by Eyrolles for France; by International Thomson Publishing for Germany, Austria and Switzerland; by Distribuidora Cuspide for Argentina; by Livraria Cultura for Brazil; by Ediciones ZETA S.C.R. Ltda. for Peru; by WS Computer Publishing Corporation, Inc., for the Philippines; by Contemporanea de Ediciones for Venezuela; by Express Computer Distributors for the Caribbean and West Indies; by Micronesia Media Distributor, Inc. for Micronesia; by Grupo Editorial Norma S.A. for Guatemala; by Chips Computadoras S.A. de C.V. for Mexico; by Editorial Norma de Panama S.A. for Panama; by American Bookshops for Finland.
Authorized Sales Agent: Anthony Rudkin Associates for the Middle East and North Africa.
at 800-434-3422.
For information on where to purchase IDG Books Worldwide’s books outside the U.S., please contact our International Sales department at 317-596-5530 or fax 317-596-5692.
For consumer information on foreign language translations, please contact our Customer Service department at 800-434-3422, fax 317-596-5692, or e-mail [email protected].
For information on licensing foreign or domestic rights, please phone +1-650-655-3109.
For sales inquiries and special prices for bulk quantities, please contact our Sales department at 650-655-3200 or write to the address above.
For information on using IDG Books Worldwide’s books in the classroom or for ordering examination copies, please contact our Educational Sales department at 800-434-2086 or fax 317-596-5499.
For press review copies, author interviews, or other publicity information, please contact our Public Relations department at 650-655-3000 or fax 650-655-3299.
For authorization to photocopy items for corporate, personal, or educational use, please contact Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, or fax 978-750-4470.
Library of Congress Cataloging-in-Publication Data
Harold, Elliote Rusty.
XML bible / Elliote Rusty Harold. p. cm.
ISBN 0-7645-3236-7 (alk. paper)
1. XML (Document markup language) I. Title. QA76.76.H94H34 1999 99-31021
005.7’2--dc21 CIP
LIMIT OF LIABILITY/ DISCLAIMER OF WARRANTY: THE PUBLISHER AND AUTHOR HAVE USED THEIR BEST EFFORTS IN PREPARING THIS BOOK. THE PUBLISHER AND AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS BOOK AND SPECIFICALLY DISCLAIM ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. THERE ARE NO WARRANTIES WHICH EXTEND BEYOND THE DESCRIPTIONS CONTAINED IN THIS PARAGRAPH. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES REPRESENTATIVES OR WRITTEN SALES MATERIALS. THE ACCURACY AND COMPLETENESS OF THE INFORMATION PROVIDED HEREIN AND THE OPINIONS STATED HEREIN ARE NOT GUARANTEED OR WARRANTED TO PRODUCE ANY PARTICULAR RESULTS, AND THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY INDIVIDUAL. NEITHER THE PUBLISHER NOR AUTHOR SHALL BE LIABLE FOR ANY LOSS OF PROFIT OR ANY OTHER COMMERCIAL DAMAGES, INCLUDING BUT NOT LIMITED TO SPECIAL, INCIDENTAL, CONSEQUENTIAL, OR OTHER DAMAGES.
Trademarks: All brand names and product names used in this book are trade names, service marks, trademarks, or registered trademarks of their respective owners. IDG Books Worldwide is not associated with any product or vendor mentioned in this book.
IDG is the world’s leading IT media, research and exposition company. Founded in 1964, IDG had 1997 revenues of $2.05 billion and has more than 9,000 employees worldwide. IDG offers the widest range of media options that reach IT buyers in 75 countries representing 95% of worldwide IT spending. IDG’s diverse product and services portfolio spans six key areas including print publishing, online publishing, expositions and conferences, market research, education and training, and global marketing services. More than 90 million people read one or more of IDG’s 290 magazines and newspapers, including IDG’s leading global brands — Computerworld, PC World, Network World, Macworld and the Channel World family of publications. IDG Books Worldwide is one of the fastest-growing computer book publishers in the world, with more than 700 titles in 36 languages. The “...For Dummies®” series alone has more than 50 million copies in print. IDG offers online users the largest network of technology-specific Web sites around the world through IDG.net (http://www.idg.net), which comprises more than 225 targeted Web sites in 55 countries worldwide. International Data Corporation (IDC) is the world’s largest provider of information technology data, analysis and consulting, with research centers in over 41 countries and more than 400 research analysts worldwide. IDG World Expo is a leading producer of more than 168 globally branded conferences and expositions in 35 countries including E3 (Electronic Entertainment Expo), Macworld Expo, ComNet, Windows World Expo, ICE (Internet Commerce Expo), Agenda, DEMO, and Spotlight. IDG’s training subsidiary, ExecuTrain, is the world’s largest computer training company, with more than 230 locations worldwide and 785 training courses. IDG Marketing Services helps industry-leading IT companies build international brand recognition by developing global integrated marketing programs via IDG’s print, online and exposition products worldwide. 
Further information about the company can be found
at www.idg.com. 1/24/99
people read one or more IDG publications each month.
Launched in 1990, IDG Books Worldwide is today the #1 publisher of best-selling computer books in the United States. We are proud to have received eight awards from the Computer Press Association in recognition of editorial excellence and three from Computer Currents’ First Annual Readers’ Choice Awards. Our best-selling ...For Dummies® series has more than 50 million copies in print with translations in 31 languages. IDG Books Worldwide, through a joint venture with IDG’s Hi-Tech Beijing, became the first U.S. publisher to publish a computer book in the People’s Republic of China. In record time, IDG Books Worldwide has become the first choice for millions of readers around the world who want to learn how to better manage their businesses.
Our mission is simple: Every one of our books is designed to bring extra value and skill-building instructions to the reader. Our books are written by experts who understand and care about our readers. The knowledge base of our editorial staff comes from years of experience in publishing, education, and journalism — experience we use to produce books to carry us into the new millennium. In short, we care about books, so we attract the best people. We devote special attention to details such as audience, interior design, use of icons, and illustrations. And because we use an efficient process of authoring, editing, and desktop publishing our books electronically, we can spend more time ensuring superior content and less time on the technicalities of making books.
You can count on our commitment to deliver high-quality books at competitive prices on topics you want to read about. At IDG Books Worldwide, we continue in the IDG tradition of delivering quality for more than 30 years. You’ll find no better book on a subject than one from IDG Books Worldwide.
Development Editor
Terri Varveris
Contributing Writer
Heather Williamson
Technical Editor
Greg Guntle
Amanda Kaufman, Nicole LeClerc, Victoria Lee
Production
IDG Books Worldwide Production
Proofreading and Indexing
York Production Services
About the Author
Elliotte Rusty Harold is an internationally respected writer, programmer, and educator both on the Internet and off. He got his start by writing FAQ lists for the Macintosh newsgroups on Usenet, and has since branched out into books, Web sites, and newsletters. He lectures about Java and object-oriented programming at Polytechnic University in Brooklyn. His Cafe con Leche Web site at http://metalab.unc.edu/xml/ has become one of the most popular independent XML sites on the Internet.
Welcome to the XML Bible. After reading this book I hope you’ll agree with me that XML is the most exciting development on the Internet since Java, and that it makes Web site development easier, more productive, and more fun.
This book is your introduction to the exciting and fast-growing world of XML. In this book, you’ll learn how to write documents in XML and how to use style sheets to convert those documents into HTML so legacy browsers can read them. You’ll also learn how to use document type definitions (DTDs) to describe and validate documents. This will become increasingly important as more and more browsers like Mozilla and Internet Explorer 5.0 provide native support for XML.
About You the Reader
Unlike most other XML books on the market, the XML Bible covers XML not from the perspective of a software developer, but rather that of a Web-page author. I don’t spend a lot of time discussing BNF grammars or parsing element trees. Instead, I show you how you can use XML and existing tools today to more efficiently produce attractive, exciting, easy-to-use, easy-to-maintain Web sites that keep your readers coming back for more.
This book is aimed directly at Web-site developers. I assume you want to use XML to produce Web sites that are difficult to impossible to create with raw HTML. You’ll be amazed to discover that in conjunction with style sheets and a few free tools, XML enables you to do things that previously required either custom software costing hundreds to thousands of dollars per developer, or extensive knowledge of programming languages like Perl. None of the software in this book will cost you more than a few minutes of download time. None of the tricks require any programming.
What You Need to Know
To be more specific, in this book I assume that you can:
✦ Write a basic HTML page including links, images, and text using a text editor.
✦ Place that page on a Web server.
On the other hand, I do not assume that you:
✦ Know SGML. In fact, this preface is almost the only place in the entire book you’ll see the word SGML used. XML is supposed to be simpler and more widespread than SGML. It can’t be that if you have to learn SGML first.
✦ Are a programmer, whether of Java, Perl, C, or some other language. XML is a markup language, not a programming language. You don’t need to be a programmer to write XML documents.
What You’ll Learn
This book has one primary goal: to teach you to write XML documents for the Web. Fortunately, XML has a decidedly flat learning curve, much like HTML (and unlike SGML). As you learn a little you can do a little. As you learn a little more, you can do a little more. Thus the chapters in this book build steadily on each other. They are meant to be read in sequence. Along the way you’ll learn:
✦ How an XML document is created and delivered to readers.
✦ How semantic tagging makes XML documents easier to maintain and develop than their HTML equivalents.
✦ How to post XML documents on Web servers in a form everyone can read.
✦ How to make sure your XML is well-formed.
✦ How to use international characters like _ and _ in your documents.
✦ How to validate documents with DTDs.
✦ How to use entities to build large documents from smaller parts.
✦ How attributes describe data.
✦ How to work with non-XML data.
In the final section of this book, you’ll see several practical examples of XML being used for real-world applications including:
✦ Web Site Design
✦ Push
✦ Vector Graphics
✦ Genealogy
How the Book Is Organized
This book is divided into five parts and includes three appendixes:
I. Introducing XML
II. Document Type Definitions
III. Style Languages
IV. Supplemental Technologies
V. XML Applications
By the time you’re finished reading this book, you’ll be ready to use XML to create compelling Web pages. The five parts and the appendixes are described below.
Part I: Introducing XML
Part II: Document Type Definitions
Part II consists of Chapters 8 through 11, all of which focus on document type definitions (DTDs). An XML document may optionally contain a DTD that specifies which elements are and are not allowed in an XML document. The DTD specifies the exact context and structure of those elements. A validating parser can read a document and compare it to its DTD, and report any mistakes it finds. This enables document authors to make sure that their work meets any necessary criteria. In Part II, you’ll learn how to attach a DTD to a document, how to validate your documents against their DTDs, and how to write your own DTDs that solve your own problems. You’ll learn the syntax for declaring elements, attributes, entities, and notations. You’ll see how you can use entity declarations and entity references to build both a document and its DTD from multiple, independent pieces. This allows you to make long, hard-to-follow documents much simpler by separating them into related modules and components. And you’ll learn how to integrate other forms of data like raw text and GIF image files in your XML document.
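As a rough illustration of what attaching a DTD to a document can look like, the sketch below places a small DTD in the document’s internal subset. The element names GREETING, SALUTATION, and RECIPIENT are invented for this example and are not taken from the book’s listings.

```xml
<?xml version="1.0" standalone="yes"?>
<!-- Hypothetical example: the element names are illustrative only -->
<!DOCTYPE GREETING [
  <!-- A GREETING must contain one SALUTATION followed by one RECIPIENT -->
  <!ELEMENT GREETING (SALUTATION, RECIPIENT)>
  <!ELEMENT SALUTATION (#PCDATA)>
  <!ELEMENT RECIPIENT (#PCDATA)>
]>
<GREETING>
  <SALUTATION>Hello</SALUTATION>
  <RECIPIENT>XML</RECIPIENT>
</GREETING>
```

A validating parser should accept this document as written. If the RECIPIENT element were omitted, or placed before SALUTATION, the document would still be well-formed but would no longer match its DTD.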
Part III: Style Languages
Part III consists of Chapters 12 through 15. XML markup only specifies what’s in a document. Unlike HTML, it does not say anything about what that content should look like. Information about an XML document’s appearance when printed, viewed in a Web browser, or otherwise displayed is stored in a style sheet. Different style sheets can be used for the same document. You might, for instance, want to use a style sheet that specifies small fonts for printing, another one that uses larger fonts for on-screen use, and a third with absolutely humongous fonts to project the document on a wall at a seminar. You can change the appearance of an XML document by choosing a different style sheet without touching the document itself. Part III describes in detail the two style sheet languages in broadest use on the Web, Cascading Style Sheets (CSS) and the Extensible Style Language (XSL). CSS is a simple style-sheet language originally designed for use with HTML. CSS exists in two versions: CSS Level 1 and CSS Level 2. CSS Level 1 provides basic information about fonts, color, positioning, and text properties, and is reasonably well supported by current Web browsers for HTML and XML. CSS Level 2 is a more recent standard that adds support for aural style sheets, user interface styles, international and bi-directional text, and more. CSS is a relatively simple standard that applies fixed style rules to the contents of particular elements.
than the formatting language. Nonetheless, it is beginning to firm up, and is supported by Microsoft Internet Explorer 5.0 and some third-party formatting engines.
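The separation between a document and its style sheet can be sketched with the xml-stylesheet processing instruction. The file name screen.css and the GREETING element below are assumptions made up for this example.

```xml
<?xml version="1.0"?>
<!-- Hypothetical file name: change href to pick a different presentation -->
<?xml-stylesheet type="text/css" href="screen.css"?>
<GREETING>Hello XML</GREETING>
```

Pointing href at a different file, say a hypothetical print.css, would change how a supporting browser displays the document while leaving the GREETING element and its content untouched.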
Part IV: Supplemental Technologies
Part IV consists of Chapters 16 through 19. It introduces some XML-based languages and syntaxes that layer on top of basic XML. XLinks provide multi-directional hypertext links that are far more powerful than the simple HTML <A> tag. XPointers introduce a new syntax you can attach to the end of URLs to link not only to particular documents, but to particular parts of particular documents. Namespaces use prefixes and URLs to disambiguate conflicting XML markup languages. The Resource Description Framework (RDF) is an XML application used to embed meta-data in XML and HTML documents. Meta-data is information about a document, such as the author, date, and title of a work, rather than the work itself. All of these can be added to your own XML-based markup languages to extend their power and utility.
Part V: XML Applications
Part V, which consists of Chapters 20–23, shows you four practical uses of XML in different domains. XHTML is a reformulation of HTML 4.0 as valid XML. Microsoft’s Channel Definition Format (CDF) is an XML-based markup language for defining channels that can push updated Web site content to subscribers. The Vector Markup Language (VML) is an XML application for scalable graphics used by Microsoft Office 2000 and Internet Explorer 5.0. Finally, a completely new application is developed for genealogical data to show you not just how to use XML tags, but why and when to choose them.
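To make the namespace idea concrete, here is a minimal sketch in which two vocabularies each define a TITLE element. The CATALOG element, the book and person prefixes, and the example.com URLs are all hypothetical.

```xml
<?xml version="1.0"?>
<!-- Hypothetical vocabularies: both define an element named TITLE -->
<CATALOG xmlns:book="http://example.com/ns/book"
         xmlns:person="http://example.com/ns/person">
  <!-- The prefix shows which vocabulary each TITLE belongs to -->
  <book:TITLE>XML Bible</book:TITLE>
  <person:TITLE>Dr.</person:TITLE>
</CATALOG>
```

The prefixes are only shorthand. It is the URL bound to each prefix that distinguishes the two TITLE elements, so a different prefix bound to the same URL would identify the same vocabulary.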
Appendixes
This book has two appendixes that focus on the formal specifications for XML, as opposed to the more informal description of it used throughout the rest of the book. Appendix A provides detailed explanations of three individual parts of the XML 1.0 specification: the XML BNF grammar, the well-formedness constraints, and the validity constraints. Appendix B contains the official XML 1.0 specification published by the W3C. The book also has a third appendix, Appendix C, which describes the contents of the CD-ROM that accompanies this book.
What You Need
To make the best use of this book and XML, you need:
✦ A PC running Windows 95, Windows 98, or Windows NT
✦ Internet Explorer 5.0
Any system that can run Windows will suffice. In this book, I mostly assume you’re using Windows 95 or NT 4.0 or later. As a longtime Mac and Unix user, I somewhat regret this. Like Java, XML is supposed to be platform independent. Also like Java, the reality is somewhat short of the hype. Although XML code is pure text that can be written with any editor, many of the tools are currently only available on Windows.
However, although there aren’t many Unix or Macintosh native XML programs, there is an increasing number of XML programs written in Java. If you have a Java 1.1 or later virtual machine on your platform of choice, you should be able to make do. Even if you can’t load your XML documents directly into a Web browser, you can still convert them to HTML documents and view those. When Mozilla is released, it should provide the best XML browser yet across multiple platforms.
How to Use This Book
This book is designed to be read more or less cover to cover. Each chapter builds on the material in the previous chapters in a fairly predictable fashion. Of course, you’re always welcome to skim over material that’s already familiar to you. I also hope you’ll stop along the way to try out some of the examples and to write some XML documents of your own. It’s important to learn not just by reading, but also by doing. Before you get started, I’d like to make a couple of notes about grammatical conventions used in this book.
Unlike HTML, XML is case sensitive. <FATHER> is not the same as <Father> or <father>. The father element is not the same as the Father element or the FATHER element. Unfortunately, case-sensitive markup languages have an annoying habit of conflicting with standard English usage. On rare occasion this means that you may encounter sentences that don’t begin with a capital letter. More commonly, you’ll see capitalization used in the middle of a sentence where you wouldn’t normally expect it. Please don’t get too bothered by this. All XML and HTML code used in this book is placed in a monospaced font, so most of the time it will be obvious from the context what is meant.
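The case-sensitivity rule can be seen in a short sketch like the one below, where three element names differ only in capitalization. The enclosing FAMILY element is an assumption added so the example forms a complete document.

```xml
<?xml version="1.0"?>
<!-- Three distinct element names that differ only in case -->
<FAMILY>
  <FATHER>Start and end tags match exactly</FATHER>
  <Father>A different element from FATHER</Father>
  <father>A third, unrelated element name</father>
</FAMILY>
```

Because the names are distinct, a start tag must be closed by an end tag with identical capitalization. Opening an element with FATHER and closing it with father would leave the document not well-formed.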
What the Icons Mean
Throughout the book, I’ve used icons in the left margin to call your attention to points that are particularly important.
Note icons provide supplemental information about the subject at hand, but generally something that isn’t quite the main idea. Notes are often used to elaborate on a detailed technical point.
Tip icons indicate a more efficient way of doing something, or a technique that may not be obvious.
CD-ROM icons tell you that software discussed in the book is available on the companion CD-ROM. This icon also tells you if a longer example, discussed but not included in its entirety in the book, is on the CD-ROM.
Caution icons warn you of a common misconception or that a procedure doesn’t always work quite like it’s supposed to. The most common purpose of a Caution icon in this book is to point out the difference between what a specification says should happen, and what actually does.
The Cross Reference icon refers you to other chapters that have more to say about a particular subject.
About the Companion CD-ROM
The inside back cover of this book contains a CD-ROM that holds all numbered code listings that you’ll find in the text. It also includes many longer examples that couldn’t fit into this book. The CD-ROM also contains the complete text of various XML specifications in HTML. (Some of the specifications will be in other formats as well.) Finally, you will find an assortment of useful software for working with XML documents. Many (though not all) of these programs are written in Java, so they’ll run on any system with a reasonably compatible Java 1.1 or later virtual machine. Most of the programs that aren’t written in Java are designed for Windows 95, 98, and NT.
For a complete description of the CD-ROM contents, read Appendix C or load the file index.html into your Web browser. The files on the companion CD-ROM are not compressed, so you can access them directly from the CD.
Reach Out
The publisher and I want your feedback. After you have had a chance to use this book, please take a moment to complete the IDG Books Worldwide Registration Card (in the back of the book). Please be honest in your evaluation. If you thought a particular chapter didn’t tell you enough, let me know. Of course, I would prefer to receive comments like: “This is the best book I’ve ever read”, “Thanks to this book, my Web site won Cool Site of the Year”, or “When I was reading this book on the beach, I was besieged by models who thought I was super cool”, but I’ll take any comments I can get :-).
Feel free to send me specific questions regarding the material in this book. I’ll do my best to help you out and answer your questions, but I can’t guarantee a reply. The best way to reach me is by email:
Also, I invite you to visit my Cafe con Leche Web site at http://metalab.unc.edu/xml/, which contains a lot of XML-related material and is updated almost daily. Despite my persistent efforts to make this book perfect, some errors have doubtless slipped by. Even more certainly, some of the material discussed here will change over time. I’ll post any necessary updates and errata on my Web site at http://metalab.unc.edu/xml/books/bible/. Please let me know via email of any errors that you find that aren’t already listed.
The folks at IDG have all been great. The acquisitions editor, John Osborn, deserves special thanks for arranging the unusual scheduling this book required to hit the moving target XML presents. Terri Varveris shepherded this book through the development process. With poise and grace, she managed the constantly shifting outline and schedule that a book based on unstable specifications and software requires. Amy Eoff corrected many of my grammatical shortcomings. Susan Parini and Ritchie Durdin, the production coordinators, also deserve special thanks for managing the production of this book and for dealing with last-minute figure changes.
Steven Champeon brought his SGML experience to the book, and provided many insightful comments on the text. My brother Thomas Harold put his command of chemistry at my disposal when I was trying to grasp the Chemical Markup Language. Carroll Bellau provided me with parts of my family tree, which you’ll find in Chapter 17.
I also greatly appreciate all the comments, questions, and corrections sent in by readers of my previous book, XML: Extensible Markup Language. I hope that I’ve managed to address most of those comments in this book. They’ve definitely helped make XML Bible a better book. Particular thanks are due to Alan Esenther and Donald Lancon Jr. for their especially detailed comments.
WandaJane Phillips wrote the original version of Chapter 21 on CDF that is adapted here. Heather Williamson, in addition to performing yeoman-like service as technical editor, wrote Chapter 13, CSS Level 2, and parts of Chapters 18, 19, and 22. Her help was instrumental in helping me almost meet my deadline. (Blame for this almost rests on my shoulders, not theirs.) Also, I would like to thank Piroz Mohseni, who also served as a technical editor for this book.
Preface
Acknowledgments
Part I: Introducing XML
Chapter 1: An Eagle’s Eye View of XML
Chapter 2: An Introduction to XML Applications
Chapter 3: Your First XML Document
Chapter 4: Structuring Data
Chapter 5: Attributes, Empty Tags, and XSL
Chapter 6: Well-Formed XML Documents
Chapter 7: Foreign Languages and Non-Roman Text
Part II: Document Type Definitions
Chapter 8: Document Type Definitions and Validity
Chapter 9: Entities and External DTD Subsets
Chapter 10: Attribute Declarations in DTDs
Chapter 11: Embedding Non-XML Data
Part III: Style Languages
Chapter 12: Cascading Style Sheets Level 1
Chapter 13: Cascading Style Sheets Level 2
Chapter 14: XSL Transformations
Chapter 15: XSL Formatting Objects
Part IV: Supplemental Technologies
Chapter 16: XLinks
Chapter 17: XPointers
Chapter 18: Namespaces
Chapter 19: The Resource Description Framework
Part V: XML Applications
Chapter 20: Reading Document Type Definitions
Chapter 21: Pushing Web Sites with CDF
Chapter 22: The Vector Markup Language
Appendix A: XML Reference Material
Appendix B: The XML 1.0 Specification
Appendix C: What’s on the CD-ROM
Index
End-User License Agreement
Preface
Acknowledgments
Part I: Introducing XML
Chapter 1: An Eagle’s Eye View of XML
What Is XML?
XML Is a Meta-Markup Language
XML Describes Structure and Semantics, Not Formatting
Why Are Developers Excited about XML?
Design of Domain-Specific Markup Languages
Self-Describing Data
Interchange of Data Among Applications
Structured and Integrated Data
The Life of an XML Document
Editors
Parsers and Processors
Browsers and Other Tools
The Process Summarized
Related Technologies
Hypertext Markup Language
Cascading Style Sheets
Extensible Style Language
URLs and URIs
XLinks and XPointers
The Unicode Character Set
How the Technologies Fit Together
Chapter 2: An Introduction to XML Applications
What Is an XML Application?
Chemical Markup Language
Mathematical Markup Language
Channel Definition Format
Classic Literature
Synchronized Multimedia Integration Language
HTML+TIME
Open Software Description
Scalable Vector Graphics
Vector Markup Language
MusicML
Open Financial Exchange
Extensible Forms Description Language
Human Resources Markup Language
Resource Description Framework
XML for XML
XSL
XLL
DCD
Behind-the-Scene Uses of XML
Chapter 3: Your First XML Document
Hello XML
Creating a Simple XML Document
Saving the XML File
Loading the XML File into a Web Browser
Exploring the Simple XML Document
Assigning Meaning to XML Tags
Writing a Style Sheet for an XML Document
Attaching a Style Sheet to an XML Document
Chapter 4: Structuring Data
Examining the Data
Batters
Pitchers
Organization of the XML Data
XMLizing the Data
Starting the Document: XML Declaration and Root Element
XMLizing League, Division, and Team Data
XMLizing Player Data
XMLizing Player Statistics
Putting the XML Document Back Together Again
The Advantages of the XML Format
Preparing a Style Sheet for Document Display
Linking to a Style Sheet
Assigning Style Rules to the Root Element
Assigning Style Rules to Titles
Assigning Style Rules to Player and Statistics Elements
Summing Up
Chapter 5: Attributes, Empty Tags, and XSL
Attributes
Attributes versus Elements
Structured Meta-data
Meta-Meta-Data
What’s Your Meta-data Is Someone Else’s Data
Elements Are More Extensible
Empty Tags
XSL
XSL Style Sheet Templates
The Body of the Document
The Title
Leagues, Divisions, and Teams
Players
Separation of Pitchers and Batters
CSS or XSL?
Chapter 6: Well-Formed XML Documents
#1: The XML declaration must begin the document
#2: Use Both Start and End Tags in Non-Empty Tags
Chapter 7: Foreign Languages and Non-Roman Text
Non-Roman Scripts on the Web
Scripts, Character Sets, Fonts, and Glyphs
A Character Set for the Script
A Font for the Character Set
An Input Method for the Character Set
Operating System and Application Software
Legacy Character Sets
The ASCII Character Set
The ISO Character Sets
The MacRoman Character Set
The Windows ANSI Character Set
The Unicode Character Set
UTF-8
The Universal Character System
How to Write XML in Unicode
Inserting Characters in XML Files with Character References
Converting to and from Unicode
How to Write XML in Other Character Sets
Part II: Document Type Definitions ...189
Chapter 8: Document Type Definitions and Validity ...191
Document Type Definitions ...191
Document Type Declarations ...192
Validating Against a DTD ...195
Listing the Elements ...201
Element Declarations ...208
ANY ...209
#PCDATA ...209
Child Lists ...212
Sequences ...214
A DTD for Attribute-Based Baseball Statistics ...300
Declaring SEASON Attributes in the DTD ...301
Declaring LEAGUE and DIVISION Attributes in the DTD ...301
Declaring TEAM Attributes in the DTD ...302
Declaring PLAYER Attributes in the DTD ...302
The Complete DTD for the Baseball Statistics Example ...304
Chapter 11: Embedding Non-XML Data ...307
Notations ...307
Unparsed External Entities ...311
Declaring Unparsed Entities ...311
Embedding Unparsed Entities ...312
Embedding Multiple Unparsed Entities ...315
Processing Instructions ...315
Conditional Sections in DTDs ...319
Part III: Style Languages ...321
Transforming to Formatting Objects ...522
Using FOP ...524
Page Layout ...526
Master Pages ...526
Page Sequences ...529
Content ...535
Block-level Formatting Objects ...535
Inline Formatting Objects ...537
Table-formatting Objects ...538
Out-of-line Formatting Objects ...538
Rules ...539
XSL Formatting Properties ...548
Units and Data Types ...549
Informational Properties ...551
Paragraph Properties ...551
Character Properties ...554
Sentence Properties ...556
Area Properties ...559
Aural Properties ...565
Part IV: Supplemental Technologies ...569
Part V: XML Applications ...655
The Family Tree DTD ...848
Designing a Style Sheet for Family Trees ...855
Appendix A: XML Reference Material ...863
Appendix B: The XML 1.0 Specification ...921
Appendix C: What’s on the CD-ROM ...971
Index ...975
End-User License Agreement ...1021
An Eagle’s Eye View of XML
This first chapter introduces you to XML. It explains in general what XML is and how it is used. It shows you how the different pieces of the XML equation fit together, and how an XML document is created and delivered to readers.
What Is XML?
XML stands for Extensible Markup Language (often written as eXtensible Markup Language to justify the acronym). XML is a set of rules for defining semantic tags that break a document into parts and identify the different parts of the document. It is a meta-markup language that defines a syntax used to define other domain-specific, semantic, structured markup languages.
XML Is a Meta-Markup Language
The first thing you need to understand about XML is that it isn’t just another markup language like the Hypertext Markup Language (HTML) or troff. These languages define a fixed set of tags that describe a fixed number of elements. If the markup language you use doesn’t contain the tag you need, you’re out of luck. You can wait for the next version of the markup language, hoping that it includes the tag you need; but then you’re really at the mercy of what the vendor chooses to include.
XML, however, is a meta-markup language. It’s a language in which you make up the tags you need as you go along. These tags must be organized according to certain general principles, but they’re quite flexible in their meaning. For instance, if you’re working on genealogy and need to describe people, births, deaths, burial sites, families, marriages, divorces, and so on, you can create tags for each of these. You don’t have to force your data to fit into paragraphs, list items, strong emphasis, or other very general categories.
In This Chapter
✦ What is XML?
✦ Why are developers excited about XML?
✦ The life of an XML document
✦ Related technologies
The tags you create can be documented in a Document Type Definition (DTD). You’ll learn more about DTDs in Part II of this book. For now, think of a DTD as a vocabulary and a syntax for certain kinds of documents. For example, the MOL.DTD in Peter Murray-Rust’s Chemical Markup Language (CML) describes a vocabulary and a syntax for the molecular sciences: chemistry, crystallography, solid state physics, and the like. It includes tags for atoms, molecules, bonds, spectra, and so on. This DTD can be shared by many different people in the molecular sciences field. Other DTDs are available for other fields, and you can also create your own.
XML defines a meta syntax that domain-specific markup languages like MusicML, MathML, and CML must follow. If an application understands this meta syntax, it can automatically read every language built from it. A browser does not need to know in advance each and every tag that might be used by thousands of different markup languages. Instead it discovers the tags used by any given document as it reads the document or its DTD. The detailed instructions about how to display the content of these tags are provided in a separate style sheet that is attached to the document.
For example, consider Schrödinger’s equation:
\[ i\hbar \frac{\partial \psi(r,t)}{\partial t} = -\frac{\hbar^2}{2m} \frac{\partial^2 \psi(r,t)}{\partial x^2} + V(r)\,\psi(r,t) \]
Scientific papers are full of equations like this, but scientists have been waiting eight years for the browser vendors to support the tags needed to write even the most basic math. Musicians are in a similar bind, since Netscape Navigator and Internet Explorer don’t support sheet music.
XML means you don’t have to wait for browser vendors to catch up with what you want to do. You can invent the tags you need, when you need them, and tell the browsers how to display these tags.
XML Describes Structure and Semantics, Not Formatting
The second thing to understand about XML is that XML markup describes a document’s structure and meaning. It does not describe the formatting of the elements on the page. Formatting can be added to a document with a style sheet. The document itself only contains tags that say what is in the document, not what the document looks like.
By contrast, HTML encompasses formatting, structural, and semantic markup. <B> is a formatting tag that makes its content bold. <STRONG> is a semantic tag that means its contents are especially important. <TD> is a structural tag that indicates that the contents are a cell in a table. In fact, some tags can have all three kinds of meaning. An <H1> tag can simultaneously mean 20 point Helvetica bold, a level-1 heading, and the title of the page.
For example, in HTML a song might be described using a definition term, definition data, an unordered list, and list items. But none of these elements actually has anything to do with music. The HTML might look something like this:
<dt>Hot Cop
<dd> by Jacques Morali, Henri Belolo, and Victor Willis <ul>
In XML the same data might be marked up like this:
<SONG>
Instead of generic tags like <dt> and <li>, this listing uses meaningful tags like <SONG>, <TITLE>, <COMPOSER>, and <YEAR>. This has a number of advantages, including that it’s easier for a human to read the source code to determine what the author intended.
XML markup also makes it easier for automated robots to locate all of the songs in the document. In HTML, robots can’t tell more than that an element is a dt. They cannot determine whether that dt represents a song title, a definition, or just some designer’s favorite means of indenting text. In fact, a single document may well contain dt elements with all three meanings.
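To make the robot argument concrete, here is a minimal sketch in Python, using the standard library’s xml.etree.ElementTree module. The SONG, TITLE, and COMPOSER names come from the discussion above; the LIBRARY wrapper, the NOTE element, and the second song are invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: SONG, TITLE, and COMPOSER follow the text above;
# LIBRARY, NOTE, and the second song are assumptions made for this sketch.
DOCUMENT = """
<LIBRARY>
  <SONG>
    <TITLE>Hot Cop</TITLE>
    <COMPOSER>Jacques Morali</COMPOSER>
    <COMPOSER>Henri Belolo</COMPOSER>
    <COMPOSER>Victor Willis</COMPOSER>
  </SONG>
  <NOTE>Not a song, so the robot skips it.</NOTE>
  <SONG>
    <TITLE>Placeholder Tune</TITLE>
    <COMPOSER>A. N. Other</COMPOSER>
  </SONG>
</LIBRARY>
"""


def list_songs(xml_text):
    """Return a (title, [composers]) pair for every SONG element."""
    root = ET.fromstring(xml_text)
    songs = []
    # iter() walks the whole tree, so songs are found at any depth.
    for song in root.iter("SONG"):
        title = song.findtext("TITLE", default="").strip()
        composers = [c.text.strip() for c in song.findall("COMPOSER")]
        songs.append((title, composers))
    return songs


if __name__ == "__main__":
    for title, composers in list_songs(DOCUMENT):
        print(title, "-", ", ".join(composers))
```

The robot never has to guess what a tag means: it asks for SONG elements by name and ignores everything else. The same request against dt elements in HTML would return song titles, definitions, and indentation tricks mixed together.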
Why Are Developers Excited about XML?
XML makes many Web-development tasks easy that are extremely painful using only HTML, and it makes possible tasks that are impossible with HTML. Because XML is eXtensible, developers like it for many reasons. Which ones interest you most depends on your individual needs. But once you learn XML, you’re likely to discover that it’s the solution to more than one problem you’re already struggling with. This section investigates some of the generic uses of XML that excite developers. In Chapter 2, you’ll see some of the specific applications that have already been developed with XML.
Design of Domain-Specific Markup Languages
XML allows various professions (e.g., music, chemistry, math) to develop their own domain-specific markup languages. This allows individuals in the field to trade notes, data, and information without worrying about whether or not the person on the receiving end has the particular proprietary payware that was used to create the data. They can even send documents to people outside the profession with a reasonable confidence that the people who receive them will at least be able to view the documents.
Furthermore, the creation of markup languages for individual domains does not lead to bloatware or unnecessary complexity for those outside the profession. You may not be interested in electrical engineering diagrams, but electrical engineers are. You may not need to include sheet music in your Web pages, but composers do. XML lets the electrical engineers describe their circuits and the composers notate their scores, mostly without stepping on each other’s toes. Neither field will need special support from the browser manufacturers or complicated plug-ins, as is true today.
Self-Describing Data
At a higher level, XML is self-describing. Suppose you’re an information archaeologist in the 23rd century and you encounter this chunk of XML code on an old floppy disk that has survived the ravages of time:
<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH>
    <DATE>21 Feb 1834</DATE>
  </BIRTH>
  <DEATH>
    <DATE>9 Dec 1905</DATE>
  </DEATH>
</PERSON>
Even if you’re not familiar with XML, assuming you speak a reasonable facsimile of 20th century English, you’ve got a pretty good idea that this fragment describes a man named Judson McDaniel, who was born on February 21, 1834 and died on December 9, 1905. In fact, even with gaps in, or corruption of, the data, you could probably still extract most of this information. The same could not be said for some proprietary spreadsheet or word-processor format.
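A program can recover the same facts just as easily as the archaeologist can. The following sketch, again in Python with the standard xml.etree.ElementTree module, reads the PERSON fragment shown above. The function name and the dictionary keys are assumptions chosen for this example, not part of any standard.

```python
import xml.etree.ElementTree as ET

# The PERSON fragment from the text, with plain ASCII quotes around attributes.
FRAGMENT = """
<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH><DATE>21 Feb 1834</DATE></BIRTH>
  <DEATH><DATE>9 Dec 1905</DATE></DEATH>
</PERSON>
"""


def describe_person(xml_text):
    """Pull the identifying facts out of a PERSON element."""
    person = ET.fromstring(xml_text)
    given = person.findtext("NAME/GIVEN")
    surname = person.findtext("NAME/SURNAME")
    return {
        "id": person.get("ID"),        # attribute values
        "sex": person.get("SEX"),
        "name": "{} {}".format(given, surname),
        "born": person.findtext("BIRTH/DATE"),   # element content
        "died": person.findtext("DEATH/DATE"),
    }


if __name__ == "__main__":
    print(describe_person(FRAGMENT))
```

Nothing in the code depends on byte offsets or a private file layout. Every value is reached through the element and attribute names that appear in the document itself.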
Furthermore, XML is very well documented. The W3C’s XML 1.0 specification and numerous paper books like this one tell you exactly how to read XML data. There are no secrets waiting to trip up the unwary.
Interchange of Data Among Applications
Since XML is non-proprietary and easy to read and write, it’s an excellent format for the interchange of data among different applications. One such format under current development is the Open Financial Exchange Format (OFX). OFX is designed to let personal finance programs like Microsoft Money and Quicken trade data. The data can be sent back and forth between programs and exchanged with banks, brokerage houses, and the like.
OFX is discussed in Chapter 2.
As noted above, XML is a non-proprietary format, not encumbered by copyright, patent, trade secret, or any other sort of intellectual property restriction. It has been designed to be extremely powerful, while at the same time being easy for both human beings and computer programs to read and write. Thus it’s an obvious choice for exchange languages.
By using XML instead of a proprietary data format, you can use any tool that understands XML to work with your data. You can even use different tools for different purposes, one program to view and another to edit, for instance. XML keeps you from getting locked into a particular program simply because that’s what your data is already written in, or because that program’s proprietary format is all your correspondent can accept.
For example, many publishers require submissions in Microsoft Word. This means that most authors have to use Word, even if they would rather use WordPerfect or Nisus Writer. So it’s extremely difficult for any other company to publish a competing word processor unless they can read and write Word files. Since doing so requires a developer to reverse-engineer the undocumented Word file format, it’s a significant investment of limited time and resources. Most other word processors have a limited ability to read and write Word files, but they generally lose track of graphics, macros, styles, revision marks, and other important features. The problem is that Word’s document format is undocumented, proprietary, and constantly changing. Word tends to end up winning by default, even when writers would prefer to use other, simpler programs. If a common word-processing format were developed in XML, writers could use the program of their choice.
Structured and Integrated Data
XML is ideal for large and complex documents because the data is structured. It not only lets you specify a vocabulary that defines the elements in the document; it also lets you specify the relations between elements. For example, if you’re putting together a Web page of sales contacts, you can require that every contact have a phone number and an email address. If you’re inputting data for a database, you can make sure that no fields are missing. You can require that every book have an author. You can even provide default values to be used when no data is entered.
XML also provides a client-side include mechanism that integrates data from multiple sources and displays it as a single document. The data can even be rearranged on the fly. Parts of it can be shown or hidden depending on user actions. This is extremely useful when you’re working with large information repositories like relational databases.
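In XML itself, requirements like "every contact has a phone number and an email address" are declared in a DTD and enforced by a validating parser. Python’s standard library parser does not validate against DTDs, so the sketch below hand-codes the same rule simply to show what such a check amounts to. The CONTACTS, CONTACT, NAME, PHONE, and EMAIL element names and the sample data are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule taken from the sales-contact example in the text.
REQUIRED_CHILDREN = ("PHONE", "EMAIL")

# Hypothetical data: the second contact is deliberately missing an EMAIL.
CONTACTS = """
<CONTACTS>
  <CONTACT>
    <NAME>Complete Contact</NAME>
    <PHONE>555-0100</PHONE>
    <EMAIL>complete@example.com</EMAIL>
  </CONTACT>
  <CONTACT>
    <NAME>Incomplete Contact</NAME>
    <PHONE>555-0101</PHONE>
  </CONTACT>
</CONTACTS>
"""


def missing_fields(xml_text):
    """Return (contact name, missing child tag) pairs for every violation."""
    root = ET.fromstring(xml_text)
    problems = []
    for contact in root.iter("CONTACT"):
        name = contact.findtext("NAME", default="(unnamed)")
        for tag in REQUIRED_CHILDREN:
            if contact.find(tag) is None:
                problems.append((name, tag))
    return problems


if __name__ == "__main__":
    for name, tag in missing_fields(CONTACTS):
        print("{} is missing a required {} element".format(name, tag))
```

With a DTD, the rule lives in the document type rather than in each program, so every validating tool enforces it the same way. DTDs are covered in Part II.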
The Life of an XML Document
Like all data formats, XML needs programs and content before it’s useful. So it isn’t enough to understand only XML itself, which is little more than a specification for what data should look like. You also need to know how XML documents are edited, how processors read XML documents and pass the information they read on to applications, and what these applications do with that data.
Editors
XML documents are most commonly created with an editor. This may be a basic text editor like Notepad or vi that doesn’t really understand XML at all. On the other hand, it may be a completely WYSIWYG editor like Adobe FrameMaker that insulates you almost completely from the details of the underlying XML format. Or it may be a structured editor like JUMBO that displays XML documents as trees. For the most part, the fancy editors aren’t very useful yet, so this book concentrates on writing raw XML by hand in a text editor.
Other programs can also create XML documents. For example, later in this book, in the chapter on designing a new DTD, you’ll see some XML data that came straight out of a FileMaker database. In this case, the data was first entered into the FileMaker database. Then a FileMaker calculation field converted that data to XML. In general, XML works extremely well with databases.
Specifically, you’ll see this in Chapter 23, Designing a New XML Application.
In any case, the editor or other program creates an XML document. More often than not this document is an actual file on some computer’s hard disk, but it doesn’t absolutely have to be. For example, the document may be a record or a field in a database, or it may be a stream of bytes received from a network.
Parsers and Processors
An XML parser (also known as an XML processor) reads the document and verifies that the XML it contains is well formed. It may also check that the document is valid, though this test is not required. The exact details of these tests will be covered in Part II. But assuming the document passes the tests, the processor converts the document into a tree of elements.
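The two jobs described here, rejecting malformed input and building a tree from well-formed input, can be seen with any XML parser. The sketch below uses the non-validating parser in Python’s standard library, so it checks well-formedness only, not validity. The helper names and the two sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

GOOD = "<SONG><TITLE>Hot Cop</TITLE><COMPOSER>Jacques Morali</COMPOSER></SONG>"
BAD = "<SONG><TITLE>Hot Cop</SONG>"  # TITLE is never closed


def parse_or_report(xml_text):
    """Return (root, None) if the text is well formed, else (None, message)."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as err:
        return None, str(err)


def outline(element, depth=0):
    """Flatten the element tree into indented tag names, one per element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    for text in (GOOD, BAD):
        root, error = parse_or_report(text)
        if error is None:
            print("\n".join(outline(root)))
        else:
            print("Not well formed:", error)
```

A malformed document never reaches the application as a tree. The parser stops at the first error and reports it, which is the behavior the XML specification requires.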
Browsers and Other Tools
Finally the parser passes the tree or individual nodes of the tree to the end application. This application may be a browser like Mozilla or some other program that understands what to do with the data. If it’s a browser, the data will be displayed to the user. But other programs may also receive the data. For instance, the data might be interpreted as input to a database, a series of musical notes to play, or a Java program that should be launched. XML is extremely flexible and can be used for many different purposes.
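As one illustration of a non-browser end application, the following sketch feeds parsed elements into a database. It uses Python’s built-in sqlite3 module with an in-memory database. The song table, its columns, the LIBRARY wrapper, and the sample records are all assumptions invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input; only the first song's title and composer come from the text.
SONGS = """
<LIBRARY>
  <SONG><TITLE>Hot Cop</TITLE><COMPOSER>Jacques Morali</COMPOSER></SONG>
  <SONG><TITLE>Placeholder Tune</TITLE><COMPOSER>A. N. Other</COMPOSER></SONG>
</LIBRARY>
"""


def load_into_database(xml_text):
    """Parse the document and insert one row per SONG element."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE song (title TEXT, composer TEXT)")
    for song in ET.fromstring(xml_text).iter("SONG"):
        # For brevity, only the first COMPOSER of each song is stored.
        connection.execute(
            "INSERT INTO song (title, composer) VALUES (?, ?)",
            (song.findtext("TITLE"), song.findtext("COMPOSER")),
        )
    connection.commit()
    return connection


if __name__ == "__main__":
    db = load_into_database(SONGS)
    for row in db.execute("SELECT title, composer FROM song ORDER BY title"):
        print(row)
```

The document that a browser would display is here consumed as database input, without any change to the document itself.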
The Process Summarized
To summarize, an XML document is created in an editor. The XML parser reads the document and converts it into a tree of elements. The parser passes the tree to the browser that displays it. Figure 1-1 shows this process.
Figure 1-1: XML Document Life Cycle
It’s important to note that all of these pieces are independent and decoupled from each other. The only thing that connects them all is the XML document. You can change the editor program independently of the end application. In fact you may not always know what the end application is. It may be an end user reading your work, or it may be a database sucking in data, or it may even be something that hasn’t been invented yet. It may even be all of these. The document is independent of the programs that read it.
HTML is also somewhat independent of the programs that read and write it, but it’s really only suitable for browsing. Other uses, like database input, are outside its scope. For example, HTML does not provide a way to force an author to include certain required content, like requiring that every book have an ISBN number. In XML you can require this. You can even enforce the order in which particular elements appear (for example, that level-2 headers must always follow level-1 headers).
Related Technologies
XML doesn’t operate in a vacuum. Using XML as more than a data format requires interaction with a number of related technologies. These technologies include HTML for backward compatibility with legacy browsers, the CSS and XSL style-sheet languages, URLs and URIs, the XLL linking language, and the Unicode character set.
Hypertext Markup Language
using Netscape 1.1.) So you’re going to need to convert your XML content into classic HTML for some time to come.
Therefore, before you jump into XML, you should be completely comfortable with HTML. You don’t need to be an absolutely snazzy graphical designer, but you should know how to link from one page to the next, how to include an image in a document, how to make text bold, and so forth. Since HTML is the most common output format of XML, the more familiar you are with HTML, the easier it will be to create the effects you want.
On the other hand, if you’re accustomed to using tables or single-pixel GIFs to arrange objects on a page, or if you start to make a Web site by sketching out its appearance rather than its content, then you’re going to have to unlearn some bad habits. As previously discussed, XML separates the content of a document from the appearance of the document. The content is developed first; then a format is attached to that content with a style sheet. Separating content from style is an extremely effective technique that improves both the content and the appearance of the document. Among other things, it allows authors and designers to work more independently of each other. However, it does require a different way of thinking about the design of a Web site, and perhaps even the use of different project-management techniques when multiple people are involved.
Cascading Style Sheets
Since XML allows arbitrary tags to be included in a document, there isn’t any way for the browser to know in advance how each element should be displayed. When you send a document to a user you also need to send along a style sheet that tells the browser how to format individual elements. One kind of style sheet you can use is a Cascading Style Sheet (CSS).
CSS, initially designed for HTML, defines formatting properties like font size, font family, font weight, paragraph indentation, paragraph alignment, and other styles that can be applied to particular elements. For example, CSS allows HTML documents to specify that all H1 elements should be formatted in 32 point centered Helvetica bold. Individual styles that override the browser’s defaults can be applied to most HTML tags. Multiple style sheets can be applied to a single document, and multiple styles can be applied to a single element. The styles then cascade according to a particular set of rules.
CSS rules and properties are explored in more detail in Chapter 12, Cascading Style Sheets Level 1, and Chapter 13, Cascading Style Sheets Level 2.
It’s easy to apply CSS rules to XML documents. You simply change the names of the tags you’re applying the rules to. Mozilla 5.0 directly supports CSS style sheets combined with XML documents, though at present it crashes rather too frequently.
Extensible Style Language
The Extensible Style Language (XSL) is a more advanced style-sheet language specifically designed for use with XML documents. XSL documents are themselves well-formed XML documents.
XSL documents contain a series of rules that apply to particular patterns of XML elements. An XSL processor reads an XML document and compares what it sees to the patterns in a style sheet. When a pattern from the XSL style sheet is recognized in the XML document, the rule outputs some combination of text. Unlike cascading style sheets, this output text is somewhat arbitrary and is not limited to the input text plus formatting information.
CSS can only change the format of a particular element, and it can only do so on an element-wide basis. XSL style sheets, on the other hand, can rearrange and reorder elements. They can hide some elements and display others. Furthermore, they can choose the style to use not just based on the tag, but also on the contents and attributes of the tag, on the position of the tag in the document relative to other elements, and on a variety of other criteria.
CSS has the advantage of broader browser support. However, XSL is far more flexible and powerful, and better suited to XML documents. Furthermore, XML documents with XSL style sheets can be easily converted to HTML documents with CSS style sheets.
XSL style sheets will be explored in great detail in Chapter 14, XSL Transformations, and Chapter 15, XSL Formatting Objects.
URLs and URIs
XML documents can live on the Web, just like HTML and other documents. When they do, they are referred to by Uniform Resource Locators (URLs), just like HTML files. For example, at the URL http://www.hypermedic.com/style/xml/tempest.xml you’ll find the complete text of Shakespeare’s Tempest marked up in XML.
Although URLs are well understood and well supported, the XML specification uses the more general Uniform Resource Identifier (URI). URIs are a more general architecture for locating resources on the Internet that focuses a little more on the resource and a little less on the location. In theory, a URI can find the closest copy of a mirrored document or locate a document that has been moved from one site to another. In practice, URIs are still an area of active research, and the only kinds of URIs that are actually supported by current software are URLs.
XLinks and XPointers
As long as XML documents are posted on the Internet, you’re going to want to be able to address them and hot link between them. Standard HTML link tags can be used in XML documents, and HTML documents can link to XML documents. For example, this HTML link points to the aforementioned copy of the Tempest rendered in XML:
<a href="http://www.hypermedic.com/style/xml/tempest.xml">
The Tempest by Shakespeare
</a>
Whether the browser can display this document if you follow the link depends on just how well the browser handles XML files. Most current browsers don’t handle them very well.
However, XML lets you go further with XLinks for linking to documents and XPointers for addressing individual parts of a document.
XLinks enable any element to become a link, not just an A element. Furthermore, links can be bi-directional, multidirectional, or even point to multiple mirror sites from which the nearest is selected. XLinks use normal URLs to identify the site they’re linking to.
XLinks are discussed in Chapter 16, XLinks.
XPointers enable links to point not just to a particular document at a particular location, but to a particular part of a particular document. An XPointer can refer to a particular element of a document, to the first, the second, or the 17th such element, to the first element that’s a child of a given element, and so on. XPointers provide extremely powerful connections between documents that do not require the targeted document to contain additional markup just so its individual pieces can be linked to.
Furthermore, unlike HTML anchors, XPointers don’t just refer to a point in a document. They can point to ranges or spans. Thus an XPointer might be used to select a particular part of a document, perhaps so that it can be copied or loaded into a program.
XPointers are discussed in Chapter 17, XPointers.
The Unicode Character Set
The Web is international, yet most of the text you’ll find on it is in English. XML is starting to change that. XML provides full support for the two-byte Unicode character set, as well as its more compact representations. This character set supports almost every character commonly used in every modern script on Earth. Unfortunately, XML alone is not enough. To read a script you need three things:
1. A character set for the script
2. A font for the character set
3. An operating system and application software that understands the character set
If you want to write in the script as well as read it, you’ll also need an input method for the script. However, XML defines character references that allow you to use pure ASCII to encode characters not available in your native character set. This is sufficient for an occasional quote in Greek or Chinese, though you wouldn’t want to rely on it to write a novel in another language.
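A short sketch may help show how character references work in practice. It uses only Python’s standard library. The QUOTE element name and the helper function are assumptions made for the example, and the helper deliberately ignores the separate job of escaping markup characters such as & and <.

```python
import xml.etree.ElementTree as ET

# Three Greek letters (alpha, beta, gamma), written with Python escapes so
# that this source file itself stays pure ASCII.
GREEK = "\u03b1\u03b2\u03b3"


def to_ascii_xml(text):
    """Rewrite non-ASCII characters as numeric character references.

    Markup characters such as & and < are NOT escaped here.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    ascii_text = to_ascii_xml(GREEK)
    print(ascii_text)  # the references &#945; &#946; &#947; run together

    # Any XML parser turns the references back into the original characters.
    element = ET.fromstring("<QUOTE>" + ascii_text + "</QUOTE>")
    print(element.text == GREEK)
```

The document on disk contains nothing but ASCII, yet the parser hands the application the real Greek letters. Whether they can then be displayed still depends on the font and software requirements listed above.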
In Chapter 7, Foreign Languages and Non-Roman Text, you’ll explore how international text is represented in computers, how XML understands text, and how you can use the software you have to read and write in languages other than English.
How the Technologies Fit Together
XML defines a grammar for tags you can use to mark up a document. An XML document is marked up with XML tags. The default encoding for XML documents is Unicode.
Among other things, an XML document may contain hypertext links to other documents and resources. These links are created according to the XLink specification. XLinks identify the documents they’re linking to with URIs (in theory) or URLs (in practice). An XLink may further specify the individual part of a document it’s linking to. These parts are addressed via XPointers.
If an XML document is intended to be read by human beings (and not all XML documents are), then a style sheet provides instructions about how individual elements are formatted. The style sheet may be written in any of several style-sheet languages. CSS and XSL are the two most popular style-sheet languages, though there are others including DSSSL, the Document Style Semantics and Specification Language, on which XSL is based.
I’ve outlined a lot of exciting stuff in this chapter. However, honesty compels me to tell you that I haven’t discussed all of it yet. In fact, much of what I’ve described is the promise of XML rather than the current reality. XML has a lot of people in the software industry very excited, and a lot of programmers are working very hard to turn these dreams into reality. New software is released every day that brings us closer to XML nirvana, but this is all very new, and some of the software isn’t fully cooked yet. Throughout the rest of this book, I’ll be careful to point out not only what is supposed to happen, but what actually does happen. Depressingly, these are all too often not the same thing. Nonetheless, with a little caution you can do real work right now with XML.
Summary
In this chapter, you have learned some of the things that XML can do for you. In particular, you have learned:
✦ XML is a meta-markup language that enables the creation of markup languages for particular documents and domains.
✦ XML tags describe the structure and semantics of a document’s content, not the format of the content. The format is described in a separate style sheet.
✦ XML grew out of many users’ frustration with the complexity of SGML and the inadequacies of HTML.
✦ XML documents are created in an editor, read by a parser, and displayed by a browser.
✦ XML on the Web rests on the foundations provided by HTML, Cascading Style Sheets, and URLs.
✦ Numerous supporting technologies layer on top of XML, including XSL style sheets, XLinks, and XPointers. These let you do more than you can accomplish with just CSS and URLs.
✦ Be careful. XML isn’t completely finished. It will change and expand, and you will encounter bugs in current XML software.
In the next chapter, you’ll see a number of XML applications, and learn about some ways XML is being used in the real world today. Examples include vector graphics, music notation, mathematics, chemistry, human resources, Webcasting, and more.
✦ ✦ ✦
An Introduction to XML Applications
In this chapter we’ll be looking at some examples of XML applications, markup languages used to further refine XML, and behind-the-scene uses of XML. It is inspiring to look at some of the uses to which XML has already been put, even in this early stage of its development. This chapter will give you some idea of the wide applicability of XML. Many more XML applications are being created or ported from other formats as I write this.
Part V covers some of the XML applications discussed in this chapter in more detail.
What Is an XML Application?
XML is a meta-markup language for designing domain-specific markup languages. Each XML-based markup language is called an XML application. This is not an application that uses XML like the Mozilla Web browser, the Gnumeric spreadsheet, or the XML Pro editor, but rather an application of XML to a specific domain such as Chemical Markup Language (CML) for chemistry or GedML for genealogy.
Each XML application has its own syntax and vocabulary. This syntax and vocabulary adheres to the fundamental rules of XML. This is much like human languages, which each have their own vocabulary and grammar, while at the same time adhering to certain fundamental rules imposed by human anatomy and the structure of the brain.
In This Chapter
What is an XML application?
XML for XML
Behind-the-scene uses of XML
XML is an extremely flexible format for text-based data. The reason XML was chosen as the foundation for the wildly different applications discussed in this chapter (aside from the hype factor) is that XML provides a sensible, well-documented format that’s easy to read and write. By using this format for its data, a program can offload a great quantity of detailed processing to a few standard free tools and libraries. Furthermore, it’s easy for such a program to layer additional levels of syntax and semantics on top of the basic structure XML provides.
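To make the idea of offloading work to standard libraries concrete, here is a minimal sketch in Python that hands all of the parsing to the standard `xml.etree.ElementTree` module. The `ORDER` and `ITEM` elements and the `QUANTITY` attribute are invented for this illustration and do not belong to any real XML application.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element and attribute names are invented
# for this sketch and are not part of any published XML application.
DOCUMENT = """<?xml version="1.0"?>
<ORDER ID="17">
  <ITEM QUANTITY="2">Widget</ITEM>
  <ITEM QUANTITY="5">Gadget</ITEM>
</ORDER>"""


def total_quantity(xml_text):
    """Sum the QUANTITY attributes of every ITEM element."""
    # The library handles tokenizing, well-formedness checks, and tree building.
    root = ET.fromstring(xml_text)
    return sum(int(item.get("QUANTITY")) for item in root.iter("ITEM"))


print(total_quantity(DOCUMENT))
```

The program itself contains only the domain-specific logic (adding up quantities). Everything about reading tags, attributes, and character data is left to the parser.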
Chemical Markup Language
Peter Murray-Rust’s Chemical Markup Language (CML) may have been the first XML application. CML was originally developed as an SGML application, and gradually transitioned to XML as the XML standard developed. In its simplest form, CML is “HTML plus molecules”, but it has applications far beyond the limited confines of the Web.
Molecular documents often contain thousands of different, very detailed objects. For example, a single medium-sized organic molecule may contain hundreds of atoms, each with several bonds. CML seeks to organize these complex chemical objects in a straightforward manner that can be understood, displayed, and searched by a computer. CML can be used for molecular structures and sequences, spectrographic analysis, crystallography, publishing, chemical databases, and more. Its vocabulary includes molecules, atoms, bonds, crystals, formulas, sequences, symmetries, reactions, and other chemistry terms. For instance, Listing 2-1 is a basic CML document for water (H2O):
Listing 2-1: The water molecule H2O
<?xml version="1.0"?>
<CML>
  <MOL TITLE="Water">
    <ATOMS>
      <ARRAY BUILTIN="ELSYM">H O H</ARRAY>
    </ATOMS>
    <BONDS>
      <ARRAY BUILTIN="ATID1">1 2</ARRAY>
      <ARRAY BUILTIN="ATID2">2 3</ARRAY>
      <ARRAY BUILTIN="ORDER">1 1</ARRAY>
    </BONDS>
  </MOL>
</CML>
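As a rough sketch of how a program might search or process such a document, the Python code below reads the water molecule from Listing 2-1 with the standard `xml.etree.ElementTree` module and lists its bonds. It assumes that the `ATID1` and `ATID2` arrays hold 1-based positions into the `ELSYM` array. That reading is inferred from the listing itself, not taken from the CML specification, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The document from Listing 2-1, with straight quotes so that it parses.
WATER = """<?xml version="1.0"?>
<CML>
  <MOL TITLE="Water">
    <ATOMS>
      <ARRAY BUILTIN="ELSYM">H O H</ARRAY>
    </ATOMS>
    <BONDS>
      <ARRAY BUILTIN="ATID1">1 2</ARRAY>
      <ARRAY BUILTIN="ATID2">2 3</ARRAY>
      <ARRAY BUILTIN="ORDER">1 1</ARRAY>
    </BONDS>
  </MOL>
</CML>"""


def builtin_array(parent, name):
    """Return the whitespace-separated values of the ARRAY whose BUILTIN is name."""
    for array in parent.findall("ARRAY"):
        if array.get("BUILTIN") == name:
            return (array.text or "").split()
    return []


def describe_bonds(xml_text):
    """Return (element, element, order) tuples, one per bond in the molecule."""
    mol = ET.fromstring(xml_text).find("MOL")
    symbols = builtin_array(mol.find("ATOMS"), "ELSYM")
    bonds = mol.find("BONDS")
    first = builtin_array(bonds, "ATID1")
    second = builtin_array(bonds, "ATID2")
    orders = builtin_array(bonds, "ORDER")
    # Assumption: atom IDs are 1-based positions in the ELSYM array.
    return [
        (symbols[int(a) - 1], symbols[int(b) - 1], int(order))
        for a, b, order in zip(first, second, orders)
    ]


print(describe_bonds(WATER))
```

Under that assumption, the two bonds come out as hydrogen-oxygen and oxygen-hydrogen, each of order 1, which matches the familiar structure of water.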
traditional chemical software and documents like the Protein Data Bank (PDB) format or MDL Molfiles, is avoided.
Murray-Rust also created JUMBO, the first general-purpose XML browser. Figure 2-1 shows JUMBO displaying a CML file. JUMBO works by assigning each XML element to a Java class that knows how to render that element. To allow JUMBO to support new elements, you simply write Java classes for those elements. JUMBO is distributed with classes for displaying the basic set of CML elements including molecules, atoms, and bonds, and is available at http://www.xml-cml.org/.
Figure 2-1: The JUMBO browser displaying a CML file
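The element-to-class design described above can be sketched in a few lines. The following Python code is not JUMBO’s actual code (JUMBO is written in Java, and its class names are not shown here). It only illustrates the general pattern of looking up a renderer by element name, and every class and function name in it is hypothetical.

```python
import xml.etree.ElementTree as ET


class DefaultRenderer:
    """Fallback for elements that have no dedicated renderer."""

    def render(self, element):
        return f"<{element.tag}>"


class MolRenderer(DefaultRenderer):
    """Hypothetical renderer for MOL elements."""

    def render(self, element):
        return f"Molecule: {element.get('TITLE', 'untitled')}"


class ArrayRenderer(DefaultRenderer):
    """Hypothetical renderer for ARRAY elements."""

    def render(self, element):
        return f"{element.get('BUILTIN')}: {(element.text or '').strip()}"


# Supporting a new element means writing one class and registering it here.
RENDERERS = {"MOL": MolRenderer(), "ARRAY": ArrayRenderer()}


def render_tree(element, depth=0):
    """Render an element and its descendants as a list of indented lines."""
    renderer = RENDERERS.get(element.tag, DefaultRenderer())
    lines = ["  " * depth + renderer.render(element)]
    for child in element:
        lines.extend(render_tree(child, depth + 1))
    return lines


doc = ET.fromstring(
    '<CML><MOL TITLE="Water"><ATOMS>'
    '<ARRAY BUILTIN="ELSYM">H O H</ARRAY>'
    "</ATOMS></MOL></CML>"
)
print("\n".join(render_tree(doc)))
```

The appeal of this design is that the tree-walking code never changes. Elements without a registered class fall back to a generic display, so unknown markup is still shown rather than rejected.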
Mathematical Markup Language
Legend claims that Tim Berners-Lee invented the World Wide Web and HTML at CERN so that high-energy physicists could exchange papers and preprints. Personally I’ve never believed that. I grew up in physics; and while I’ve wandered back and forth between physics, applied math, astronomy, and computer science over the years, one thing the papers in all of these disciplines had in common was lots and lots of equations. Until now, nine years after the Web was invented, there hasn’t been any good way to include equations in Web pages.