Perl & XML

Chapter 2. An XML Recap

Contents:

A Brief History of XML
Markup, Elements, and Structure
Namespaces
Spacing
Entities
Unicode, Character Sets, and Encodings
The XML Declaration
Processing Instructions and Other Markup
Free-Form XML and Well-Formed Documents
Declaring Elements and Attributes
Schemas
Transformations

XML is a revolutionary (and evolutionary) markup language. It combines the generalized markup power of SGML with the simplicity of free-form markup and well-formedness rules. Its unambiguous structure and predictable syntax make it a very easy and attractive format to process with computer programs.

With XML, you are free to design your own markup language that best fits your data. You can select element names that make sense to you, rather than use tags that are overloaded and presentation-heavy. If you like, you can formalize the language by using element and attribute declarations in a document type definition (DTD).

XML has syntactic shortcuts such as entities, comments, processing instructions, and CDATA sections. It allows you to group elements and attributes by namespace to further organize the vocabulary of your documents. The xml:space attribute lets you regulate whitespace, sometimes a tricky issue in markup in which human readability is as important as correct formatting.

Some very useful technologies are available to help you maintain and mutate your documents. Schemas, like DTDs, can measure the validity of XML as compared to a canonical model. Schemas go even further by enforcing patterns in character data and improving content model syntax. XSLT is a rich language for transforming documents into different forms. It can be an easier way to work with XML than writing a program, though not always.

This chapter gives a quick recap of XML, where it came from, how it's structured, and how to work with it. If you choose to skip this chapter (because you already know XML or because you're impatient to start writing code), that's fine; just remember that it's here if you need it.

2.1. A Brief History of XML

Early text processing was closely tied to the machines that displayed it. Sophisticated formatting was tied to a particular device -- or rather, a class of devices called printers.

Take troff, for example. Troff was a very popular text formatting language included in most Unix distributions. It was revolutionary because it allowed high-quality formatting without a typesetting machine.

Troff mixes formatting instructions with data. The instructions are symbols composed of characters, with a special syntax so a troff interpreter can tell the two apart. For example, the symbol \fI changes the current font style to italic. Without the backslash character, it would be treated as data. This mixture of instructions and data is called markup.

Troff can be even more detailed than that. The instruction .vs 18p sets the vertical spacing between lines to 18 points, starting wherever the instruction appears in the document. Beyond aesthetics, we can't tell just by looking at it what purpose this spacing serves; it gives a very specific instruction to the processor that can't be interpreted in any other way. This instruction is fine if you only want to prepare a document for printing in a specific style. If you want to make changes, though, it can be quite painful.

Suppose you've marked up a book in troff so that every newly defined term is in boldface. Your document has thousands of bold font instructions in it. You're happy and ready to send it to the printer when suddenly, you get a call from the design department. They tell you that the design has changed and they now want the new terms to be formatted as italic. Now you have a problem. You have to turn every bold instruction for a new term into an italic instruction.

Your first thought is to open the document in your editor and do a search-and-replace maneuver. But, to your horror, you realize that new terms aren't the only places where you used bold font instructions. You also used them for emphasis and for proper nouns, meaning that a global replace would also mangle these instances, which you definitely don't want. You can change the right instructions only by going through them one at a time, which could take hours, if not days.

No matter how smart you make a formatting language like troff, it still has the same problem: it's inherently presentational. A presentational markup language describes content in terms of how to format it. Troff specifies details about fonts and spacing, but it never tells you what something is. Using troff makes the document less useful in some ways. It's hard to search through troff and come back with the last paragraph of the third section of a book, for example. The presentational markup gets in the way of any task other than its specific purpose: to format the document for printing.

We can characterize troff, then, as a destination format. It's not good for anything but a specific end purpose. What other kind of format could there be? Is there an "origin" format -- that is, something that doesn't dictate any particular formatting but still packages the data in a useful way? People began to ask this key question in the late 1960s when they devised the concept of generic coding: marking up content in a presentation-agnostic way, using descriptive tags rather than formatting instructions.
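To make the contrast concrete, here is a minimal sketch in Python of what generic coding buys you in the bold-versus-italic scenario above. The element names (term, emphasis, name), the sample sentence, and the render function are all invented for illustration; \fB, \fI, and \fR are the troff escapes for bold, italic, and roman. The point is only that when tags say what something is, the design department's change touches one entry in a style table rather than thousands of instructions.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup: the tags say what the text is,
# not how it should look.
SOURCE = (
    "<para>A <term>parser</term> reads text. "
    "<emphasis>Never</emphasis> skip it, says <name>Rita</name>.</para>"
)

def render(xml_text, styles):
    """Emit troff-style font escapes chosen by element name."""
    root = ET.fromstring(xml_text)
    out = [root.text or ""]
    for child in root:
        font = styles.get(child.tag, "R")  # unknown tags stay roman
        out.append("\\f%s%s\\fR" % (font, child.text or ""))
        out.append(child.tail or "")
    return "".join(out)

# Original design: new terms, emphasis, and proper nouns are all bold.
bold_terms = {"term": "B", "emphasis": "B", "name": "B"}

# The design change: only new terms become italic. One entry changes.
italic_terms = dict(bold_terms, term="I")

print(render(SOURCE, bold_terms))
print(render(SOURCE, italic_terms))
```

In the second line of output, only the new term switches to \fI; the emphasis and the proper noun keep \fB, which is exactly what a global search-and-replace on the troff source could not guarantee.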

The Graphic Communications Association (GCA) started a project called GenCode to explore this new area, developing ways to encode documents in generic tags and assemble documents from multiple pieces -- a precursor to hypertext. IBM's Generalized Markup Language (GML), developed by Charles Goldfarb, Edward Mosher, and Raymond Lorie, built on this concept.[3] As a result of this work, IBM could edit, view on a terminal, print, and search through the same source material using different programs. You can imagine that this benefit would be important for a company that churned out millions of pages of documentation per year.

[3]Cute fact: the initials of these researchers also spell out "GML."

Goldfarb went on to lead a standards team at the American National Standards Institute (ANSI) to make the power of GML available to the world. Building on the GML and GenCode projects, the committee produced the Standard Generalized Markup Language (SGML). Quickly adopted by the U.S. Department of Defense and the Internal Revenue Service, SGML proved to be a big success. It became an international standard when ratified by the ISO in 1986. Since then, many publishing and processing packages and tools have been developed.

Generic coding was a breakthrough for digital content. Finally, content could be described for what it was, instead of how to display it. Something like this looks more like a database than a word-processing file:

<personnel-record>
  <name>
    <first>Rita</first>
    <last>Book</last>
  </name>
  <birthday>
    <year>1969</year>
    <month>4</month>
    <day>23</day>
  </birthday>
</personnel-record>

Notice the lack of presentational information. You can format the name any way you want: first name then last name, or last name first, with a comma. You could format the date in American style (4/23/1969) or European (23/4/1969) simply by specifying whether the <month> or <day> element should present its contents first. The document doesn't dictate its use, which makes it useful as a source document for multiple destinations.
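As a rough illustration of that last point, the following Python sketch reads the record above and produces both name orders and both date styles from the same source. It uses the standard library's xml.etree.ElementTree module; the helper name field and the output variable names are our own inventions, not part of any standard.

```python
import xml.etree.ElementTree as ET

RECORD = """<personnel-record>
  <name>
    <first>Rita</first>
    <last>Book</last>
  </name>
  <birthday>
    <year>1969</year>
    <month>4</month>
    <day>23</day>
  </birthday>
</personnel-record>"""

record = ET.fromstring(RECORD)

def field(path):
    """Return the text of the element at path, relative to the record."""
    return record.findtext(path).strip()

first, last = field("name/first"), field("name/last")
year, month, day = (field("birthday/" + part) for part in ("year", "month", "day"))

# The same source data, presented four different ways.
name_first_last = "%s %s" % (first, last)
name_last_first = "%s, %s" % (last, first)
date_american = "/".join((month, day, year))
date_european = "/".join((day, month, year))

print(name_first_last)   # Rita Book
print(name_last_first)   # Book, Rita
print(date_american)     # 4/23/1969
print(date_european)     # 23/4/1969
```

None of the formatting decisions live in the document; they all live in the program that consumes it, so a new destination format only means a new consumer.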

In spite of its revolutionary capabilities, SGML never really caught on with small companies the way it did with the big ones. SGML software is expensive and bulky. It takes a team of developers to set up and configure a production environment around SGML. SGML feels bureaucratic, confusing, and resource-heavy. Thus, SGML in its original form was not ready to take the world by storm.

"Oh really," you say. "Then what about HTML? Isn't it true that HTML is an application of SGML?" HTML, that celebrity of the Internet, the harbinger of hypertext and workhorse of the World Wide Web, is indeed an application of SGML. By application, we mean that it is a markup language derived with the rules of SGML. SGML isn't a markup language, but a toolkit for designing your own descriptive markup language. Besides HTML, languages for encoding technical documentation, IRS forms, and battleship manuals are in use.

HTML is indeed successful, but it has limitations. It's a very small language, and not very descriptive. It is closer to troff in function than to DocBook and other SGML applications. It has tags like <i> and <b> that change the font style without saying why. Because HTML is so limited and at least partly presentational, it doesn't represent an overwhelming success for SGML, at least not in spirit. Instead of bringing the power of generic coding to the people, it brought another one-trick pony, in which you could display your content in a particular venue and couldn't do much else with it.

Thus, the standards folk decided to try again and see if they couldn't arrive at a compromise between the descriptive power of SGML and the simplicity of HTML. They came up with the Extensible Markup Language (XML). The "X" stands for "extensible," pointing out the first obvious difference from HTML, which is that some people think that "X" is a cooler-sounding letter than "E" when used in an acronym. The second and more relevant difference is that your documents don't have to be stuck in the anemic tag set of HTML. You can extend the tag namespace to be as descriptive as you want -- as descriptive, even, as SGML. Voilà! The bridge is built.

By all accounts, XML is a smashing success. It has lived up to the hype and keeps on growing: XML-RPC, XHTML, SVG, and DocBook XML are some of its products. It comes with several accessories, including XSL for formatting, XSLT for transforming, XPath for searching, and XLink for linking. Much of the standards work is under the auspices of the World Wide Web Consortium (W3C), an organization whose members include Microsoft, Sun, IBM, and many academic and public institutions.

The W3C's mandate is to research and foster new technology for the Internet. That's a rather broad statement, but if you visit their site at http://www.w3.org/ you'll see that they cover a lot of bases. The W3C doesn't create, police, or license standards. Rather, they make recommendations that developers are encouraged, but not required, to follow.[4]

[4]When a trusted body like the W3C makes a recommendation, it often has the effect of a law; many developers begin to follow the recommendation upon its release, and developers who hope to write software that is compatible with everyone else's (which is the whole point behind standards like XML) had better follow the recommendation as well.

However, the system remains open enough to allow healthy dissent, such as the recent and interesting case of XML Schema, a W3C recommendation that has generated controversy and competition. We'll examine this particular story further in Chapter 3, "XML Basics: Reading and Writing." The W3C process is strong enough to be taken seriously, but loose enough not to scare people away. The recommendations are always available to the public.

Every developer should have working knowledge of XML, since it's the universal packing material for data, and so many programs are all about crunching data. The rest of this chapter gives a quick introduction to XML for developers.



Copyright © 2002 O'Reilly & Associates. All rights reserved.