Appendix A. The XML You Need for OpenDocument

The purpose of this appendix is to introduce you to XML. A knowledge of XML is essential if you wish to manipulate OpenDocument files directly, since XML is the basis of the OpenDocument format.

If you’re already acquainted with XML, you don’t need to read this appendix. If not, read on. The general overview of XML given in this appendix should be more than sufficient to enable you to work with OpenDocument documents. For further information about XML, the O’Reilly books Learning XML by Erik T. Ray and XML in a Nutshell by Elliotte Rusty Harold and W. Scott Means are invaluable guides, as is the weekly online magazine XML.com.

Note that this appendix makes frequent reference to the formal XML 1.0 specification, which can be used for further investigation of topics that fall outside the scope of this book. Readers are also directed to the "Annotated XML Specification," written by Tim Bray and published online at http://XML.com/, which provides illuminating explanation of the XML 1.0 specification; and to "What is XML?" by Norm Walsh, also published on XML.com.

XML, the Extensible Markup Language, is an Internet-friendly format for data and documents, invented by the World Wide Web Consortium (W3C). The word "Markup" denotes a way of expressing the structure of a document within the document itself. XML has its roots in a markup language called SGML (Standard Generalized Markup Language), which is used in publishing; XML shares this heritage with HTML. XML was created to do for machine-readable documents on the Web what HTML did for human-readable documents - that is, provide a commonly agreed-upon syntax so that processing the underlying format becomes a commodity and documents are made accessible to all users.

Unlike HTML, though, XML comes with very little predefined. HTML developers are accustomed both to the notion of using angle brackets < > for denoting elements (that is, syntax), and also to the set of element names themselves (such as head, body, etc.). XML only shares the former feature (i.e., the notion of using angle brackets for denoting elements). Unlike HTML, XML has no predefined elements, but is merely a set of rules that lets you write other languages like HTML.[17] Because XML defines so little, it is easy for everyone to agree to use the XML syntax, and then to build applications on top of it. It’s like agreeing to use a particular alphabet and set of punctuation symbols, but not saying which language to use. However, if you’re coming to XML from an HTML background, then prepare yourself for the shock of having to choose what to call your tags!

Knowing that XML’s roots lie with SGML should help you understand some of XML’s features and design decisions. Note that, although SGML is essentially a document-centric technology, XML’s functionality also extends to data-centric applications, including OpenDocument. Commonly, data-centric applications do not need all the flexibility and expressiveness that XML provides and limit themselves to employing only a subset of XML’s functionality.

The best way to explain how an XML document is composed is to present one. The following example shows an XML document you might use to describe two authors:

<?xml version="1.0" encoding="us-ascii"?>
<authors>
    <person id="lear">
        <name>Edward Lear</name>
        <nationality>British</nationality>
    </person>
    <person id="asimov">
        <name>Isaac Asimov</name>
        <nationality>American</nationality>
    </person>
    <person id="mysteryperson"/>
</authors>

The first line of the document is known as the XML declaration. This tells a processing application which version of XML you are using - the version indicator is mandatory[18] - and which character encoding you have used for the document. In the previous example, the document is encoded in ASCII. (The significance of character encoding is covered later in this appendix.) If the XML declaration is omitted, a processor will make certain assumptions about your document. In particular, it will expect it to be encoded in UTF-8, an encoding of the Unicode character set. However, it is best to use the XML declaration wherever possible, both to avoid confusion over the character encoding and to indicate to processors which version of XML you’re using.

The second line of the example begins an element, which has been named "authors." The contents of that element include everything between the right angle bracket (>) in <authors> and the left angle bracket (<) in </authors>. The actual syntactic constructs <authors> and </authors> are often referred to as the element start tag and end tag, respectively. Do not confuse tags with elements! Note that elements may include other elements, as well as text. An XML document must contain exactly one root element, which contains all other content within the document. The name of the root element defines the type of the XML document.

Elements that contain both text and other elements simultaneously are classified as mixed content. Many OpenDocument elements contain mixed content.
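To make mixed content concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The element names para and emph are hypothetical and chosen only for illustration; they are not OpenDocument elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; "para" holds both text and a child element.
para = ET.fromstring("<para>An owl and a <emph>pussycat</emph> went to sea</para>")

# ElementTree stores the text before the first child in .text and the
# text that follows a child element in that child's .tail.
emph = para.find("emph")
leading_text = para.text       # text before <emph>
child_text = emph.text         # text inside <emph>
trailing_text = emph.tail      # text after </emph>

# itertext() walks the text in document order, ignoring the tags.
full_text = "".join(para.itertext())
print(full_text)
```

Other XML libraries expose mixed content differently (for example, as separate text nodes), so the .text and .tail split is a feature of this particular module rather than of XML itself.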

The sample "authors" document uses elements named person to describe the authors themselves. Each person element has an attribute named id. Unlike elements, attributes can only contain textual content. Their values must be surrounded by quotes. Either single quotes (') or double quotes (") may be used, as long as you use the same kind of closing quote as the opening one.
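As a rough illustration of how a program sees these elements and attributes, the following Python sketch parses the "authors" document with the standard library's xml.etree.ElementTree module. The choice of Python and of this module is an assumption made for the example; any conformant XML parser would report the same information.

```python
import xml.etree.ElementTree as ET

# The "authors" document from above, as bytes so that the parser
# honors the encoding named in the XML declaration.
AUTHORS_XML = b"""<?xml version="1.0" encoding="us-ascii"?>
<authors>
    <person id="lear">
        <name>Edward Lear</name>
        <nationality>British</nationality>
    </person>
    <person id="asimov">
        <name>Isaac Asimov</name>
        <nationality>American</nationality>
    </person>
    <person id="mysteryperson"/>
</authors>
"""

root = ET.fromstring(AUTHORS_XML)
root_name = root.tag                       # name of the root element

# Collect each person's id attribute and, where present, the name text.
people = []
for person in root.findall("person"):
    name = person.find("name")
    people.append((person.get("id"), name.text if name is not None else None))

print(root_name)
for person_id, person_name in people:
    print(person_id, person_name)
```

Note that the attribute value arrives as a plain string, with the surrounding quotes already removed by the parser.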

Within XML documents, attributes are frequently used for metadata (i.e., "data about data")–describing properties of the element’s contents. This is the case in our example, where id contains a unique identifier for the person being described.

As far as XML is concerned, it does not matter in which order attributes are presented in the element start tag. For example, these two elements contain exactly the same information as far as an XML 1.0 conformant processing application is concerned:

<animal name="dog" legs="4"/>
<animal legs="4" name="dog"/>

On the other hand, the information presented to an application by an XML processor on reading the following two lines will be different for each animal element because the ordering of elements is significant:

<animal><name>dog</name><legs>4</legs></animal>
<animal><legs>4</legs><name>dog</name></animal>

XML treats a set of attributes like a bunch of stuff in a bag–there is no implicit ordering–while elements are treated like items on a list, where ordering matters.
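The bag-versus-list distinction can be checked with a short Python sketch, again assuming the standard library's xml.etree.ElementTree module, which happens to expose attributes as a dictionary and child elements as an ordered sequence.

```python
import xml.etree.ElementTree as ET

# Attribute order: the two start tags carry the same set of attributes.
a1 = ET.fromstring('<animal name="dog" legs="4"/>')
a2 = ET.fromstring('<animal legs="4" name="dog"/>')
same_attributes = a1.attrib == a2.attrib      # dict comparison ignores order

# Element order: the children arrive as an ordered sequence.
e1 = ET.fromstring("<animal><name>dog</name><legs>4</legs></animal>")
e2 = ET.fromstring("<animal><legs>4</legs><name>dog</name></animal>")
order1 = [child.tag for child in e1]
order2 = [child.tag for child in e2]

print(same_attributes)
print(order1, order2)
```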

New XML developers frequently ask when it is best to use attributes to represent information and when it is best to use elements. As you can see from the animal example above, if order is important to you, then elements are a good choice. Beyond that, there is no hard-and-fast "best practice" for choosing whether to use attributes or elements.

The final author described in our document has no information available. All we know about this person is his or her ID, mysteryperson. The document uses the XML shortcut syntax for an empty element. The following is a reasonable alternative:

<person id="mysteryperson"></person>
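A quick way to convince yourself that the two spellings are equivalent is to parse both and compare what the parser reports. This is a minimal sketch using Python's xml.etree.ElementTree; the summary helper is a hypothetical name introduced only for this example.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<person id="mysteryperson"/>')
long_form = ET.fromstring('<person id="mysteryperson"></person>')

def summary(element):
    """Return the tag, attributes, text, and child count of an element."""
    return (element.tag, dict(element.attrib), element.text, len(element))

# Both forms yield an element with one attribute, no text, and no children.
forms_match = summary(short_form) == summary(long_form)
print(forms_match)
```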

As in HTML, it is possible to include comments within XML documents. XML comments are intended to be read only by people. With HTML, developers have occasionally employed comments to add application-specific functionality–for example, the server-side include functionality of most web servers uses instructions embedded in HTML comments. XML provides other means of indicating application processing instructions,[20] so comments should not be used for any purpose other than those for which they were intended.

The start of a comment is indicated with <!--, and the end of the comment with -->. Any sequence of characters, aside from the string --, may appear within a comment.
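The sketch below, which assumes Python's xml.etree.ElementTree with its default settings, shows both points: a comment is skipped when the document is read, and a comment containing the string -- is rejected as not well-formed.

```python
import xml.etree.ElementTree as ET

# A comment between elements is legal; the default parser simply drops it.
root = ET.fromstring("<authors><!-- more to come --><person id='lear'/></authors>")
child_tags = [child.tag for child in root]

# The string "--" inside a comment makes the document ill-formed.
try:
    ET.fromstring("<authors><!-- not -- allowed --></authors>")
    double_hyphen_rejected = False
except ET.ParseError:
    double_hyphen_rejected = True

print(child_tags, double_hyphen_rejected)
```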

Comments tend to be used more in XML documents intended for human consumption than those intended for machine consumption. Since OpenDocument files are almost always intended for machine consumption, they will contain few if any comments.

Another feature of XML that is occasionally useful when creating XML documents is the mechanism for escaping characters.

Because some characters have special significance in XML, there needs to be a way to represent them. For example, in some cases the < symbol might really be intended to mean "less than" rather than to signal the start of an element name. Clearly, just inserting the character without any escaping mechanism would result in a poorly formed document because a processing application would assume you were commencing another element. Another instance of this problem is that of needing to include both double quotes and single quotes simultaneously in an attribute’s value. Here’s an example that illustrates both these difficulties:

<badDoc>
  <para>
    I'd really like to use the < character
  </para>
  <note title="On the proper 'use' of the " character"/>
</badDoc>

XML avoids this problem by the use of the (fearsomely named) predefined entity references. The word entity in the context of XML simply means a unit of content. The term entity reference means just that, a symbolic way of referring to a certain unit of content. XML predefines entities for the following symbols: left angle bracket (<), right angle bracket (>), apostrophe ('), double quote ("), and ampersand (&).

An entity reference is introduced with an ampersand (&), which is followed by a name (using the word "name" in its formal sense, as defined by the XML 1.0 specification), and terminated with a semicolon (;). Table A.2, "Predefined entity references in XML 1.0," shows how the five predefined entities can be used within an XML document.

Here’s our problematic document revised to use entity references:

<badDoc>
  <para>
    I'd really like to use the &lt; character
  </para>
  <note title="On the proper &apos;use&apos; of the &quot; character"/>
</badDoc>
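When XML is generated by a program, the escaping is usually delegated to a library rather than typed by hand. The following is a minimal Python sketch using the escape and quoteattr functions from the standard library's xml.sax.saxutils module; the element name goodDoc is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

para_text = "I'd really like to use the < character"
title_text = "On the proper 'use' of the \" character"

# escape() replaces &, < and > in character data.
escaped_para = escape(para_text)

# quoteattr() escapes the value and also chooses and adds the quotes.
quoted_title = quoteattr(title_text)

document = f"<goodDoc><para>{escaped_para}</para><note title={quoted_title}/></goodDoc>"

# A parser turns the entity references back into the original characters.
root = ET.fromstring(document)
round_trip_ok = (root.find("para").text == para_text
                 and root.find("note").get("title") == title_text)
print(document)
print(round_trip_ok)
```

Because the value contains both kinds of quote, quoteattr wraps it in double quotes and escapes only the embedded double quote; the apostrophes can stay as they are.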

Being able to use the predefined entities is all you need for OpenDocument; in general, entities are provided as a convenience for human-created XML. XML 1.0 allows you to define your own entities and use entity references as "shortcuts" in your document. Section 4 of the XML 1.0 specification, available at http://www.w3.org/TR/REC-xml#sec-physical-struct, describes the use of entities.

The subject of character encodings is frequently a mysterious one for developers. Most code tends to be written for one computing platform and, normally, to run within one organization. Although the Internet is changing things quickly, most of us have never had cause to think too deeply about internationalization.

XML, designed to be an Internet-friendly syntax for information exchange, has internationalization at its very core. One of the basic requirements for XML processors is that they support the Unicode standard character encoding. Unicode attempts to include the requirements of all the world’s languages within one character set. Consequently, it is very large!

Unicode 3.0 has more than 57,700 code points, each of which corresponds to a character.[21] If one were to express a Unicode string by using the position of each character in the character set as its encoding (in the same way as ASCII does), expressing the whole range of characters would require 4 octets[22] for each character. Clearly, if a document is written in 100 percent American English, it will be four times larger than required - all the characters in ASCII fitting into a 7-bit representation. This places a strain both on storage space and on memory requirements for processing applications.

Fortunately, two encoding schemes for Unicode alleviate this problem: UTF-8 and UTF-16. As you might guess from their names, applications can process documents in these encodings in 8- or 16-bit segments at a time. When code points are required in a document that cannot be represented by one chunk, a bit-pattern is used that indicates that the following chunk is required to calculate the desired code point. In UTF-8 this is denoted by the most significant bit of the first octet being set to 1.

This scheme means that UTF-8 is a highly efficient encoding for representing languages using Latin alphabets, such as English. All of the ASCII character set is represented natively in UTF-8–an ASCII-only document and its equivalent in UTF-8 are byte-for-byte identical. OpenDocument files are always encoded in UTF-8.

This knowledge will also help you debug encoding errors. One frequent error arises because ASCII is a proper subset of UTF-8: programmers get used to this fact and produce UTF-8 documents, but use them as if they were ASCII. Things start to go awry when the XML parser processes a document containing, for example, characters from foreign languages, such as Á. Because this character cannot be represented using only one octet in UTF-8, this produces a two-octet sequence in the output document; in a non-Unicode viewer or text editor, it looks like a couple of characters of garbage.
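The effect is easy to reproduce. The following Python sketch encodes an ASCII-only string and the character Á, then deliberately misreads the UTF-8 octets as ISO-8859-1, which is an assumption standing in for whatever one-octet encoding a non-Unicode viewer might use.

```python
ascii_text = "Edward Lear"
accented = "\u00c1"          # the character Á (LATIN CAPITAL LETTER A WITH ACUTE)

# ASCII-only text: the ASCII and UTF-8 encodings are byte-for-byte identical.
ascii_identical = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Á does not fit in one octet in UTF-8; it becomes the two octets C3 81.
utf8_bytes = accented.encode("utf-8")
lead_bit_set = bool(utf8_bytes[0] & 0x80)   # most significant bit of first octet

# Misreading those octets with a one-octet-per-character encoding
# (ISO-8859-1 here) yields two characters instead of one.
misread = utf8_bytes.decode("iso-8859-1")

print(ascii_identical, utf8_bytes.hex(), lead_bit_set, len(misread))
```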

In addition to well-formedness, XML 1.0 offers another level of verification, called validity. To explain why validity is important, let’s take a simple example. Imagine you invented a simple XML format for your friends’ telephone numbers:

<phonebook>
  <person>
    <name>Albert Smith</name>
    <number>123-456-7890</number>
  </person>
  <person>
    <name>Bertrand Jones</name>
    <number>456-123-9876</number>
  </person>
</phonebook>

Based on your format, you also construct a program to display and search your phone numbers. This program turns out to be so useful, you share it with your friends. However, your friends aren’t so hot on detail as you are, and try to feed your program this phone book file:

<phonebook>
  <person>
    <name>Melanie Green</name>
    <phone>123-456-7893</phone>
  </person>
</phonebook>

Note that, although this file is perfectly well-formed, it doesn’t fit the format you prescribed for the phone book, and you find you need to change your program to cope with this situation. If your friends had used number as you did to denote the phone number, and not phone, there wouldn’t have been a problem. However, as it is, this second file is not a valid phonebook document.

For validity to be a useful general concept, we need a machine-readable way of saying what a valid document is; that is, which elements and attributes must be present and in what order. XML 1.0 achieves this by introducing document type definitions (DTDs). For the purposes of OpenDocument, you don’t need to know much about DTDs. OpenDocument uses a different system called RELAX NG to specify in great detail exactly which combinations of elements and attributes make up a valid document.
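To see what such a check involves, here is a hand-written Python sketch that tests a document against the phonebook format described above. It is neither a DTD nor a RELAX NG schema, and the function name phonebook_problems is hypothetical; a real schema expresses the same rules declaratively so that a general-purpose validator can enforce them.

```python
import xml.etree.ElementTree as ET

def phonebook_problems(xml_text):
    """List the ways a phonebook document departs from the expected format."""
    root = ET.fromstring(xml_text)      # raises ParseError if not well-formed
    if root.tag != "phonebook":
        return [f"root element is <{root.tag}>, expected <phonebook>"]
    problems = []
    for index, person in enumerate(root, start=1):
        child_tags = [child.tag for child in person]
        if person.tag != "person":
            problems.append(f"entry {index}: unexpected <{person.tag}>")
        elif child_tags != ["name", "number"]:
            problems.append(f"entry {index}: expected name, number; found {child_tags}")
    return problems

good = "<phonebook><person><name>Albert Smith</name><number>123-456-7890</number></person></phonebook>"
bad = "<phonebook><person><name>Melanie Green</name><phone>123-456-7893</phone></person></phonebook>"

print(phonebook_problems(good))
print(phonebook_problems(bad))
```

Both inputs are well-formed, so parsing succeeds in each case; only the second one produces a complaint, because it uses phone where the format calls for number.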

XML 1.0 lets developers create their own elements and attributes, but leaves open the potential for overlapping names. For example, a bookstore’s customer file might use <title> to hold a value like Mrs. or Dr., whereas its catalog document might use <title> to hold a value like Foundation's Edge. The Namespaces in XML specification (which can be found at http://www.w3.org/TR/REC-xml-names/) provides a mechanism developers can use to identify particular vocabularies by using Uniform Resource Identifiers (URIs).

In our hypothetical situation, the bookstore may need to combine markup from these vocabularies when preparing an invoice. In order to distinguish the two <title> elements, each vocabulary is assigned a unique URI. The URI for the customer list might be http://bookstore.example.com/clients and the catalog’s URI might be http://bookstore.example.com/catalogML. To be complete, the bookstore has a third vocabulary for the elements and attributes specific to the invoice: http://bookstore.example.com/invoiceML. Each of these URIs is associated with a prefix, which is attached to an element name in order to identify which vocabulary it belongs to.

Using namespaces, an invoice might look like this:

<inv:invoice xmlns:inv="http://bookstore.example.com/invoiceML"
    xmlns:customer="http://bookstore.example.com/clients"
    xmlns:book="http://bookstore.example.com/catalogML">
    
    <inv:number>345-0123</inv:number>
    <inv:date>2003-11-08</inv:date>
    <inv:ship-to>
        <customer:title>Dr.</customer:title>
        <customer:name>V. Thant</customer:name>
        <customer:address>2211 Crestview Drive</customer:address>
        <!-- etc -->
    </inv:ship-to>
    <inv:item>
        <book:book isbn="0-345-30898-0">
            <book:title>Foundation's Edge</book:title>
            <book:date>1982</book:date>
            <book:author>Isaac Asimov</book:author>
            <!-- etc. -->
        </book:book>
    </inv:item>
</inv:invoice>

In the opening tag, we establish three namespaces. The first one says that any element or attribute with the prefix inv comes from the vocabulary associated with the URI http://bookstore.example.com/invoiceML. [24] The prefixes customer and book are associated with the other two URIs. When we write elements in the invoice document, we write them with the appropriate vocabulary prefix followed by a colon. By using these namespaces and their prefixes, it is now possible for a program to go through invoices and extract only the book titles, avoiding the Doctors and Misters.
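Such a program might look like the following Python sketch, which assumes the standard library's xml.etree.ElementTree module and uses an abbreviated copy of the invoice. The NAMESPACES dictionary is a name introduced for this example; it maps the prefixes used in the search paths to the URIs from the document.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the invoice shown above.
INVOICE_XML = """<inv:invoice xmlns:inv="http://bookstore.example.com/invoiceML"
    xmlns:customer="http://bookstore.example.com/clients"
    xmlns:book="http://bookstore.example.com/catalogML">
    <inv:ship-to>
        <customer:title>Dr.</customer:title>
        <customer:name>V. Thant</customer:name>
    </inv:ship-to>
    <inv:item>
        <book:book isbn="0-345-30898-0">
            <book:title>Foundation's Edge</book:title>
        </book:book>
    </inv:item>
</inv:invoice>"""

# Prefixes used in the search paths; only the URIs have to match the document.
NAMESPACES = {
    "customer": "http://bookstore.example.com/clients",
    "book": "http://bookstore.example.com/catalogML",
}

root = ET.fromstring(INVOICE_XML)

# ElementTree expands each prefixed name to "{URI}local-name".
root_tag = root.tag

book_titles = [t.text for t in root.findall(".//book:title", NAMESPACES)]
customer_titles = [t.text for t in root.findall(".//customer:title", NAMESPACES)]
print(root_tag)
print(book_titles, customer_titles)
```

The prefix itself carries no meaning; a search that mapped some other prefix to http://bookstore.example.com/catalogML would find the same title elements.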

Since an OpenDocument file uses information from many different vocabularies, it will establish a large number of these namespace prefixes and URI associations in the opening element. OpenDocument may also use attributes from different vocabularies within an element. When it does so, it puts a namespace prefix on the attribute name as well.
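As a small illustration of a prefixed attribute, the Python sketch below parses a single paragraph element written in the style of OpenDocument. The fragment is a simplified assumption made for this example rather than an excerpt from a real file, which would declare many more namespaces on its root element.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the OpenDocument "text" vocabulary.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

fragment = (
    f'<text:p xmlns:text="{TEXT_NS}" text:style-name="Standard">'
    "Hello</text:p>"
)
paragraph = ET.fromstring(fragment)

# A prefixed attribute is looked up by its expanded "{URI}local-name" form.
style_name = paragraph.get(f"{{{TEXT_NS}}}style-name")
unqualified = paragraph.get("style-name")   # no such unprefixed attribute
print(style_name, unqualified)
```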

Namespaces are very simple on the surface, but are a well-known field of combat in XML arcana. For more information on namespaces, see Tim Bray’s "XML Namespaces by Example," published at http://www.xml.com/pub/a/1999/01/namespaces.html. You can also get more information in the books Learning XML and XML in a Nutshell, published by O’Reilly & Associates.

Many parsers exist for using XML with many different programming languages. Most of these tools are freely available, the majority being Open Source.



[17] To clarify XML’s relationship with SGML: XML is an SGML subset. By contrast, HTML is an SGML application. OpenDocument uses XML to express its operations and thus is an XML application.

[18] For reasons that will be clearer later, constructs such as version in the XML declaration are known as pseudoattributes.

[19] Actually, a name may also contain a colon, but the colon is used to delimit a namespace prefix and is not available for arbitrary use. (The section called “XML Namespaces” discusses namespaces in more detail.)

[20] A discussion of processing instructions (PIs) is outside the scope of this book. For more information on PIs, see Section 2.6 of the XML 1.0 specification, at the URL http://www.w3.org/TR/REC-xml#sec-pi.

[21] You can obtain charts of all these characters online by visiting http://www.unicode.org/charts/.

[22] An octet is a string of 8 binary digits, or bits. A byte is commonly, but not always, considered the same thing as an octet.

[23] See http://www.docbook.org/.

[24] The <invoice> element is part of the invoice vocabulary, so it receives the inv prefix.


Copyright (c) 2005 O’Reilly & Associates, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".