Now that you have had a quick taste of working with XML, here is an overview of the more common rules and constructs of the XML language.
These are the rules for a well-formed XML document:
All element attribute values must be in quotation marks.
An element must have both an opening and a closing tag, unless it is an empty element.
If a tag is a standalone empty element, it must contain a closing slash (/) before the end of the tag.
All opening and closing element tags must nest correctly.
Isolated markup characters are not allowed in text; < and & must be written as the entity references &lt; and &amp;. In addition, the sequence ]]> must be expressed as ]]&gt; when used as regular text. (Entity references are discussed in further detail later.)
In a well-formed XML document without a corresponding DTD, all attributes are treated as type CDATA by default.
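As a rough illustration of these rules, the sketch below feeds a few small documents to Python's standard `xml.etree.ElementTree` parser and reports which ones it accepts. The helper name `is_well_formed` and the sample strings are made up for this example; the parser only checks well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, one per rule discussed above.
samples = {
    "quoted attribute": '<price currency="Euro">10</price>',
    "unquoted attribute": "<price currency=Euro>10</price>",
    "missing end tag": "<para>text",
    "empty element": "<para><br/></para>",
    "bad nesting": "<a><b></a></b>",
    "bare ampersand": "<a>5 & 6</a>",
    "escaped ampersand": "<a>5 &amp; 6</a>",
}

for label, doc in samples.items():
    print(f"{label}: {is_well_formed(doc)}")
```

Only the quoted-attribute, empty-element, and escaped-ampersand samples should be accepted.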
XML uses the following special markup constructs.
<?xml ...?> |
Although they are not required to, XML documents typically begin with an XML declaration, which must start with the characters <?xml and end with the characters ?>. Its attributes are version, the optional encoding, and the optional standalone (yes or no).
For example:
<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml version="number" [encoding="encoding"] [standalone="yes|no"] ?>
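To see how a parser exposes the declaration, here is a minimal sketch using Python's standard `xml.dom.minidom`. The `<Book/>` root element is an assumption added so that the document is complete; the `version`, `encoding`, and `standalone` attributes are how minidom happens to report the declaration, and other DOM implementations may name them differently.

```python
from xml.dom import minidom

# Bytes input, so the parser reads the declared encoding itself.
doc = minidom.parseString(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><Book/>'
)

# minidom stores the declaration's pseudo-attributes on the Document node.
print("version:", doc.version)
print("encoding:", doc.encoding)
print("standalone:", doc.standalone)
```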
<?...?> |
A processing instruction allows developers to place attributes specific to an outside application within the document. Processing instructions always begin with the characters <? and end with the characters ?>. For example:
<?works document="hello.doc" data="hello.wks"?>
You can create your own processing instructions if the XML application processing the document is aware of what the data means and acts accordingly.
<?target attribute1="value" attribute2="value" ... ?>
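The following sketch reads the processing instruction from the example above with Python's standard `xml.dom.minidom`. The `<memo/>` root element is a placeholder added for this illustration. Note that the parser hands back everything after the target as one string; splitting it into name/value pairs is left to the application, done here with a simple regular expression that assumes double-quoted values.

```python
import re
from xml.dom import minidom

doc = minidom.parseString(
    '<?works document="hello.doc" data="hello.wks"?><memo/>'
)

# A processing instruction placed before the root element is the
# first child of the Document node.
pi = doc.firstChild
print("target:", pi.target)
print("data:  ", pi.data)

# Application-level parsing of the attribute-like data (assumes
# name="value" pairs with double quotes).
pseudo_attrs = dict(re.findall(r'(\w+)="([^"]*)"', pi.data))
print(pseudo_attrs)
```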
<!DOCTYPE> |
The <!DOCTYPE> instruction allows you to specify a DTD for an XML document. This instruction currently takes one of two forms:
<!DOCTYPE root-element SYSTEM "URI_of_DTD">
<!DOCTYPE root-element PUBLIC "name" "URI_of_DTD">
For example:
<!DOCTYPE Book SYSTEM "http://mycompany.com/dtd/mydoctype.dtd">
<!DOCTYPE Book PUBLIC "-//O'Reilly//DTD//EN" "http://www.oreilly.com/dtd/xmlbk.dtd">
Public DTDs follow a specific naming convention. See the XML specification for details on naming public DTDs.
<!DOCTYPE root-element SYSTEM|PUBLIC ["name"] "URI_of_DTD">
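A short sketch of how the two forms look to a parser, again using Python's standard `xml.dom.minidom`. The empty `<Book/>` root is an assumption added to make each document complete. As used here, minidom does not fetch the DTD; it only records the name and identifiers from the declaration.

```python
from xml.dom import minidom

system_doc = minidom.parseString(
    '<!DOCTYPE Book SYSTEM "http://mycompany.com/dtd/mydoctype.dtd"><Book/>'
)
public_doc = minidom.parseString(
    '<!DOCTYPE Book PUBLIC "-//O\'Reilly//DTD//EN" '
    '"http://www.oreilly.com/dtd/xmlbk.dtd"><Book/>'
)

for doc in (system_doc, public_doc):
    doctype = doc.doctype
    # name is the root element; publicId is absent for the SYSTEM form.
    print(doctype.name, doctype.publicId, doctype.systemId)
```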
<!-- ... --> |
You can place comments anywhere in an XML document, except within element tags or before the XML declaration. Comments always start with the characters <!-- and end with the characters -->, and they may not contain a double hyphen (--) anywhere in between. The contents of the comment are ignored by the XML processor. For example:
<!-- Sales Figures Start Here -->
<Units>2000</Units>
<Cost>49.95</Cost>
<!-- comments -->
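The comment rules can be checked against a real parser. In this sketch, which uses Python's standard `xml.etree.ElementTree`, the `<Sales>` wrapper element and the helper name `parses` are assumptions introduced for the example. By default this parser simply drops comments from the tree it builds.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if `text` is accepted by the parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = (
    "<Sales><!-- Sales Figures Start Here -->"
    "<Units>2000</Units><Cost>49.95</Cost></Sales>"
)
double_hyphen = "<Sales><!-- 2000 -- 3000 --></Sales>"
before_declaration = '<!-- note --><?xml version="1.0"?><Sales/>'

# The comment is ignored: only the two elements appear as children.
child_tags = [child.tag for child in ET.fromstring(good)]
print(child_tags)

print("double hyphen accepted:", parses(double_hyphen))
print("comment before declaration accepted:", parses(before_declaration))
```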
CDATA |
You can define special sections of character data, or CDATA, which the XML processor does not attempt to interpret as markup. Anything included inside a CDATA section is treated as plain text. CDATA sections begin with the characters <![CDATA[ and end with the characters ]]>. For example:
<![CDATA[ I'm now discussing the <element> tag of documents 5 & 6: "Sales" and "Profit and Loss". Luckily, the XML processor won't apply rules of formatting to these sentences! ]]>
Note that entity references inside a CDATA section will not be expanded.
<![CDATA[ ... ]]>
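A minimal sketch of this behavior with Python's standard `xml.etree.ElementTree`; the `<note>` element and its text are invented for the example. The markup characters and the entity reference inside the CDATA section come through as literal text, while the same reference in ordinary content is expanded.

```python
import xml.etree.ElementTree as ET

in_cdata = ET.fromstring(
    "<note><![CDATA[The <element> tag of documents 5 & 6 "
    "uses &amp; literally.]]></note>"
)
in_text = ET.fromstring("<note>Documents 5 &amp; 6</note>")

# Inside CDATA: no markup is recognized and &amp; is not expanded.
print(in_cdata.text)
# In ordinary text: the entity reference is expanded to "&".
print(in_text.text)
```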
An element is either bound by its start and end tags or is an empty element. Elements can contain text, other elements, or a combination of both. For example:
<para> Elements can contain text, other elements, or a combination. For example, a chapter might contain a title and multiple paragraphs, and a paragraph might contain text and <emphasis>emphasis elements</emphasis>. </para>
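Mixed content like this is usually exposed by a parser as a sequence of text runs and child elements. The sketch below shows one way to walk it with Python's standard `xml.etree.ElementTree`, which stores the text before the first child in `text` and the text following a child in that child's `tail`. The shortened paragraph is an adaptation of the example above.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring(
    "<para>A paragraph might contain text and "
    "<emphasis>emphasis elements</emphasis>.</para>"
)
emphasis = para.find("emphasis")

print(repr(para.text))      # text before the child element
print(repr(emphasis.text))  # text inside the child element
print(repr(emphasis.tail))  # text after the child, still inside <para>

# itertext() yields every text run in document order.
full_text = "".join(para.itertext())
print(full_text)
```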
An element name must start with a letter or an underscore. It can then have any number of letters, numbers, hyphens, periods, or underscores in its name. Elements are case-sensitive: <Para>, <para>, and <pArA> are considered three different element types.
Element type names may not start with the string xml in any variation of upper- or lowercase. Names beginning with xml are reserved for special uses by the W3C XML Working Group. Colons (:) are permitted in element type names only for specifying namespaces; otherwise, colons are forbidden. For example:
Example | Comment |
---|---|
<Italic> | Legal |
<_Budget> | Legal |
<Punch line> | Illegal: has a space |
<205Para> | Illegal: starts with a number |
<repair@log> | Illegal: contains @ character |
<xmlbob> | Illegal: starts with xml |
Element type names can also include accented Roman characters, letters from other alphabets (e.g., Cyrillic, Greek, Hebrew, Arabic, Thai, Hiragana, Katakana, or Devanagari), and ideograms from the Chinese, Japanese, and Korean languages. Valid element type names can therefore include <são>, <peut-être>, <più>, and <niño>, plus a number of others our publishing system isn't equipped to handle.
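The naming rules above can be approximated with a short check. This is only a sketch: the function name `is_legal_element_name` is hypothetical, colons are rejected outright rather than handled as namespace prefixes, and Python's `\w` class is used as a stand-in for the exact character tables in the XML specification, so edge cases may differ.

```python
import re

# First character: a letter or underscore (any script).
# Remaining characters: letters, digits, underscores, periods, hyphens.
_NAME = re.compile(r"[^\W\d][\w.\-]*")


def is_legal_element_name(name: str) -> bool:
    """Approximate check of the element naming rules described above."""
    if not _NAME.fullmatch(name):
        return False
    # Names beginning with "xml" in any mix of case are reserved.
    return not name.lower().startswith("xml")


for name in ["Italic", "_Budget", "Punch line", "205Para",
             "repair@log", "xmlbob", "peut-\u00eatre", "ni\u00f1o"]:
    print(f"<{name}>: {is_legal_element_name(name)}")
```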
If you use a DTD, the content of an element is constrained by its DTD declaration. Better XML applications inform you which elements and attributes can appear inside a specific element. Otherwise, you should check the element declaration in the DTD to determine the exact semantics.
Attributes describe additional information about an element. They always consist of a name and a value, as follows:
<price currency="Euro">
The attribute value is always quoted, using either single or double quotes. Attribute names are subject to the same restrictions as element type names.
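As a small illustration, the sketch below reads the attribute with Python's standard `xml.etree.ElementTree`. The element content `49.95` and the closing tag are assumptions added to complete the example. Either quoting style produces the same value.

```python
import xml.etree.ElementTree as ET

double_quoted = ET.fromstring('<price currency="Euro">49.95</price>')
single_quoted = ET.fromstring("<price currency='Euro'>49.95</price>")

# Attributes are exposed as a name -> value mapping; the quotes
# themselves are not part of the value.
print(double_quoted.attrib)
print(double_quoted.get("currency") == single_quoted.get("currency"))
```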
The following are reserved attributes in XML.
xml:lang |
The xml:lang attribute can be used on any element. Its value indicates the language of the body of the element. This is useful in a multilingual context. For example, you might have:
<para xml:lang="en">Hello</para> <para xml:lang="fr">Bonjour</para>
This format allows you to display one element or the other, depending on the user's language preference.
The syntax of the xml:lang value follows the IETF language-tag convention (RFC 1766 and its successors): a two-letter ISO 639 language code is optionally followed by a hyphen and a two-letter ISO 3166 country code. Traditionally, the language is given in lowercase and the country in uppercase (and for safety, this rule should be followed), but processors are expected to use the values in a case-insensitive manner.
In addition, the convention provides for IANA-registered tags (prefixed with i-) and private-use tags (prefixed with x-) to cover nonstandardized languages or language variants. xml:lang values therefore include notations such as en, en-US, en-GB, en-cockney, i-navajo, and x-minbari.
xml:lang="iso_639_identifier"
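One way an application might act on xml:lang is sketched below with Python's standard `xml.etree.ElementTree`. The `<doc>` wrapper element and the helper name `pick_paragraph` are assumptions made for this example. ElementTree reports attributes with the reserved xml prefix under the XML namespace URI, and the comparison is done case-insensitively, as described above.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = ET.fromstring(
    '<doc><para xml:lang="en">Hello</para>'
    '<para xml:lang="fr">Bonjour</para></doc>'
)


def pick_paragraph(root, preferred):
    """Return the text of the first <para> whose xml:lang matches."""
    wanted = preferred.lower()
    for para in root.iter("para"):
        if para.get(XML_LANG, "").lower() == wanted:
            return para.text
    return None


print(pick_paragraph(doc, "fr"))
print(pick_paragraph(doc, "EN"))
```

This sketch matches whole tags only; a fuller version might also let a preference of `en` match `en-US`.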
xml:space |
The xml:space attribute indicates whether any whitespace inside the element is significant and should not be altered by the XML processor. The attribute can take one of two enumerated values: default, which lets the application apply its own whitespace handling, and preserve, which asks that all whitespace be kept exactly as written.
You should set xml:space to preserve only if you want an element to behave like the HTML <pre> element, such as when it documents source code.
xml:space="default|preserve"
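The sketch below shows how an application might honor xml:space when preparing text for display, using Python's standard `xml.etree.ElementTree`. The element names `listing` and `para` and the helper `display_text` are hypothetical. Collapsing runs of whitespace is assumed here to be the application's default behavior, and the helper only looks at the element itself rather than at values inherited from ancestors.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def display_text(elem):
    """Return the element's text, collapsing whitespace unless preserved."""
    text = elem.text or ""
    if elem.get(XML_SPACE) == "preserve":
        return text
    # Assumed default handling: collapse whitespace runs to one space.
    return " ".join(text.split())


code = ET.fromstring(
    '<listing xml:space="preserve">if x:\n    y = 1</listing>'
)
prose = ET.fromstring("<para>Hello,\n    world</para>")

print(repr(display_text(code)))
print(repr(display_text(prose)))
```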
Copyright © 2003 O'Reilly & Associates. All rights reserved.