XML in a Nutshell

Chapter 7. XML on the Web

Contents:

XHTML
Direct Display of XML in Browsers
Authoring Compound Documents with Modular XHTML
Prospects for Improved Web-Search Methods

XML began as an effort to bring the full power and structure of SGML to the Web in a form that was simple enough for nonexperts to use. Like most great inventions, XML turned out to have uses far beyond what its creators originally envisioned. Indeed, there's a lot more XML off the Web than on it. Nonetheless, XML is still a very attractive language in which to write and serve web pages. Since XML documents must be well-formed and parsers must reject malformed documents, XML pages are less likely to have annoying cross-browser incompatibilities. Since XML documents are highly structured, they're much easier for robots to parse. Since XML tag and attribute names reflect the nature of the content they hold, search-engine spiders can more easily determine the true meaning of a page.

XML on the Web comes in three flavors. The first is XHTML, an XMLized variant of HTML 4.0 that tightens up HTML to match XML's syntax. For instance, XHTML requires that every start-tag have a matching end-tag and that all attribute values be quoted. XHTML also adds a few bits of syntax to HTML, such as the XML declaration and empty-element tags that end with />. Most of XHTML can be displayed quite well in legacy browsers, with a few notable exceptions.

The second flavor of XML on the Web is direct display of XML documents that use arbitrary vocabularies in web browsers. Generally, the formatting of the document is supplied either by a CSS stylesheet or by an XSLT stylesheet that transforms the document into HTML (perhaps XHTML). This flavor requires an XML-aware browser and is only beginning to be supported by the installed base of web clients.

A third option is to mix raw XML vocabularies such as MathML and SVG with XHTML using Modular XHTML. Modular XHTML lets you embed RDF cataloging information, MathML equations, SVG pictures, and more inside your XHTML documents. Namespaces sort out which elements belong to which applications.
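
To make the role of namespaces concrete, here is a minimal Python sketch (Python is our choice for illustration; the chapter does not prescribe a language). It parses a small, made-up XHTML page that embeds a MathML equation, using the standard library's xml.etree.ElementTree module, and counts the elements that belong to each namespace. The sample page and the namespace_of and count_by_namespace helpers are hypothetical; only the two namespace URIs are the ones assigned to XHTML and MathML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# A made-up compound document: XHTML with one embedded MathML equation.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Area of a circle</title></head>
<body>
<p>The area of a circle is</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>A</mi><mo>=</mo><mi>&#960;</mi><msup><mi>r</mi><mn>2</mn></msup>
</math>
</body>
</html>"""

def namespace_of(tag):
    """ElementTree spells a namespaced tag as '{uri}localname'."""
    if tag.startswith("{"):
        return tag[1:].partition("}")[0]
    return ""

def count_by_namespace(markup):
    """Count how many elements in the document fall in each namespace."""
    counts = Counter()
    for element in ET.fromstring(markup).iter():
        counts[namespace_of(element.tag)] += 1
    return counts

counts = count_by_namespace(PAGE)
print("XHTML elements: ", counts[XHTML_NS])
print("MathML elements:", counts[MATHML_NS])
```

Because each math, mi, and mo element carries the MathML namespace URI, a processor can hand those elements to a MathML renderer and leave the rest to the XHTML engine, even though the two vocabularies share one document.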

7.1. XHTML

XHTML is an official W3C recommendation. It defines an XML-compatible version of HTML, or rather it redefines HTML as an XML application instead of as an SGML application. Just looking at an XHTML document, you might not even realize that there's anything different about it. It still uses the same <p>, <li>, <table>, <h1>, and other tags with which you're familiar. Elements and attributes have the same, familiar names they have in HTML. The syntax is still basically the same.

The difference is not so much what's allowed but what's not allowed. <p> is a legal XHTML tag, but <P> is not. <table border="0" width="515"> is legal XHTML; <table border=0 width=515> is not. A paragraph prefixed with a <p> and suffixed with a </p> is legal XHTML, but a paragraph that omits the closing </p> tag is not. Most existing HTML documents require substantial editing before they become well-formed and valid XHTML documents. However, once they are valid XHTML documents, they are automatically valid XML documents that can be manipulated with the same editors, parsers, and other tools you use to work with any XML document.

7.1.1. Moving from HTML to XHTML

Most of the changes required to turn an existing HTML document into an XHTML document involve making the document well-formed. For instance, given a legacy HTML document, you'll probably have to make at least some of these changes to turn it into XHTML:

  • Add missing end-tags like </p> and </li>.

  • Rewrite elements so that they nest rather than overlap. For example, change <p><em>an emphasized paragraph</p></em> to <p><em>an emphasized paragraph</em></p>.

  • Put double or single quotes around your attribute values. For example, change <p align=center> to <p align="center">.

  • Add values (which are the same as the name) to all minimized Boolean attributes. For example, change <input type="checkbox" checked> to <input type="checkbox" checked="checked">.

  • Replace any occurrences of & or < in character data or attribute values with &amp; and &lt;. For instance, change A&P to A&amp;P and <a href="http://www.google.com/search?client=googlet&q=Java%20XML"> to <a href="http://www.google.com/search?client=googlet&amp;q=Java%20XML">.

  • Make sure the document has a single root html element.

  • Change empty elements like <hr> to <hr/> or <hr></hr>.

  • Add hyphens to comments so that <! this is a comment> becomes <!-- this is a comment -->.

  • Encode the document in UTF-8 or UTF-16, or add an XML declaration that specifies in which character set it is encoded.
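
A quick way to see whether these fixes have taken hold is to feed the markup to any XML parser, since a conforming parser must reject anything that is not well-formed. The following sketch assumes Python and its standard xml.etree.ElementTree module; the is_well_formed helper and the PAIRS list are illustrative names of our own, with fragments adapted from the before-and-after examples in the list above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

# (legacy HTML, repaired XHTML) pairs adapted from the list above
PAIRS = [
    ('<p><em>an emphasized paragraph</p></em>',
     '<p><em>an emphasized paragraph</em></p>'),
    ('<p align=center>text</p>',
     '<p align="center">text</p>'),
    ('<input type="checkbox" checked>',
     '<input type="checkbox" checked="checked"/>'),
    ('<p>A&P</p>',
     '<p>A&amp;P</p>'),
    ('<div><hr></div>',
     '<div><hr/></div>'),
]

for legacy, repaired in PAIRS:
    print(is_well_formed(legacy), is_well_formed(repaired), repaired)
```

This checks well-formedness only. Whether the repaired markup is also valid XHTML is a separate question that requires validating against one of the XHTML DTDs.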

However, XHTML doesn't merely require well-formedness; it requires validity. In order to create a valid XHTML document, you'll need to make these changes as well:

  • Add a DOCTYPE declaration to the document pointing to one of the three XHTML DTDs.

  • Make all element and attribute names lowercase.

  • Make any other changes you have to make to your markup so that the document validates against the DTD: for example, eliminating nonstandard elements like marquee, adding required attributes like the alt attribute of img, or moving child elements out from inside elements where they're not allowed such as a blockquote inside a p.

In addition, the XHTML specification imposes several requirements that, strictly speaking, are necessary for neither well-formedness nor validity. However, they do make parsing XHTML documents a little easier. These are:

  • The root element of the document must be html.

  • There must be a DOCTYPE declaration that uses a PUBLIC ID to identify one of the three XHTML DTDs.

  • The root element of the document must have an xmlns attribute identifying the default namespace as http://www.w3.org/1999/xhtml.
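
The three requirements above are mechanical enough to check in code. The sketch below is one way to do so, assuming Python's standard xml.dom.minidom parser; the function name xhtml_spec_problems and the SAMPLE document are our own inventions for illustration. It does not validate against the DTD; it only inspects the root element, the DOCTYPE's public identifier, and the root element's namespace.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"
XHTML_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
}

def xhtml_spec_problems(markup):
    """List which of the three extra XHTML requirements a document fails."""
    # parseString raises an ExpatError if the markup is not well-formed.
    doc = minidom.parseString(markup)
    root = doc.documentElement
    problems = []
    if root.tagName != "html":
        problems.append("root element is not html")
    if doc.doctype is None or doc.doctype.publicId not in XHTML_PUBLIC_IDS:
        problems.append("no DOCTYPE with an XHTML public ID")
    if root.namespaceURI != XHTML_NS:
        problems.append("root element is not in the XHTML namespace")
    return problems

# A small hypothetical document that meets all three requirements.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body><p>Hello</p></body>
</html>"""

print(xhtml_spec_problems(SAMPLE))
print(xhtml_spec_problems("<HTML><BODY/></HTML>"))
```

An empty list means the document passes all three checks; each string in a nonempty list names one requirement that still needs attention.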

Finally, if you wish, you may--but do not have to--add an XML declaration or an xml-stylesheet processing instruction to the prolog of your document.

Example 7-1 shows an HTML document from the O'Reilly web site that exhibits many of the validity problems you'll find on the Web today. In fact, this is a much neater page than most. Nonetheless, not all attribute values are quoted. The noshade attribute of the HR element doesn't even have a value. There's no document type declaration. Tags are a mix of upper- and lowercase, mostly uppercase. The DD elements are missing end-tags, and there's some character data inside the second definition that's not part of a DT or a DD.

Example 7-1. A typical HTML document

<HTML><HEAD>
  <TITLE>O'Reilly Shipping Information</TITLE>
</HEAD>
<BODY BGCOLOR="#ffffff" VLINK="#0000CC" LINK="#990000" TEXT="#000000">
<table border=0 width=515>
<tr>
<td>
<IMG SRC="/www/graphics_new/generic_ora_header_wide.gif" BORDER=0>
<H2>U.S. Shipping Information </H2>
<HR size="1" align=left noshade>
<DL>
<DT> <B>UPS Ground Service (Continental US only -- 5-7 business
days):</B></DT>
<DD>
<PRE>
$  5.95 - $ 49.99 ......................... $ 4.50
$ 50.00 - $ 99.99 ......................... $ 6.50
$100.00 - $149.99 ......................... $ 8.50
$150.00 - $199.99 ......................... $10.50
$200.00 - $249.99 ......................... $12.50
$250.00 - $299.99 ......................... $14.50

</PRE>
<DT> <B>Federal Express:</B></DT>
(Shipping within 24 hours of receipt of order by O'Reilly)
<DD>
<PRE>
<EM>1 or 2 books</EM>:
Economy 2-day ............................. $ 8.75
Overnight Standard (Afternoon Delivery) ... $12.75
Overnight Priority (Morning Delivery) ..... $16.50
</PRE>
</DL>
<b>Alaska and Hawaii:</b> add $10 to Federal Express rates.
<P>
<A HREF="int-ship.html"><b>International Shipping Information</b></A>
<P>
<CENTER>
<HR SIZE="1" NOSHADE>
<FONT SIZE="1" FACE="Verdana, Arial, Helvetica">
<A HREF="http://www.oreilly.com/">
<B>O'Reilly Home</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/sales/bookstores">
<B>O'Reilly Bookstores</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/order_new/">
<B>How to Order</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/oreilly/contact.html">
<B>O'Reilly Contacts<BR></B></A>
<A HREF="http://www.oreilly.com/international/">
<B>International</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/oreilly/about.html">
<B>About O'Reilly</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/affiliates.html">
<B>Affiliated Companies</B></A><p>
<EM>&copy; 2000, O'Reilly &amp; Associates, Inc.</EM>
</FONT>
</CENTER>
</td>
</tr>
</table>

</BODY>
</HTML>

Example 7-2 shows this document after it's been converted to XHTML. All the previously noted problems and a few more besides have been fixed. A number of deprecated presentational attributes, such as the size and noshade attributes of hr, had to be replaced with CSS styles. We've also added the necessary document type and namespace declarations. This document can now be read by both HTML and XML browsers and parsers.

Example 7-2. A valid XHTML document

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<style type="text/css">
  body      {background-color: #FFFFFF; color: #000000}
  a:visited {color: #0000CC}
  a:link    {color: #990000}
</style>
<title>O'Reilly Shipping Information</title>
</head>
<body>
<table border="0" width="515">
<tr>
<td><img src="/www/graphics_new/generic_ora_header_wide.gif"
style="border-width: 0" alt="O'Reilly"/>
<h2>U.S. Shipping Information</h2>

<hr style="height: 1px; text-align: left"/>
<dl>
<dt><b>UPS Ground Service (Continental US only -- 5-7 business
days):</b></dt>

<dd>
<pre>
$  5.95 - $ 49.99 ......................... $ 4.50
$ 50.00 - $ 99.99 ......................... $ 6.50
$100.00 - $149.99 ......................... $ 8.50
$150.00 - $199.99 ......................... $10.50
$200.00 - $249.99 ......................... $12.50
$250.00 - $299.99 ......................... $14.50
</pre>
</dd>

<dt><b>Federal Express:</b></dt>

<dd>(Shipping within 24 hours of receipt of order by O'Reilly)</dd>

<dd>
<pre>
<em>1 or 2 books</em>:
Economy 2-day ............................. $ 8.75
Overnight Standard (Afternoon Delivery) ... $12.75
Overnight Priority (Morning Delivery) ..... $16.50

</pre>
</dd>
</dl>

<b>Alaska and Hawaii:</b> add $10 to Federal Express rates.

<p><a href="int-ship.html"><b>International Shipping
Information</b></a></p>

<div style="font-size: xx-small; font-family: Verdana, Arial, Helvetica;
            text-align: center">
<hr style="height: 1px"/>
<a
href="http://www.oreilly.com/"><b>O'Reilly Home</b></a> <b>|</b> <a
href="http://www.oreilly.com/sales/bookstores"><b>O'Reilly
Bookstores</b></a> <b>|</b> <a
href="http://www.oreilly.com/order_new/"><b>How to Order</b></a>
<b>|</b> <a href="http://www.oreilly.com/oreilly/contact.html"><b>
O'Reilly Contacts<br />
</b></a> <a href="http://www.oreilly.com/international/"><b>
International</b></a> <b>|</b> <a
href="http://www.oreilly.com/oreilly/about.html"><b>About
O'Reilly</b></a> <b>|</b> <a
href="http://www.oreilly.com/affiliates.html"><b>Affiliated
Companies</b></a></div>

<p style="font-size: xx-small;
          font-family: Verdana, Arial, Helvetica"><em>&copy; 2000,
O'Reilly &amp; Associates, Inc.</em></p>
</td>
</tr>
</table>
</body>
</html>

TIP: Making all these changes can be quite tedious for large documents or collections of many documents. Fortunately, there's an open source tool that can do most of the work for you. Dave Raggett's Tidy, http://tidy.sourceforge.net, is a C program that has been ported to most major operating systems and can convert some pretty nasty HTML into valid XHTML. For example, to convert the file bad.html to good.xml, you would type:

% tidy --output-xhtml yes bad.html > good.xml

Tidy fixes as much as it can and warns you about what it can't fix so you can fix it manually--for instance, telling you that a required alt attribute is missing from an img element.

7.1.2. Three DTDs for XHTML

XHTML comes in three flavors, depending on which DTD you choose:

Strict
This is the W3C's recommended form of XHTML. This includes all the basic elements and attributes such as p and class. However, it does not include deprecated elements and attributes such as applet and center. It also forbids the use of presentational attributes such as the body element's bgcolor, vlink, link, and text. These capabilities are provided by CSS instead. Strict XHTML is identified with this DOCTYPE declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "DTD/xhtml1-strict.dtd" >

Example 7-2 used this DTD.

Transitional
This is a looser form of XHTML for when you can't easily do without deprecated elements and attributes such as applet and bgcolor. It is identified with this DOCTYPE declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                      "DTD/xhtml1-transitional.dtd" >

Frameset
This is the same as the transitional DTD except that it also allows frame-related elements such as frameset and frame. It is identified with this DOCTYPE declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
                      "DTD/xhtml1-frameset.dtd" >

All three DTDs use the same http://www.w3.org/1999/xhtml namespace. You should choose the strict DTD unless you've got a specific reason to use another one.
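
If you generate pages from a program, choosing a flavor comes down to emitting the right DOCTYPE line. Here is a minimal Python sketch, assuming a helper of our own called doctype_declaration and a lookup table named XHTML_DTDS; the public and system identifiers are copied from the three declarations above. It defaults to the strict DTD, in keeping with the advice just given.

```python
# Public and system identifiers for the three XHTML 1.0 DTDs,
# copied from the DOCTYPE declarations shown in this section.
XHTML_DTDS = {
    "strict": ("-//W3C//DTD XHTML 1.0 Strict//EN",
               "DTD/xhtml1-strict.dtd"),
    "transitional": ("-//W3C//DTD XHTML 1.0 Transitional//EN",
                     "DTD/xhtml1-transitional.dtd"),
    "frameset": ("-//W3C//DTD XHTML 1.0 Frameset//EN",
                 "DTD/xhtml1-frameset.dtd"),
}

def doctype_declaration(flavor="strict"):
    """Build the DOCTYPE declaration for one XHTML flavor.

    Raises KeyError for a flavor other than the three defined above.
    """
    public_id, system_id = XHTML_DTDS[flavor]
    return '<!DOCTYPE html PUBLIC "%s" "%s">' % (public_id, system_id)

for flavor in XHTML_DTDS:
    print(doctype_declaration(flavor))
```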

7.1.3. Browser Support for XHTML

Many current web browsers, especially Internet Explorer 5.0 and earlier and Netscape 4.79 and earlier, deal inconsistently with XHTML. Certainly they don't require it, accepting as they do such a wide variety of malformed, invalid, and out-and-out mistaken HTML. However, beyond that they do have some problems when they encounter certain common XHTML constructs.



Copyright © 2002 O'Reilly & Associates. All rights reserved.