XML in a Nutshell

7.4. Prospects for Improved Web-Search Methods

Part of the hype surrounding XML has been that web search engines will finally understand what a document means by looking at its markup. For instance, you could search for the movie Sneakers and get back only hits about the movie, without having to sort through "Internet Wide Area 'Tiger Teamers' mailing list," "Children's Side Zip Sneakers Recalled by Reebok," "Infant's 'Little Air Jordan' Sneakers Recalled by NIKE," "Sneakers.com - Athletic shoes from Nike, Reebok, Adidas, Fila, New," and the 32,395 other results that Google pulled up on this search that had nothing to do with the movie.[6]

[6]In fairness to Google, four of the first ten hits it returned were about the movie.

In practice, this is still vapor, mostly because few web pages are available on the frontend in XML, even though more and more backends are XML. The search-engine robots only see the frontend HTML. As this slowly changes, and as the search engines get smarter, we should see more and more useful results. Meanwhile, you can add some XML hints to your pages that knowledgeable search engines can take advantage of. Three techniques for doing so are the Resource Description Framework (RDF), the Dublin Core, and the robots processing instruction.

7.4.1. RDF

The Resource Description Framework (RDF, http://www.w3.org/RDF/) can be understood as an XML encoding for a particularly simple data model. An RDF document describes resources. Each resource has zero or more properties. Each property has a name and a value. The value may itself be another resource.

The root element of an RDF document is an RDF element. Each resource the RDF element describes is represented as a Description element whose about attribute contains a URI or other identifier pointing to the resource described. Each child element of the Description element represents a property of the resource. The contents of that child element are the value of that property. All RDF elements, such as RDF and Description, are placed in the http://www.w3.org/1999/02/22-rdf-syntax-ns# namespace. Property elements generally come from other namespaces.

For example, suppose we want to say that the book XML in a Nutshell has the authors W. Scott Means and Elliotte Rusty Harold. In other words, we want to say that the resource identified by the URI urn:isbn:0596002920 has one author property with the value "W. Scott Means" and another author property with the value "Elliotte Rusty Harold." Example 7-10 does this.

Example 7-10. A simple RDF document saying that W. Scott Means and Elliotte Rusty Harold are the authors of XML in a Nutshell

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <rdf:Description about="urn:isbn:0596002920">
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
  </rdf:Description>

</rdf:RDF>

In this simple example the values of the author properties are merely text. However, they could be XML as well. Indeed, they could be other RDF elements.
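The resource/property/value model can be made concrete by reading Example 7-10 back into a plain data structure. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the function name read_descriptions and the dictionary layout are illustrative assumptions, not part of RDF, and a real RDF parser handles far more of the syntax than this does.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The document from Example 7-10, with blank lines removed.
EXAMPLE_7_10 = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description about="urn:isbn:0596002920">
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
  </rdf:Description>
</rdf:RDF>
"""

def read_descriptions(rdf_text):
    """Return {resource identifier: [(property name, text value), ...]}.

    Hypothetical helper: it only handles properties whose values are text.
    """
    root = ET.fromstring(rdf_text)
    resources = {}
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        # The example uses an unprefixed about attribute; also accept rdf:about.
        about = description.get("about") or description.get(f"{{{RDF_NS}}}about")
        properties = resources.setdefault(about, [])
        for child in description:
            # Each child element is one property; its text is the value.
            properties.append((child.tag, (child.text or "").strip()))
    return resources

descriptions = read_descriptions(EXAMPLE_7_10)
for resource, properties in descriptions.items():
    for name, value in properties:
        print(resource, name, value)
```

Each printed line is one statement of the form "this resource has this property with this value," which is all the simple data model amounts to.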

There's more to RDF, including containers, schemas, and nested properties. However, this description is sufficient for web metadata.

7.4.2. Dublin Core

The Dublin Core, http://purl.org/dc/, is a standard set of 15 information items with specified semantics that reflect the sort of data you'd be likely to find in a card catalog or annotated bibliography. These are:

Title
Fairly self-explanatory; this is the name by which the resource is known. For instance, the title of this book is "XML in a Nutshell."

Creator
The person or organization who created the resource, e.g., a painter, author, illustrator, composer, and so on. For instance, the creators of this book are W. Scott Means and Elliotte Rusty Harold.

Subject
A list of keywords, very likely from some other vocabulary such as the Dewey Decimal System or Yahoo categories, identifying the topics of the resource. For instance, using the Library of Congress Subject Headings vocabulary, the subject of this book is "XML (Document markup language)."

Description
Typically, a brief amount of text describing the content of the resource in prose, but it may also include a picture, a table of contents, or any other description of the resource. For instance, a description of this book might be "A brief tutorial on and quick reference to XML and related technologies and specifications."

Publisher
The name of the person, company, or organization who makes the resource available. For instance, the publisher of this book is "O'Reilly & Associates."

Contributor
A person or organization who made some contribution to the resource but is not the primary creator of the resource. For example, the editors of this book, Laurie Petrycki, Simon St.Laurent, and Jeni Tennison, might be identified as contributors, as would Susan Hart, the artist who drew the picture on the cover.

Date
The date when the resource was created or published, normally given in the form YYYY-MM-DD. For instance, this book's date might be 2002-05-23.

Type
The abstract kind of resource such as image, text, sound, or software. For instance, a description of this book would have the type text.

Format
For hard objects like books, the physical dimensions of the resource. For instance, the paper version of XML in a Nutshell has the dimensions 6" x 9". For digital objects like web pages, this is possibly the MIME media type. For instance, an online version of this book would have the Format text/html.

Identifier
A formal identifier for the resource, such as an ISBN number, a URI, or a Social Security number. This book's identifier is "0596002920."

Source
The resource from which the present resource was derived. For instance, the French translation of this book might reference the original English edition as its source.

Language
The language in which this resource is written, typically an ISO-639 language code, optionally suffixed with a hyphen and an ISO-3166 country code. For instance, the language for this book is en-US. The language for the French translation of this book might be fr-FR.

Relation
A reference to a resource that is in some way related to the current one, generally using a formal identifier, such as a URI or an ISBN number. For instance, this might refer to the web page for this book.

Coverage
The location, time, or jurisdiction the resource covers. For instance, the coverage of this book might be the U.S., Canada, Australia, the U.K., and Ireland. The coverage of the French translation of this book might be France, Canada, Haiti, Belgium, and Switzerland. Generally these will be listed in some formal syntax such as country codes.

Rights
Information about copyright, patent, trademark and other restrictions on the content of the resource. For instance, a rights statement about this book may say "Copyright 2002 O'Reilly & Associates."

Dublin Core can be encoded in a variety of forms including HTML META tags and RDF. Here we concentrate on its encoding in RDF. Typically, each resource is described with an rdf:Description element. This element contains child elements for as many of the Dublin Core information items as are known about the resource. The name of each of these elements matches the name of one of the 15 Dublin Core properties. These are placed in the http://purl.org/dc/elements/1.1/ namespace. Example 7-11 shows an RDF-encoded Dublin Core description of this book.

Example 7-11. An RDF-encoded Dublin Core description for XML in a Nutshell

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">

  <rdf:Description about="urn:isbn:0596002920">
    <dc:Title>XML in a Nutshell</dc:Title>
    <dc:Creator>W. Scott Means</dc:Creator>
    <dc:Creator>Elliotte Rusty Harold</dc:Creator>
    <dc:Subject>XML (Document markup language)</dc:Subject>
    <dc:Description>
      A brief tutorial on and quick reference to XML and
      related technologies and specifications
    </dc:Description>
    <dc:Publisher>O'Reilly &amp; Associates</dc:Publisher>
    <dc:Contributor>Laurie Petrycki</dc:Contributor>
    <dc:Contributor>Simon St.Laurent</dc:Contributor>
    <dc:Contributor>Jeni Tennison</dc:Contributor>
    <dc:Contributor>Susan Hart</dc:Contributor>
    <dc:Date>2002-05-23</dc:Date>
    <dc:Type>text</dc:Type>
    <dc:Format>6" x 9"</dc:Format>
    <dc:Identifier>0596002920</dc:Identifier>
    <dc:Language>en-US</dc:Language>
    <dc:Relation>http://www.oreilly.com/catalog/xmlnut/</dc:Relation>
    <dc:Coverage>US UK ZA CA AU NZ</dc:Coverage>
    <dc:Rights>Copyright 2002 O'Reilly &amp; Associates</dc:Rights>
  </rdf:Description>

</rdf:RDF>
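A description like this one does not have to be written by hand. As a minimal sketch, assuming the metadata is already available as a list of name/value pairs, the following Python code builds an equivalent rdf:RDF element with the standard library's xml.etree.ElementTree. The function name dublin_core_rdf and its parameters are hypothetical; only the two namespace URIs come from the text above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

def dublin_core_rdf(about, items):
    """Build an rdf:RDF element describing a single resource.

    items is a list of (Dublin Core item name, value) pairs. A name may
    appear more than once, as Creator and Contributor do in Example 7-11.
    """
    # Ask ElementTree to serialize with the conventional rdf and dc prefixes.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # Unprefixed about attribute, matching the examples in this section.
    description = ET.SubElement(root, f"{{{RDF_NS}}}Description", {"about": about})
    for name, value in items:
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = value
    return root

root = dublin_core_rdf("urn:isbn:0596002920", [
    ("Title", "XML in a Nutshell"),
    ("Creator", "W. Scott Means"),
    ("Creator", "Elliotte Rusty Harold"),
    ("Publisher", "O'Reilly & Associates"),
])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Note that the ampersand in the publisher's name is passed in as a plain character; the serializer writes it out as &amp;, just as it appears in Example 7-11.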

There is as yet no standard for how an RDF document should be associated with the XML document it describes. One possibility is for the rdf:RDF element to be embedded in the document it describes, for instance, as a child of the BookInfo element of the DocBook source for this book. Another possibility is that servers provide this meta information through an extra-document channel. For instance, a standard protocol could be defined that would allow search engines to request this information for any page on the site. A convention could be adopted so that for any URL xyz on a given web site, the URL xyz/meta.rdf would contain the RDF-encoded Dublin Core metadata for that URL.
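To make the last suggestion concrete, here is a minimal Python sketch of how a robot might compute the metadata address under the xyz/meta.rdf convention. The convention itself is only a possibility raised above, not a standard, and the function name metadata_url, the example.com addresses, and the decision to drop any query string or fragment are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def metadata_url(page_url):
    """Return the hypothetical metadata URL for page_url.

    Follows the xyz -> xyz/meta.rdf convention sketched in the text.
    Query strings and fragments are discarded (an assumption).
    """
    parts = urlsplit(page_url)
    # Avoid a doubled slash when the page URL already ends with one.
    path = parts.path.rstrip("/") + "/meta.rdf"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))

print(metadata_url("http://www.example.com/books/xmlnut"))
print(metadata_url("http://www.example.com/"))
```

A robot following this convention would request the computed URL and, if the server returned an RDF document, read the Dublin Core properties from it.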

7.4.3. Robots

In HTML the robots META tag tells search engines and other robots whether they're allowed to index a page. Walter Underwood has proposed the following processing instruction as an equivalent for XML documents:

<?robots index="yes" follow="no"?>

Robots that support this proposal look for the processing instruction in the prolog of any XML document they encounter. Its syntax is two pseudo-attributes, one named index and one named follow, whose values are either yes or no. If index has the value yes, then a search-engine robot may index the page. If index has the value no, then it may not. Similarly, if follow has the value yes, then the robot may follow links from this document. If follow has the value no, then it may not.
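A robot's side of this can be sketched in a few lines of Python. The following minimal example uses the standard library's xml.dom.minidom, which keeps processing instructions in the document tree, and a regular expression to pick apart the pseudo-attributes. The function name robots_policy is hypothetical, and treating a missing instruction or pseudo-attribute as yes is an assumption, not something the proposal as described here specifies.

```python
import re
from xml.dom import minidom

# Matches name="value" or name='value' inside the processing instruction data.
PSEUDO_ATTRIBUTE = re.compile(r"""(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

def robots_policy(xml_text):
    """Return (index, follow) booleans from a robots PI in the prolog.

    Both default to True when the instruction or a pseudo-attribute is
    absent; that default is an assumption.
    """
    policy = {"index": True, "follow": True}
    document = minidom.parseString(xml_text)
    for node in document.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            break  # The prolog ends where the root element begins.
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "robots"):
            for name, double_quoted, single_quoted in PSEUDO_ATTRIBUTE.findall(node.data):
                if name in policy:
                    policy[name] = (double_quoted or single_quoted) == "yes"
    return policy["index"], policy["follow"]

sample = '<?robots index="yes" follow="no"?><page><a href="next.xml">Next</a></page>'
print(robots_policy(sample))
```

For the sample document, the robot may index the page but should not follow the link to next.xml.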




Copyright © 2002 O'Reilly & Associates. All rights reserved.