Book HomePerl & XML

3.7. Document Validation

Being well-formed is a minimal requirement for XML everywhere. However, XML processors have to accept a lot on blind faith. If we try to build a document to meet some specific XML application's specifications, it doesn't do us any good if a content generator slips in a strange element we've never seen before and the parser lets it go by with nary a whimper. Luckily, a higher level of quality control is available to us when we need to check for things like that. It's called document validation.

Validation is a sophisticated way of comparing a document instance against a template or grammar specification. It can restrict the number and type of elements a document can use and control where they go. It can even regulate the patterns of character data in any element or attribute. A validating parser tells you whether a document is valid or not, when given a DTD or schema to check against.

Remember that you don't need to validate every XML document that passes over your desk. DTDs and other validation schemes shine when working with specific XML-based markup languages (such as XHTML for web pages, MathML for equations, or CaveML for spelunking), which have strict rules about which elements and attributes go where (because having an automated way to draw attention to something fishy in the document structure becomes a feature).

However, validation usually isn't crucial when you use Perl and XML to perform a less specific task, such as tossing together XML documents on the fly based on some other, less sane data format, or when ripping apart and analyzing existing XML documents.

Basically, if you feel that validation is a needless step for the job at hand, you're probably right. However, if you knowingly generate or modify some flavor of XML that needs to stick to a defined standard, then taking the extra step or three necessary to perform document validation is probably wise. Your toolbox, naturally, gives you lots of ways to do this. Read on.

3.7.1. DTDs

Document type descriptions (DTDs) are documents written in a special markup language defined in the XML specification, though they themselves are not XML. Everything within these documents is a declaration starting with a <! delimiter and comes in four flavors: elements, attributes, entities, and notations.

Example 3-8 is a very simple DTD.

Example 3-8. A wee little DTD

<!ELEMENT memo (to, from, message)>
<!ATTLIST memo priority (urgent|normal|info) 'normal'>
<!ENTITY % text-only "(#PCDATA)*">
<!ELEMENT to %text-only;>
<!ELEMENT from %text-only;>
<!ELEMENT message (#PCDATA | emphasis)*>
<!ELEMENT emphasis %text-only;>
<!ENTITY myname "Bartholomus Chiggin McNugget">

This DTD declares five elements, an attribute for the <memo> element, a parameter entity to make other declarations cleaner, and an entity that can be used inside a document instance. Based on this information, a validating parser can reject or approve a document. The following document would pass muster:

<!DOCTYPE memo SYSTEM "/dtdstuff/memo.dtd">
<memo priority="info">
  <to>Sara Bellum</to>
  <from>&myname;</from>
  <message>Stop reading memos and get back to work!</message>
</memo>

If you removed the <to> element from the document, it would suddenly become invalid. A well-formedness checker wouldn't give a hoot about missing elements. Thus, you see the value of validation.

Because DTDs are so easy to parse, some general XML processors include the ability to validate the documents they parse against DTDs. XML::LibXML is one such parser. A very simple validating parser is shown in Example 3-9.

Example 3-9. A validating parser

use XML::LibXML;
use IO::Handle;

# initialize the parser
my $parser = new XML::LibXML;

# open a filehandle and parse
my $fh = new IO::Handle;
if( $fh->fdopen( fileno( STDIN ), "r" )) {
    my $doc = $parser->parse_fh( $fh );
    if( $doc and $doc->is_valid ) {
        print "Yup, it's valid.\n";
    } else {
        print "Yikes! Validity error.\n";
    }
    $fh->close;
}

This parser would be simple to add to any program that requires valid input documents. Unfortunately, it doesn't give any information about what specific problem makes it invalid (e.g., an element in an improper place), so you wouldn't want to use it as a general-purpose validity checking tool.[19] T. J. Mather's XML::Checker is a better module for reporting specific validation errors.

[19]The authors prefer to use a command-line tool called nsgmls available from http://www.jclark.com/. Public web sites, such as http://www.stg.brown.edu/service/xmlvalid/, can also validate arbitrary documents. Note that, in these cases, the XML document must have a DOCTYPE declaration, whose system identifier (if it has one) must contain a resolvable URL and not a path on your local system.

3.7.2. Schemas

DTDs have limitations; they aren't able to check what kind of character data is in an element and if it matches a particular pattern. What if you wanted a parser to tell you if a <date> element has the wrong format for a date, or if it contains a street address by mistake? For that, you need a solution such as XML Schema. XML Schema is a second generation of DTD and brings more power and flexibility to validation.

As noted in Chapter 2, "An XML Recap", XML Schema enjoys the dubious distinction among the XML-related W3C specification family for being the most controversial schema (at least among hackers). Many people like the concept of schemas, but many don't approve of the XML Schema implementation, which is seen as too cumbersome or constraining to be used effectively.

Alternatives to XML Schema include OASIS-Open's RelaxNG (http://www.oasis-open.org/committees/relax-ng/) and Rick Jelliffe's Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html). Like XML Schema, these specifications detail XML-based languages used to describe other XML-based languages and let a program that knows how to speak that schema use it to validate other XML documents. We find Schematron particularly interesting because it has had a Perl module attached to it for a while (in the form of Kip Hampton's XML::Schematron family).

Schematron is especially interesting to many Perl and XML hackers because it builds on existing popular XML technologies that already have venerable Perl implementations. Schematron defines a very simple language with which you list and group together assertions of what things should look like based on XPath expressions. Instead of a forward-looking grammar that must list and define everything that can possibly appear in the document, you can choose to validate a fraction of it. You can also choose to have elements and attributes validate based on conditions involving anything anywhere else in the document (wherever an XPath expression can reach). In practice, a Schematron document looks and feels like an XSLT stylesheet, and with good reason: it's intended to be fully implementable by way of XSLT. In fact, two of the XML::Schematron Perl modules work by first transforming the user-specified schema document into an XSLT sheet, which it then simply passes through an XSLT processor.

Schematron lacks any kind of built-in data typing, so you can't, for example, do a one-word check to insist that an attribute conforms to the W3C date format. You can, however, have your Perl program make a separate step using any method you'd like (perhaps through the XML::XPath module) to come through date attributes and run a good old Perl regular expression on them. Also note that no schema language will ever provide a way to query an element's content against a database, or perform any other action outside the realm of the document. This is where mixing Perl and schemas can come in very handy.



Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.