Perl & XML

5.7. XML::SAX: The Second Generation

The proliferation of SAX parsers presents two problems: how to keep them all synchronized with the standard API and how to keep them organized on your system. XML::SAX, a marvelous team effort by Matt Sergeant, Kip Hampton, and Robin Berjon, solves both problems at once. As a bonus, it also includes support for SAX Level 2 that previous modules lacked.

"What," you ask, "do you mean about keeping all the modules synchronized with the API?" All along, we've touted the wonders of using a standard like SAX to ensure that modules are really interchangeable. But here's the rub: in Perl, there's more than one way to implement SAX. SAX was originally designed for Java, which has a wonderful interface type of class that nails down things like what type of argument to pass to which method. There's nothing like that in Perl.

This wasn't as much of a problem with the older SAX modules we've been talking about so far. They all support SAX Level 1, which is fairly simple. However, a new crop of modules that support SAX2 is breaking the surface. SAX2 is more complex because it introduces namespaces to the mix. An element event handler should receive both the namespace prefix and the local name of the element. How should this information be passed in parameters? Do you keep them together in the same string like foo:bar? Or do you separate them into two parameters?

This debate created a lot of heat on the perl-xml mailing list until a few members decided to hammer out a specification for "Perlish" SAX (we'll see in a moment how to use this new API for SAX2). To encourage others to adhere to this convention, XML::SAX includes a class called XML::SAX::ParserFactory. A factory is an object whose sole purpose is to generate objects of a specific type -- in this case, parsers. XML::SAX::ParserFactory is a useful way to handle housekeeping chores related to the parsers, such as registering their options and initialization requirements. Tell the factory what kind of parser you want and it doles out a copy to you.

XML::SAX represents a shift in the way XML and Perl work together. It builds on the work of the past, including all the best features of previous modules, while avoiding many of the mistakes. To ensure that modules are truly compatible, the kit provides a base class for parsers, abstracting out most of the mundane work that all parsers have to do, leaving the developer the task of doing only what is unique to the task. It also creates an abstract interface for users of parsers, allowing them to keep the plethora of modules organized with a registry that is indexed by properties to make it easy to find the right one with a simple query. It's a bold step and carries a lot of heft, so be prepared for a lot of information and detail in this section. We think it will be worth your while.

5.7.1. XML::SAX::ParserFactory

We start with the parser selection interface, XML::SAX::ParserFactory. For those of you who have used DBI, this class is very similar. It's a front end to all the SAX parsers on your system. You simply request a new parser from the factory and it will dig one up for you. Let's say you want to use any SAX parser with your handler package XML::SAX::MyHandler.

Here's how to fetch the parser and use it to read a file:

use XML::SAX::ParserFactory;
use XML::SAX::MyHandler;
my $handler = new XML::SAX::MyHandler;
my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse_uri( "foo.xml" );

The parser you get depends on the order in which you've installed the modules. By default, the factory returns the most recently installed one that offers every feature listed in RequiredFeatures, if any were specified. But maybe you don't want that one. No problem; XML::SAX maintains a registry of SAX parsers that you can choose from. Every time you install a new SAX parser, it registers itself so you can call upon it with ParserFactory. If you know you have the XML::SAX::BobsParser parser installed, you can require an instance of it by setting the variable $XML::SAX::ParserPackage as follows:

use XML::SAX::ParserFactory;
use XML::SAX::MyHandler;
my $handler = new XML::SAX::MyHandler;
$XML::SAX::ParserPackage = "XML::SAX::BobsParser( 1.24 )";
my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );

Setting $XML::SAX::ParserPackage to XML::SAX::BobsParser( 1.24 ) returns an instance of the package. Internally, ParserFactory is require( )-ing that parser and calling its new( ) class method. The 1.24 in the variable setting specifies a minimum version number for the parser. If that version isn't on your system, an exception will be thrown.

To see a list of all the parsers available to XML::SAX, call the parsers( ) method:

use XML::SAX;

my @parsers = @{XML::SAX->parsers( )};

foreach my $p ( @parsers ) {
    print "\n", $p->{ Name }, "\n";
    foreach my $f ( sort keys %{$p->{ Features }} ) {
        print "$f => ", $p->{ Features }->{ $f }, "\n";
    }
}

It returns a reference to a list of hashes, with each hash containing information about a parser, including the name and a hash of features. When we ran the program above we were told that XML::SAX had two registered parsers, each supporting namespaces:

XML::LibXML::SAX::Parser
http://xml.org/sax/features/namespaces => 1

XML::SAX::PurePerl
http://xml.org/sax/features/namespaces => 1

At the time this book was written, these parsers were the only two parsers included with XML::SAX. XML::LibXML::SAX::Parser is a SAX API for the libxml2 library we use in Chapter 6, "Tree Processing". To use it, you'll need to have libxml2, a compiled, dynamically linked library written in C, installed on your system. It's fast, but unless you can find a binary or compile it yourself, it isn't very portable. XML::SAX::PurePerl is, as the name suggests, a parser written completely in Perl. As such, it's completely portable because you can run it wherever Perl is installed. This starter set of parsers already gives you some different options.

The feature list associated with each parser is important because it allows a user to select a parser based on a set of criteria. For example, suppose you wanted a parser that did validation and supported namespaces. You could request one by calling the factory's require_feature( ) method:

my $factory = new XML::SAX::ParserFactory;
$factory->require_feature( 'http://xml.org/sax/features/validation' );
$factory->require_feature( 'http://xml.org/sax/features/namespaces' );
my $parser = $factory->parser( Handler => $handler );

Alternatively, you can pass such information to the factory in its constructor method:

my $factory = new XML::SAX::ParserFactory(
             RequiredFeatures => {
                    'http://xml.org/sax/features/validation' => 1,
                    'http://xml.org/sax/features/namespaces' => 1
             }
);
my $parser = $factory->parser( Handler => $handler );

If multiple parsers pass the test, the most recently installed one is used. However, if the factory can't find a parser to fit your requirements, it simply throws an exception.
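The selection rule is easier to picture with a small stand-alone sketch. It doesn't call XML::SAX at all: the @registry list and the pick_parser( ) helper are hypothetical, modeled on the name-plus-features records that parsers( ) returns, and the factory's real implementation may well differ in its details.

```perl
use strict;
use warnings;

# Hypothetical registry, in installation order (oldest first).
my @registry = (
    { Name     => 'XML::SAX::PurePerl',
      Features => { 'http://xml.org/sax/features/namespaces' => 1 } },
    { Name     => 'XML::LibXML::SAX::Parser',
      Features => { 'http://xml.org/sax/features/namespaces' => 1 } },
);

# Return the name of the most recently registered parser that
# supports every required feature, or die if none qualifies.
sub pick_parser {
    my( $registry, @required ) = @_;
    foreach my $p ( reverse @$registry ) {
        my $ok = 1;
        foreach my $f ( @required ) {
            $ok = 0 unless $p->{ Features }->{ $f };
        }
        return $p->{ Name } if $ok;
    }
    die "no parser supports the required features\n";
}

print pick_parser( \@registry,
    'http://xml.org/sax/features/namespaces' ), "\n";

# Neither entry advertises validation, so this request fails.
my $found = eval { pick_parser( \@registry,
    'http://xml.org/sax/features/validation' ) };
print "validation: ", ( defined $found ? $found : "none registered" ), "\n";
```

Wrapping the second request in eval keeps the script alive when no parser qualifies, which is how you would guard a real call to the factory as well.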

To add more SAX modules to the registry, you only need to download and install them. Their installer packages should know about XML::SAX and automatically register the modules with it. To add a module of your own, you can use XML::SAX's add_parser( ) with a list of module names. Make sure it follows the conventions of SAX modules by subclassing XML::SAX::Base. Later, we'll show you how to write a parser, install it, and add it to the registry.

5.7.2. SAX2 Handler Interface

Once you've selected a parser, the next step is to code up a handler package to catch the parser's event stream, much like the SAX modules we've seen so far. XML::SAX specifies events and their properties in exquisite detail and in large numbers. This specification gives your handler considerable control while ensuring absolute conformance to the API.

The types of supported event handlers fall into several groups. The most familiar are the content handlers, which cover elements and general document information; the entity resolvers; and the lexical handlers, which handle CDATA sections and comments. DTD handlers and declaration handlers take care of everything outside of the document element, including element and entity declarations. XML::SAX adds a new group, the error handlers, to catch and process any exceptions that may occur during parsing.

One important new facet to this class of parsers is that they recognize namespaces. This recognition is one of the innovations of SAX2. Previously, SAX parsers treated a qualified name as a single unit: a combined namespace prefix and local name. Now you can tease out the namespaces, see where their scope begins and ends, and do more than you could before.

5.7.2.1. Content event handlers

Focusing on the content of the document, these handlers are the most likely ones to be implemented in a SAX handling program. Note the useful addition of a document locator reference, which gives the handler a special window into the machinations of the parser. The support for namespaces is also new.

set_document_locator( locator )

The parser calls this method at the beginning of parsing to tell the handler where the events are coming from. The locator parameter is a reference to a hash containing these properties:

PublicID

The public identifier of the current entity being parsed.

SystemID

The system identifier of the current entity being parsed.

LineNumber

The line number of the current entity being parsed.

ColumnNumber

The last position in the line currently being parsed.

The hash is continuously updated with the latest information. If your handler doesn't like the information it's being fed and decides to abort, it can check the locator to construct a meaningful message to the user about where in the source document an error was found. A SAX parser isn't required to give a locator, though it is strongly encouraged to do so. You should check to make sure that you have a locator before trying to access it. Don't try to use the locator except inside an event handler, or you'll get unpredictable results.
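Here is a minimal sketch of a handler that follows this advice. It is driven by hand rather than by a real parser, so the locator hash, the LocatorAwareHandler package, and the "forbidden" element name are all made up for illustration.

```perl
use strict;
use warnings;

{
    package LocatorAwareHandler;

    sub new { return bless {}, shift }

    # Keep the reference; the parser updates the hash as it goes.
    sub set_document_locator {
        my( $self, $locator ) = @_;
        $self->{ Locator } = $locator;
    }

    # Build a position string, coping with a missing locator.
    sub where {
        my $self = shift;
        my $loc = $self->{ Locator } or return "unknown location";
        return sprintf "%s, line %d, column %d",
            $loc->{ SystemID }, $loc->{ LineNumber }, $loc->{ ColumnNumber };
    }

    # Abort with a located message when an unwanted element shows up.
    sub start_element {
        my( $self, $element ) = @_;
        die "unexpected <" . $element->{ Name } . "> at "
            . $self->where . "\n"
            if $element->{ Name } eq 'forbidden';
    }
}

my $handler = LocatorAwareHandler->new;
my $locator = { SystemID => 'foo.xml', LineNumber => 1, ColumnNumber => 1 };
$handler->set_document_locator( $locator );

# A parser would update the hash in place before each event.
@$locator{ qw( LineNumber ColumnNumber ) } = ( 12, 7 );
eval { $handler->start_element({ Name => 'forbidden' }) };
my $message = $@;
print $message;
```

Because the handler stores the reference rather than a copy, where( ) always reports the position of the event currently being handled.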

start_document( document )

This handler routine is called right after set_document_locator( ), just as parsing on a document begins. The parameter, document, is an empty reference, as there are no properties for this event.

end_document( document )

This is the last handler method called. If the parser has reached the end of input or has encountered an error and given up, it sends notification of this event. The return value for this method is used as the value returned by the parser's parse( ) method. Again, the document parameter is empty.

start_element( element )

Whenever the parser encounters a new element start tag, it calls this method. The parameter element is a hash containing properties of the element, including:

Name

The string containing the name of the element, including its namespace prefix.

Attributes

The hash of attributes, in which each key is encoded as {NamespaceURI}LocalName. The value of each item in the hash is a hash of attribute properties.

NamespaceURI

The element's namespace.

Prefix

The prefix part of the qualified name.

LocalName

The local part of the qualified name.

Properties for attributes include:

Name

The qualified name (prefix + local).

Value

The attribute's value, normalized (leading and trailing spaces are removed).

NamespaceURI

The source of the namespace.

Prefix

The prefix part of the qualified name.

LocalName

The local part of the qualified name.

The properties NamespaceURI, LocalName, and Prefix are given only if the parser supports the namespaces feature.
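To make the shape of this hash concrete, the following sketch builds one by hand and walks through it. The describe_element( ) helper and the sample element are our own inventions; a real parser would supply the hash for you.

```perl
use strict;
use warnings;

# Summarize a start_element event hash as text.
sub describe_element {
    my $element = shift;
    my $text = "element " . $element->{ Name };
    # These keys are present only with a namespace-aware parser.
    if( defined $element->{ LocalName } ) {
        $text .= " (local name " . $element->{ LocalName }
               . ", namespace " . ( $element->{ NamespaceURI } || 'none' )
               . ")";
    }
    # Attribute keys take the form {NamespaceURI}LocalName.
    foreach my $key ( sort keys %{ $element->{ Attributes } || {} } ) {
        my $attr = $element->{ Attributes }->{ $key };
        $text .= "\n  attribute $key = " . $attr->{ Value };
    }
    return $text;
}

# A hand-built event, shaped like the hash described above.
my $event = {
    Name         => 'svg:use',
    Prefix       => 'svg',
    LocalName    => 'use',
    NamespaceURI => 'http://www.w3.org/2000/svg',
    Attributes   => {
        '{http://www.w3.org/1999/xlink}href' => {
            Name         => 'xlink:href',
            Prefix       => 'xlink',
            LocalName    => 'href',
            NamespaceURI => 'http://www.w3.org/1999/xlink',
            Value        => '#icon',
        },
    },
};
print describe_element( $event ), "\n";
```

Testing for LocalName before using it lets the same handler code work with parsers that don't support the namespaces feature.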

end_element( element )

After all the content is processed and an element's end tag has come into view, the parser calls this method. It is even called for empty elements. The parameter element is a hash containing these properties:

Name

The string containing the element's name, including its namespace prefix.

NamespaceURI

The element's namespace.

Prefix

The prefix part of the qualified name.

LocalName

The local part of the qualified name.

The properties NamespaceURI, LocalName, and Prefix are given only if the parser supports the namespaces feature.

characters( characters )

The parser calls this method whenever it finds a chunk of plain text (character data). It might break up a chunk into pieces and deliver each piece separately, but the pieces must always be sent in the same order as they were read. Within a piece, all text must come from the same source entity. The characters parameter is a hash containing one property, Data, which is a string containing the characters from the document.
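Since a single run of text may arrive in several pieces, a handler that needs the whole string has to collect the pieces itself. The sketch below shows one way to do that; the TextCollector package is hypothetical, and the events are issued by hand in place of a parser.

```perl
use strict;
use warnings;

{
    package TextCollector;

    sub new { return bless { Buffer => '', Texts => [] }, shift }

    # Start each element with an empty buffer.
    sub start_element { $_[0]->{ Buffer } = '' }

    # Append; one run of text may be delivered in several calls.
    sub characters {
        my( $self, $characters ) = @_;
        $self->{ Buffer } .= $characters->{ Data };
    }

    # Only at the end tag is the text known to be complete.
    sub end_element {
        my( $self, $element ) = @_;
        push @{ $self->{ Texts } },
             $element->{ Name } . ": " . $self->{ Buffer };
        $self->{ Buffer } = '';
    }
}

# Simulate a parser that splits "Perl & XML" into two chunks.
my $collector = TextCollector->new;
$collector->start_element({ Name => 'title' });
$collector->characters({ Data => 'Perl ' });
$collector->characters({ Data => '& XML' });
$collector->end_element({ Name => 'title' });
print "$_\n" foreach @{ $collector->{ Texts } };
```

A single buffer like this one suits elements that contain only text. Mixed content would need a stack of buffers, one per open element.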

ignorable_whitespace( characters )

The term ignorable whitespace is used to describe space characters that appear in places where the element's content model declaration doesn't specifically call for character data. In other words, the newlines often used to make XML more readable by spacing elements apart can be ignored because they aren't really content in the document. A parser can tell if whitespace is ignorable only by reading the DTD, and it would do that only if it supports the validation feature. (If you don't understand this, don't worry; it's not important to most people.) The characters parameter is a hash containing one property, Data, containing the document's whitespace characters.

start_prefix_mapping( mapping )

This method is called when the parser detects a namespace coming into scope. For parsers that are not namespace-aware, this event is skipped, but element and attribute names still include the namespace prefixes. This event always occurs before the start of the element for which the scope holds. The parameter mapping is a hash with these properties:

Prefix

The namespace prefix.

NamespaceURI

The URI that the prefix maps to.

end_prefix_mapping( mapping )

This method is called when a namespace scope closes. This routine's parameter mapping is a hash with one property:

Prefix

The namespace prefix.

This event is guaranteed to come after the end element event for the element in which the scope is declared.
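A handler that wants to know which URI a prefix stands for at any moment can keep a stack per prefix, pushing on the start event and popping on the end event. This is only a sketch: ScopeTracker and its uri_for( ) method are hypothetical names, and the mapping events are issued by hand.

```perl
use strict;
use warnings;

{
    package ScopeTracker;

    sub new { return bless { Scopes => {} }, shift }

    # Push the URI; an inner declaration shadows an outer one.
    sub start_prefix_mapping {
        my( $self, $mapping ) = @_;
        push @{ $self->{ Scopes }->{ $mapping->{ Prefix } } },
             $mapping->{ NamespaceURI };
    }

    # Pop it again, uncovering any outer declaration.
    sub end_prefix_mapping {
        my( $self, $mapping ) = @_;
        pop @{ $self->{ Scopes }->{ $mapping->{ Prefix } } };
    }

    # The URI currently bound to a prefix, or undef if none is.
    sub uri_for {
        my( $self, $prefix ) = @_;
        my $stack = $self->{ Scopes }->{ $prefix } || [];
        return $stack->[-1];
    }
}

my $tracker = ScopeTracker->new;
$tracker->start_prefix_mapping({ Prefix => 'x', NamespaceURI => 'urn:outer' });
$tracker->start_prefix_mapping({ Prefix => 'x', NamespaceURI => 'urn:inner' });
print $tracker->uri_for( 'x' ), "\n";    # the innermost declaration
$tracker->end_prefix_mapping({ Prefix => 'x' });
print $tracker->uri_for( 'x' ), "\n";    # back to the outer one
```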

processing_instruction( pi )

This routine handles processing instruction events from the parser, including those found outside the document element. The pi parameter is a hash with these properties:

Target

The target for the processing instruction.

Data

The instruction's data (or undef if there isn't any).

skipped_entity( entity )

Nonvalidating parsers may skip entities rather than resolve them. For example, if they haven't seen a declaration, they can just ignore the entity rather than abort with an error. This method gives the handler a chance to do something with the entity, and perhaps even implement its own entity resolution scheme.

If a parser skips entities, it will have one or more of these features set:

  • Handle external parameter entities (feature-ID is http://xml.org/sax/features/external-parameter-entities)

  • Handle external general entities (feature-ID is http://xml.org/sax/features/external-general-entities)

(In SAX, features are represented as URIs, which may or may not actually exist. See Chapter 10, "Coding Strategies" for a fuller explanation.)

The parameter entity is a hash with this property:

Name

The name of the entity that was skipped. If it's a parameter entity, the name will be prefixed with a percent sign (%).

5.7.2.2. Entity resolver

By default, XML parsers resolve external entity references without your program ever knowing they were there. You may want to override that behavior occasionally. For example, you may have a special way of resolving public identifiers, or the entities are entries in a database. Whatever the reason, if you implement this handler, the parser will call it before attempting to resolve the entity on its own.

The argument to resolve_entity( ) is a hash with two properties: PublicID, a public identifier for the entity, and SystemID, the system-specific location of the entity, such as a filesystem path or a URI. If the public identifier is undef, then none was given, but a system identifier will always be present.

5.7.2.3. Lexical event handlers

Implementation of this group of events is optional. You probably don't need to see these events, so not all parsers will give them to you. However, a few very complete ones will. If you want to be able to duplicate the original source XML down to the very comments and CDATA sections, then you need a parser that supports these event handlers.

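They include start_dtd( ) and end_dtd( ), start_entity( ) and end_entity( ), start_cdata( ) and end_cdata( ), and comment( ).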

5.7.2.4. Error event handlers and catching exceptions

XML::SAX lets you customize your error handling with this group of handlers. Each handler takes one argument, called an exception, that describes the error in detail. The particular handler called represents the severity of the error, as defined by the W3C recommendation for parser behavior. There are three types:

warning( )

This is the least serious of the exception handlers. It represents any error that is not bad enough to halt parsing. For example, an ID reference without a matching ID would elicit a warning, but allow the parser to keep grinding on. If you don't implement this handler, the parser will ignore the exception and keep going.

error( )

This kind of error is considered serious, but recoverable. A validity error falls in this category. The parser should still trundle on, generating events, unless your application decides to call it quits. In the absence of a handler, the parser usually continues parsing.

fatal_error( )

A fatal error might cause the parser to abort parsing. The parser is under no obligation to continue, but might just to collect more error messages. The exception could be a syntax error that makes the document into non-well-formed XML, or it might be an entity that can't be resolved. In any case, this handler represents the highest level of error severity provided in XML::SAX.
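The three handlers might be put to work along these lines. The TallyErrors package is a hypothetical example of ours, and each exception is assumed to be a hash carrying a Message property; the calls at the bottom stand in for a parser.

```perl
use strict;
use warnings;

{
    package TallyErrors;

    sub new { return bless { Warnings => [], Errors => [] }, shift }

    # Note the problem and let parsing continue.
    sub warning {
        my( $self, $exception ) = @_;
        push @{ $self->{ Warnings } }, $exception->{ Message };
    }

    # Recoverable: record it and decide later whether to give up.
    sub error {
        my( $self, $exception ) = @_;
        push @{ $self->{ Errors } }, $exception->{ Message };
    }

    # No way forward: rethrow so the caller sees the exception.
    sub fatal_error {
        my( $self, $exception ) = @_;
        die $exception;
    }
}

# Simulate the calls a parser might make.
my $tally = TallyErrors->new;
$tally->warning({ Message => 'IDREF without matching ID' });
$tally->error({ Message => 'element not allowed here' });
print scalar @{ $tally->{ Warnings } }, " warning(s), ",
      scalar @{ $tally->{ Errors } }, " error(s)\n";
```

Collecting messages instead of printing them right away leaves the decision about what to report, and when, to the application.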

According to the XML specification, conformant parsers are supposed to halt normal processing when they encounter a well-formedness error. In Perl SAX, halting results in a call to die( ). That's not the end of the story, however. Even after the parse session has died, your program doesn't have to follow it to the grave: you can trap the exception and carry on, using the eval{} construct, like this:

eval{ $parser->parse( $uri ) };
if( $@ ) {
  # yikes! handle error here...
}

The $@ variable is a blessed hash of properties that piece together the story about why parsing failed.

These properties include:

Message

A text description about what happened

ColumnNumber

The number of characters into the line where the error occurred, if this error is a parse error

LineNumber

Which line the error happened on, if the exception was thrown while parsing

PublicID

A public identifier for the entity in which the error occurred, if this error is a parse error

SystemID

A system identifier pointing to the offending entity, if a parse error occurred

Not all thrown exceptions indicate that a failure to parse occurred. Sometimes the parser throws an exception because of a bad feature setting.
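Because the location properties are present only for parse errors, code that reports an exception should check for each one before using it. The sketch below does so; describe_failure( ) is a hypothetical helper, and FakeParseError is a made-up class that stands in for whatever a real parser would throw.

```perl
use strict;
use warnings;
use Scalar::Util qw( reftype );

# Turn whatever eval left in $@ into a one-line report.
sub describe_failure {
    my $err = shift;
    # Plain strings and non-hash objects are passed through as text.
    return "$err" unless ( reftype( $err ) || '' ) eq 'HASH';
    my $text = defined $err->{ Message } ? $err->{ Message }
                                         : 'unknown error';
    # Location properties are present only for parse errors.
    if( defined $err->{ LineNumber } ) {
        $text .= " at line $err->{LineNumber}";
        $text .= ", column $err->{ColumnNumber}"
            if defined $err->{ ColumnNumber };
        $text .= " of $err->{SystemID}" if defined $err->{ SystemID };
    }
    return $text;
}

# Stand-in for a failed parse: die with a blessed hash.
eval {
    die bless { Message      => 'mismatched tag',
                LineNumber   => 3,
                ColumnNumber => 14,
                SystemID     => 'foo.xml' }, 'FakeParseError';
};
print describe_failure( $@ ), "\n" if $@;
```

An exception caused by a bad feature setting would simply come out as its message, with no location attached.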

5.7.3. SAX2 Parser Interface

After you've written a handler package, you need to create an instance of the parser, set its features, and run it on the XML source. This section discusses the standard interface for XML::SAX parsers.

The parse( ) method, which gets the parsing process rolling, takes a hash of options as an argument. Here you can assign handlers, set features, and define the data source to be parsed. For example, the following line sets both the handler package and the source document to parse:

$parser->parse( Handler => $handler, 
                 Source => { SystemId => "data.xml" });

The Handler property sets a generic set of handlers that will be used by default. However, each class of handlers has its own assignment slot that will be checked before Handler. These settings include: ContentHandler, DTDHandler, EntityResolver, and ErrorHandler. All of these settings are optional. If you don't assign a handler, the parser will silently ignore events and handle errors in its own way.
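The lookup order can be sketched in a few lines. This is an illustration of the rule, not code from XML::SAX: the %slot_for table lists only a handful of events, handler_for( ) is a hypothetical helper, and plain strings stand in for handler objects.

```perl
use strict;
use warnings;

# Which option slot each kind of event is looked up in first.
# Only a few events are listed; the table is illustrative.
my %slot_for = (
    start_element  => 'ContentHandler',
    characters     => 'ContentHandler',
    resolve_entity => 'EntityResolver',
    fatal_error    => 'ErrorHandler',
);

# Return the object that should receive an event, or undef
# when nobody is listening and the event is dropped.
sub handler_for {
    my( $options, $event ) = @_;
    my $slot = $slot_for{ $event };
    return $options->{ $slot } if $slot and $options->{ $slot };
    return $options->{ Handler };
}

# Strings stand in for handler objects to keep the output readable.
my %options = ( Handler => 'general', ErrorHandler => 'errors' );
print "$_ -> ", handler_for( \%options, $_ ), "\n"
    foreach qw( start_element fatal_error );
```

Here the content event falls back to the general handler, while the error event goes to the object assigned to the more specific slot.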

The Source parameter is a hash used by a parser to hold all the information about the XML being input. It has the following properties:

CharacterStream

This kind of filehandle works in Perl Version 5.7.2 and higher using PerlIO. No encoding translation should be necessary. Use the read( ) function to get a number of characters from it, or use sysread( ) to get a number of bytes. If the CharacterStream property is set, the parser ignores ByteStream or SystemId.

ByteStream

This property sets a byte stream to be read. If CharacterStream is set, this property is ignored. However, it supersedes SystemId. The Encoding property should be set along with this property.

PublicId

This property is optional, but if the application submits a public identifier, it is stored here.

SystemId

This string represents a system-specific location for a document, such as a URI or filesystem path. Even if the source is a character stream or byte stream, this parameter is still useful because it can serve as the base against which relative external entity references are resolved.

Encoding

The character encoding, if known, is stored here.
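The precedence among the three input properties can be summed up in a few lines. The input_kind( ) helper below is hypothetical and not part of XML::SAX; it only reports which property a parser following the rules above would read from.

```perl
use strict;
use warnings;

# Report which property of a Source hash would supply the input:
# CharacterStream wins over ByteStream, which wins over SystemId.
sub input_kind {
    my $source = shift;
    foreach my $key ( qw( CharacterStream ByteStream SystemId ) ) {
        return $key if defined $source->{ $key };
    }
    die "Source hash names no input\n";
}

print input_kind({ SystemId => 'data.xml' }), "\n";

# With a byte stream present, SystemId no longer supplies the data.
print input_kind({ SystemId   => 'data.xml',
                   ByteStream => \*STDIN,
                   Encoding   => 'UTF-8' }), "\n";
```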

Any other options you want to set are in the set of features defined for SAX2. For example, you can tell a parser that you are interested in special treatment for namespaces. One way to set features is by defining the Features property in the options hash given to the parse( ) method. Another way is with the method set_feature( ). For example, here's how you would turn on validation in a validating parser using both methods:

$parser->parse( Features => { 'http://xml.org/sax/features/validation' => 1 } );
$parser->set_feature( 'http://xml.org/sax/features/validation', 1 );

For a complete list of features defined for SAX2, see the documentation at http://sax.sourceforge.net/apidoc/org/xml/sax/package-summary.html. You can also define your own features if your parser has special abilities others don't. To see what features your parser supports, get_features( ) returns a list and get_feature( ) with a name parameter reports the setting of a specific feature.

5.7.4. Example: A Driver

Making your own SAX parser is simple, as most of the work is handled by a base class, XML::SAX::Base. All you have to do is create a subclass of this object and override anything that isn't taken care of by default. Not only is it convenient to do this, but it will result in code that is much safer and more reliable than if you tried to create it from scratch. For example, checking if the handler package implements the handler you want to call is done for you automatically.

The next example proves just how easy it is to create a parser that works with XML::SAX. It's a driver, similar to the kind we saw in Section 5.4, "Drivers for Non-XML Sources", except that instead of turning Excel documents into XML, it reads from web server log files. The parser turns a line like this from a log file:

10.16.251.137 - - [26/Mar/2000:20:30:52 -0800] "GET /index.html HTTP/1.0" 200 16171

into this snippet of XML:

<entry>
<ip>10.16.251.137</ip>
<date>26/Mar/2000:20:30:52 -0800</date>
<req>GET /index.html HTTP/1.0</req>
<stat>200</stat>
<size>16171</size>
</entry>

Example 5-8 implements the XML::SAX driver for web logs. The first subroutine in the package is parse( ). Ordinarily, you wouldn't write your own parse( ) method because the base class does that for you, but it assumes that you want to input some form of XML, which is not the case for drivers. Thus, we shadow that routine with one of our own, specifically trained to handle web server log files.

Example 5-8. Web log SAX driver

package LogDriver;

require 5.005_62;
use strict;
use XML::SAX::Base;
our @ISA = ('XML::SAX::Base');
our $VERSION = '0.01';


sub parse {
    my $self = shift;
    my $file = shift;

    # read the log file, wrapping all entries in one root element
    open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
    $self->SUPER::start_element({ Name => 'server-log' });
    while( my $line = <$fh> ) {
        $self->_process_line( $line );
    }
    close $fh;
    $self->SUPER::end_element({ Name => 'server-log' });
}


sub _process_line {
    my $self = shift;
    my $line = shift;

    if( $line =~ 
          /(\S+)\s\S+\s\S+\s\[([^\]]+)\]\s\"([^\"]+)\"\s(\d+)\s(\d+)/ ) {
        my( $ip, $date, $req, $stat, $size ) = ( $1, $2, $3, $4, $5 );

        $self->SUPER::start_element({ Name => 'entry' });
        
        $self->SUPER::start_element({ Name => 'ip' });
        $self->SUPER::characters({ Data => $ip });
        $self->SUPER::end_element({ Name => 'ip' });
        
        $self->SUPER::start_element({ Name => 'date' });
        $self->SUPER::characters({ Data => $date });
        $self->SUPER::end_element({ Name => 'date' });
        
        $self->SUPER::start_element({ Name => 'req' });
        $self->SUPER::characters({ Data => $req });
        $self->SUPER::end_element({ Name => 'req' });
        
        $self->SUPER::start_element({ Name => 'stat' });
        $self->SUPER::characters({ Data => $stat });
        $self->SUPER::end_element({ Name => 'stat' });
        
        $self->SUPER::start_element({ Name => 'size' });
        $self->SUPER::characters({ Data => $size });
        $self->SUPER::end_element({ Name => 'size' });
        
        $self->SUPER::end_element({ Name => 'entry' });
    }
}

1;

Since web logs are line oriented (one entry per line), it makes sense to create a subroutine that handles a single line, _process_line( ). All it has to do is break down the web log entry into component parts and package them in XML elements. The parse( ) routine simply chops the document into separate lines and feeds them into the line processor one at a time.
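To see the pattern at work without the SAX plumbing, here is a stand-alone sketch that applies the same regular expression to the sample log line. The split_log_line( ) helper and its field names are ours, invented for illustration.

```perl
use strict;
use warnings;

# Split one log line into named fields; returns an empty
# list when the line doesn't match the expected layout.
sub split_log_line {
    my $line = shift;
    my @fields = $line =~
        /(\S+)\s\S+\s\S+\s\[([^\]]+)\]\s\"([^\"]+)\"\s(\d+)\s(\d+)/
        or return;
    my %entry;
    @entry{ qw( ip date req stat size ) } = @fields;
    return %entry;
}

my %entry = split_log_line(
    '10.16.251.137 - - [26/Mar/2000:20:30:52 -0800] '
  . '"GET /index.html HTTP/1.0" 200 16171' );
print "$_ = $entry{$_}\n" foreach qw( ip date req stat size );
```

Note that the pattern insists on digits for the final field. A line whose size is written as a hyphen, as some servers do for responses with no body, would not match and would be skipped silently.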

Notice that we don't call event handlers in the handler package directly. Rather, we pass the data through routines in the base class, using it as an abstract layer between the parser and the handler. This is convenient for you, the parser developer, because you don't have to check if the handler package is listening for that type of event. Again, the base class is looking out for us, making our lives easier.

Let's test the parser now. Assuming that you have this module already installed (don't worry, we'll cover the topic of installing XML::SAX parsers in the next section), writing a program that uses it is easy. Example 5-9 creates a handler package and applies it to the parser we just developed.

Example 5-9. A program to test the SAX driver

use XML::SAX::ParserFactory;
use LogDriver;
my $handler = new MyHandler;
my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse( shift @ARGV );

package MyHandler;

# initialize object with options
#
sub new {
    my $class = shift;
    my $self = {@_};
    return bless( $self, $class );
}


sub start_element {
    my $self = shift;
    my $data = shift;
    print "<", $data->{Name}, ">";
    print "\n" if( $data->{Name} eq 'entry' );
    print "\n" if( $data->{Name} eq 'server-log' );
}

sub end_element {
    my $self = shift;
    my $data = shift;
    print "</", $data->{Name}, ">\n";
}

sub characters {
    my $self = shift;
    my $data = shift;
    print $data->{Data};
}

We use XML::SAX::ParserFactory to demonstrate how a parser can be selected once it is registered. If you wish, you can define attributes for the parser so that subsequent queries can select it based on those properties rather than its name.

The handler package is not terribly complicated; it turns the events into an XML character stream. Each handler receives a hash reference as an argument through which you can access each object's properties by the appropriate key. An element's name, for example, is stored under the hash key Name. It all works pretty much as you would expect.
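One detail the simple handler glosses over is that character data may contain markup characters. A request string with a query in it, for instance, can include an ampersand, which must be escaped if the output is to be well-formed XML. A minimal sketch of the escaping step follows; escape_text( ) is a hypothetical helper that a characters( ) handler could call before printing.

```perl
use strict;
use warnings;

# Replace the markup characters that can't appear literally in text.
sub escape_text {
    my $text = shift;
    $text =~ s/&/&amp;/g;    # must run first, or it would re-escape
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print escape_text( 'GET /search?q=a&lang=en HTTP/1.0' ), "\n";
```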

5.7.5. Installing Your Own Parser

Our coverage of XML::SAX wouldn't be complete without showing you how to create an installation package that adds a parser to the registry automatically. Adding a parser is very easy with the h2xs utility. Though it was originally made to facilitate extensions to Perl written in C, it is invaluable in other ways.

Here, we will use it to create something much like the module installers you've downloaded from CPAN.[26]

[26]For a helpful tutorial on using h2xs, see O'Reilly's The Perl Cookbook by Tom Christiansen and Nat Torkington.

First, we start a new project with the following command:

h2xs -AX -n LogDriver

h2xs automatically creates a directory called LogDriver, stocked with several files.

LogDriver.pm

A stub for our module, ready to be filled out with subroutines.

Makefile.PL

A Perl program that generates a Makefile for installing the module. (Look familiar, CPAN users?)

test.pl

A stub for adding test code to check on the success of installation.

Changes, MANIFEST

Other files used to aid in installation and give information to users.

LogDriver.pm, the module to be installed, doesn't need much extra code to make h2xs happy. It only needs a variable, $VERSION, since h2xs is (justifiably) finicky about that information.

As you know from installing CPAN modules, the first thing you do when opening an installer archive is run the command perl Makefile.PL. Running this command generates a file called Makefile, which configures the installer to your system. Then you can run make and make install to load the module in the right place.

Any deviation from the default behavior of the installer must be coded in the Makefile.PL program. Untouched, it looks like this:

use ExtUtils::MakeMaker;
WriteMakefile(
    'NAME'                => 'LogDriver',         # module name
    'VERSION_FROM'        => 'LogDriver.pm',      # finds version
 );

The argument to WriteMakefile( ) is a hash of properties about the module, used in generating a Makefile file. We can add more properties here to make the installer do more sophisticated things than just copy a module onto the system. For our parser, we want to add this line:

'PREREQ_PM' => { 'XML::SAX' => 0 }

Adding this line triggers a check during installation to see if XML::SAX exists on the system. If not, the installation aborts with an error message. We don't want to install our parser until there is a framework to accept it.

This subroutine should also be added to Makefile.PL:

sub MY::install {
    package MY;
    my $script = shift->SUPER::install(@_);
    $script =~ s/install :: (.*)$/install :: $1 install_sax_driver/m;
    $script .= <<"INSTALL";

install_sax_driver :
\t\@\$(PERL) -MXML::SAX -e "XML::SAX->add_parser(q(\$(NAME)))->save_parsers( )"

INSTALL

    return $script;
}

This example adds the parser to the list maintained by XML::SAX. Now you can install your module.




Copyright © 2002 O'Reilly & Associates. All rights reserved.