Perl & XML

5.5. A Handler Base Class

SAX doesn't distinguish between different elements; it leaves that burden up to you. You have to sort out the element name in the start_element( ) handler, and maybe use a stack to keep track of element hierarchy. Don't you wish there were some way to abstract that stuff? Ken MacLeod has done just that with his XML::Handler::Subs module.
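
To make that burden concrete, here is a minimal sketch of a hand-written handler that sorts out element names and keeps its own stack. The ManualHandler package and its stack and titles fields are hypothetical names invented for this illustration, and the events are simulated with plain hashes rather than coming from a real parser.

```perl
use strict;
use warnings;

{
    package ManualHandler;

    sub new { return bless { stack => [], titles => [] }, shift }

    # Every start tag arrives here; we must inspect the name ourselves.
    sub start_element {
        my( $self, $el ) = @_;
        push @{ $self->{stack} }, $el->{Name};
        push @{ $self->{titles} }, '' if $el->{Name} eq 'title';
        return;
    }

    # Every end tag arrives here; pop our hand-maintained stack.
    sub end_element {
        my( $self, $el ) = @_;
        pop @{ $self->{stack} };
        return;
    }

    # Collect text only while <title> is the innermost open element.
    sub characters {
        my( $self, $props ) = @_;
        my $stack = $self->{stack};
        $self->{titles}[-1] .= $props->{Data}
            if @$stack && $stack->[-1] eq 'title';
        return;
    }
}

# Simulated SAX events (a real parser would generate these).
my $h = ManualHandler->new;
$h->start_element( { Name => 'head' } );
$h->start_element( { Name => 'title' } );
$h->characters( { Data => 'The Life and Times of Fooby' } );
$h->end_element( { Name => 'title' } );
$h->end_element( { Name => 'head' } );
print "$_\n" for @{ $h->{titles} };
```

All of the name checking and stack bookkeeping here is boilerplate that would be repeated in every handler you write.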

This module defines an object that branches handler calls to more specific handlers. If you want a handler that deals only with <title> elements, you can write that handler and it will be called. The handler dealing with a start tag must begin with s_, followed by the element's name (replace special characters with an underscore). End tag handlers are the same, but start with e_ instead of s_.
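
The naming rule can be sketched in plain Perl. This is an illustration of the idea, not the module's source: the Demo package and the handler_name( ) and dispatch_start( ) functions are hypothetical, and the sketch assumes that every character other than a letter, digit, or underscore is replaced by an underscore.

```perl
use strict;
use warnings;

{
    package Demo;

    sub new          { return bless {}, shift }
    sub s_title      { return "start of title" }
    sub s_dc_creator { return "start of dc:creator" }
}

# Hypothetical helper: build the method name a dispatcher would look for.
# Assumes any character outside [A-Za-z0-9_] becomes an underscore.
sub handler_name {
    my( $prefix, $element_name ) = @_;
    ( my $safe = $element_name ) =~ s/[^A-Za-z0-9_]/_/g;
    return $prefix . $safe;
}

# Call the element-specific start handler if the object defines one;
# return undef when there is no such method.
sub dispatch_start {
    my( $handler, $element_name ) = @_;
    my $method = handler_name( 's_', $element_name );
    return $handler->can( $method ) ? $handler->$method( ) : undef;
}

my $demo = Demo->new;
print dispatch_start( $demo, 'title' ), "\n";
print dispatch_start( $demo, 'dc:creator' ), "\n";
```

An element with no matching method, such as <body> in this sketch, is simply skipped.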

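That's not all. The base object also has a built-in stack and provides accessor methods to check where you are in the document. The $self->{Names} variable refers to a stack of element names. According to the module's documentation, in_element( $name ) returns true if $name is the innermost open element, and within_element( $name ) returns the number of currently open elements named $name. Use within_element( ) when you want to know whether the parser is inside an element at any depth.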
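
The difference between testing the innermost element and testing any enclosing element can be sketched with a small stand-alone class. StackDemo and its open_element( ) and close_element( ) methods are hypothetical. Only the Names field and the in_element( ) and within_element( ) method names are borrowed from the module, and the bodies below are a guess at equivalent logic based on its documentation, not its actual code.

```perl
use strict;
use warnings;

{
    package StackDemo;

    sub new { return bless { Names => [] }, shift }

    # Push on a start tag, pop on an end tag.
    sub open_element {
        my( $self, $name ) = @_;
        push @{ $self->{Names} }, $name;
        return;
    }

    sub close_element {
        my $self = shift;
        pop @{ $self->{Names} };
        return;
    }

    # True only if $name is the innermost open element.
    sub in_element {
        my( $self, $name ) = @_;
        my $names = $self->{Names};
        return ( @$names && $names->[-1] eq $name ) ? 1 : 0;
    }

    # Number of currently open elements named $name.
    sub within_element {
        my( $self, $name ) = @_;
        my $count = grep { $_ eq $name } @{ $self->{Names} };
        return $count;
    }
}

# Simulate being positioned inside <html><body><h1><em>.
my $stack = StackDemo->new;
$stack->open_element( $_ ) for qw( html body h1 em );
print "in h1? ",     $stack->in_element( 'h1' ),     "\n";   # prints "in h1? 0"
print "within h1? ", $stack->within_element( 'h1' ), "\n";   # prints "within h1? 1"
```

Once the <em> element closes, <h1> is the innermost element again and both tests succeed.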

To try this out, let's write a program that does something element-specific. Given an HTML file that is also well-formed XML, the program outputs everything inside an <h1> element, even inline elements used for emphasis. The code, shown in Example 5-7, is breathtakingly simple.

Example 5-7. A program subclassing the handler base

use XML::Parser::PerlSAX;
use XML::Handler::Subs;

#
# initialize the parser
#
my $parser = XML::Parser::PerlSAX->new( Handler => H1_grabber->new( ) );
$parser->parse( Source => {SystemId => shift @ARGV} );

## Handler object: H1_grabber
##
package H1_grabber;
use base( 'XML::Handler::Subs' );

sub new {
    my $type = shift;
    my $self = {@_};
    return bless( $self, $type );
}

#
# handle start of document
#
sub start_document {
  my $self = shift;
  $self->SUPER::start_document( @_ );   # let the base class set up its stack
  print "Summary of file:\n";
}

#
# handle start of <h1>: output bracket as delineator
#
sub s_h1 {
  print "[";
}

#
# handle end of <h1>: output bracket as delineator
#
sub e_h1 {
  print "]\n";
}

#
# handle character data
#
sub characters {
  my( $self, $props ) = @_;
  my $data = $props->{Data};
  # within_element( ) counts every open <h1>, not just the innermost element
  print $data if( $self->within_element( 'h1' ));
}

Let's feed the program a test file:

<html>
  <head><title>The Life and Times of Fooby</title></head>
  <body>
    <h1>Fooby as a child</h1>
    <p>...</p>
    <h1>Fooby grows up</h1>
    <p>...</p>
    <h1>Fooby is in <em>big</em> trouble!</h1>
    <p>...</p>
  </body>
</html>

This is what we get on the other side:

Summary of file:
[Fooby as a child]
[Fooby grows up]
[Fooby is in big trouble!]

Even the text inside the <em> element was included, thanks to the call to within_element( ), which checks every open element rather than only the innermost one. XML::Handler::Subs is definitely a useful module to have when doing SAX processing.



Copyright © 2002 O'Reilly & Associates. All rights reserved.