Perl & XML

3.2. XML::Parser

Writing a parser requires a lot of work. You can't be sure you've covered everything without a lot of testing. Unless you're a mutant who loves to write efficient, low-level parser code, your program will probably be slow and resource-intensive. The good news is that a wide variety of free, high-quality, and easy-to-use XML parser packages (written by friendly mutants) already exist to help you. People have bashed Perl and XML together for years, and you have a barnful of conveniently pre-invented wheels at your disposal.

Where do Perl programmers go to find ready-made modules to use in their programs? They go to the Comprehensive Perl Archive Network (CPAN), a many-mirrored public resource full of free, open-source Perl code. If you aren't familiar with using CPAN, you must change your isolationist ways and learn to become a programmer of the world. You'll find a multitude of modules authored by folks who have walked the path of Perl and XML before you, and who've chosen to share the tools they've made with the rest of the world.

TIP: Don't think of CPAN as a catalog of ready-made solutions for all specific XML problems. Rather, look at it as a toolbox or a source of building blocks you can assemble and configure to craft a solution. While some modules specialize in popular XML applications like RSS and SOAP, most are more general-purpose. Chances are, you won't find a module that specifically addresses your needs. You'll more likely take one of the general XML modules and adapt it somehow. We'll show that this process is painless and reveal several ways to configure general modules to your particular application.

XML parsers differ from one another in two major ways. First, they differ in their parsing style, which is how the parser works with XML. There are a few different strategies, such as building a data structure or creating an event stream. Second, they differ in standards-completeness, a spectrum ranging from ad hoc on one extreme to an exhaustive, standards-based solution on the other. The balance on the latter axis is slowly moving from the eccentric, nonstandard side toward the other end as the Perl community agrees on how to implement major standards like SAX and DOM.

The XML::Parser module is the great-grandpappy of all Perl-based XML processors. It is a multifaceted parser, offering a handful of different parsing styles. On the standards axis, it's closer to ad hoc than standards-compliant; however, being the first efficient XML parser to appear on the Perl horizon, it has a dear place in our hearts and is still very useful. While XML::Parser uses a nonstandard API and has a reputation for getting a bit persnickety over some issues, it works. It parses documents with reasonable speed and flexibility, and as all Perl hackers know, people tend to glom onto the first usable solution that appears on the radar, no matter how ugly it is. Thus, nearly all of the first few years' worth of Perl and XML modules and programs based themselves on XML::Parser.

Since 2001 or so, however, other low-level parsing modules have emerged that base themselves on faster and more standards-compliant core libraries. We'll touch on these modules shortly. However, we'll start out with an examination of XML::Parser, giving a nod to its venerability and functionality.

In the early days of XML, a skilled programmer named James Clark wrote an XML parser library in C and called it Expat.[15] Fast, efficient, and very stable, it became the parser of choice among early adopters of XML. To bring XML into the Perl realm, Larry Wall wrote a low-level API for it and called the module XML::Parser::Expat. Then he built a layer on top of that, XML::Parser, to serve as a general-purpose parser for everybody. Now maintained by Clark Cooper, XML::Parser has served as the foundation of many XML modules.

[15]James Clark is a big name in the XML community. He tirelessly promotes the standard with his free tools and involvement with the W3C. You can see his work at http://www.jclark.com/. Clark is also editor of the XSLT and XPath recommendation documents at http://www.w3.org/.

The C underpinnings are the secret to XML::Parser's success. We've seen how to write a basic parser in Perl. If you apply our previous example to a large XML document, you'll wait a long time before it finishes. Others have written complete XML parsers in Perl that are portable to any system, but you'll find much better performance in a compiled C parser like Expat. Fortunately, as with every other Perl module based on C code (and there are actually lots of these modules because they're not too hard to make, thanks to Perl's standard XS library),[16] it's easy to forget you're driving Expat around when you use XML::Parser.

[16]See man perlxs or Chapter 25 of O'Reilly's Programming Perl, Third Edition for more information.

3.2.1. Example: Well-Formedness Checker Revisited

To show how XML::Parser might be used, let's return to the well-formedness checker problem. It's very easy to create this tool with XML::Parser, as shown in Example 3-2.

Example 3-2. Well-formedness checker using XML::Parser

use strict;
use warnings;
use XML::Parser;

my $xmlfile = shift @ARGV;              # the file to parse

# initialize parser object and parse the file
my $parser = XML::Parser->new( ErrorContext => 2 );
eval { $parser->parsefile( $xmlfile ); };

# report any error that stopped parsing, or announce success
if( $@ ) {
    $@ =~ s/at \/.*?$//s;               # remove module line number
    print STDERR "\nERROR in '$xmlfile':\n$@\n";
} else {
    print STDERR "'$xmlfile' is well-formed\n";
}

Here's how this program works. First, we create a new XML::Parser object to do the parsing. Using an object rather than a static function call means that we can configure the parser once and then process multiple files without the overhead of repeatedly recreating the parser. The object retains your settings and keeps the Expat parser routine alive for as long as you want to parse files, and then cleans everything up when you're done.
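To make the point about reusing one parser object concrete, here is a minimal sketch that checks every file named on the command line with a single XML::Parser instance. The helper name check_file( ) is our own invention for illustration, not part of the module.

```perl
use strict;
use warnings;
use XML::Parser;

# One parser object, configured once and reused for every file.
my $parser = XML::Parser->new( ErrorContext => 2 );

# check_file() is a hypothetical helper: it returns an empty string if
# the file parsed cleanly, or the parser's error message otherwise.
sub check_file {
    my ($file) = @_;
    eval { $parser->parsefile($file); 1 } and return '';
    return $@ || 'unknown error';
}

# Check each file named on the command line with the same parser.
for my $file (@ARGV) {
    my $error = check_file($file);
    if ($error) {
        print STDERR "ERROR in '$file':\n$error\n";
    } else {
        print "'$file' is well-formed\n";
    }
}
```

Run it as, say, perl checkall.pl *.xml (the script name is arbitrary); each file gets its own line of output, and a failure in one file does not stop the rest from being checked.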

Next, we call the parsefile( ) method inside an eval block because XML::Parser tends to be a little overzealous when dealing with parse errors. If we didn't use an eval block, our program would die before we had a chance to do any cleanup. We check the variable $@ for content in case there was an error. If there was, we remove the line number of the module at which the parse method "died" and then print out the message.

When initializing the parser object, we set the option ErrorContext => 2. XML::Parser has several options you can set to control parsing. This one is passed straight to the Expat parser, telling it to report errors in context, with up to two lines shown on either side of the offending line. When we print out the error message, it tells us what line the error happened on and prints out the region of text with an arrow pointing to the offending mistake.

Here's an example of our checker choking on a syntactic faux pas (where we decided to name our program xwf as an XML well-formedness checker):

$ xwf ch01.xml 

ERROR in 'ch01.xml':

not well-formed (invalid token) at line 66, column 22, byte 2354:

<chapter id="dorothy-in-oz">
<title>Lions, Tigers & Bears</title>
=====================^

Notice how simple it is to set up the parser and get powerful results. What you don't see until you run the program yourself is that it's fast. When you type the command, you get a result in a split second.

You can configure the parser to work in different ways. You don't have to parse a file, for example. Use the method parse( ) to parse a text string instead. Or, you could give it the option NoExpand => 1 to turn off the default expansion of entity references; when a default handler is also set, the references are passed to that handler, so your own routine can decide how to resolve them. You could use this option to limit how much the parser expands, and therefore the scope of its checking.
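As a quick illustration of parse( ) working on a string rather than a file, the sketch below wraps the check in a small function. The name is_well_formed( ) is hypothetical, chosen only for this example.

```perl
use strict;
use warnings;
use XML::Parser;

# is_well_formed() is a hypothetical helper, not part of XML::Parser.
# It returns 1 if the string parses cleanly and 0 if the parser dies.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

# The first string escapes its ampersand; the second does not.
for my $xml ( '<title>Lions, Tigers &amp; Bears</title>',
              '<title>Lions, Tigers & Bears</title>' ) {
    print is_well_formed($xml) ? "ok\n" : "not ok\n";
}
```

The first string passes and the second fails, for the same reason our ch01.xml example did: a bare ampersand is not allowed in character data.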

Although the well-formedness checker is a very useful tool that you certainly want in your XML toolbox if you work with XML files often, it only scratches the surface of what we can do with XML::Parser. We'll see in the next section that a parser's most important role is in shoveling packaged data into your program. How it does this depends on the particular style you select.

3.2.2. Parsing Styles

XML::Parser supports several different styles of parsing to suit various development strategies. The style doesn't change how the parser reads XML. Rather, it changes how it presents the results of parsing. If you need a persistent structure containing the document, you can have it. Or, if you'd prefer to have the parser call a set of routines you write, you can do it that way. You can set the style when you initialize the object by setting the value of the Style option. Here's a quick summary of the available styles:

Debug

This style prints the document to STDOUT, formatted as an outline (deeper elements are indented more). parse( ) doesn't return anything special to your program.

Tree

This style creates a hierarchical, tree-shaped data structure that your program can use for processing. All elements and their data are crystallized in this form, which consists of nested hashes and arrays.

Objects

Like Tree, this style returns a reference to a hierarchical data structure representing the document. However, instead of using simple data aggregates like hashes and lists, it consists of objects that are specialized to contain XML markup objects.

Subs

This style lets you set up callback functions to handle individual elements. Create a package of routines named after the elements they should handle and tell the parser about this package by using the Pkg option. Every time the parser finds a start tag for an element called <fooby>, it will look for the function fooby( ) in your package. When it finds the end tag for the element, it will try to call the function fooby_( ) in your package. The parser passes each function the Expat object and the element name, and start-tag functions also receive the element's attributes, so you can do whatever processing you need to do with them.

Stream

Like Subs, this style lets you define callbacks for handling particular XML components, but the callbacks are more general than element names. You write functions with fixed names, such as StartTag( ), EndTag( ), and Text( ), to be called for "events" like the start of an element (any element, not just a particular kind), a set of character data, or a processing instruction, and you tell the parser which package holds them with the Pkg option. Separately from any style, you can also register individual handler routines with either the Handlers option or the setHandlers( ) method.

custom

You can subclass the XML::Parser class with your own object. Doing so is useful for creating a parser-like API for a more specific application. For example, the XML::Parser::PerlSAX module uses this strategy to implement the SAX event processing standard.
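To give a feel for the callback-oriented styles, here is a minimal sketch of the Subs style. The package name FontHandlers and the inline sample document are assumptions made up for this illustration; only the Style and Pkg options and the naming convention for the subs come from XML::Parser itself.

```perl
use strict;
use warnings;
use XML::Parser;

# FontHandlers is a hypothetical package holding one sub per element
# we care about. Elements without a matching sub are silently skipped.
package FontHandlers;

our @roles;    # role attributes collected during the parse

# Called for each <font> start tag with the Expat object, the element
# name, and the attributes as a flat list of name/value pairs.
sub font {
    my ( $expat, $element, %attr ) = @_;
    push @roles, $attr{role};
}

# Called for each </font> end tag: the element name plus an underscore.
sub font_ {
    my ( $expat, $element ) = @_;
    # nothing to clean up in this sketch
}

package main;

my $parser = XML::Parser->new( Style => 'Subs', Pkg => 'FontHandlers' );
$parser->parse(
    '<preferences><font role="console"/><font role="default"/></preferences>'
);

print join( ', ', @FontHandlers::roles ), "\n";    # prints "console, default"
```

Because the parser looks up subs by element name, adding support for another element is just a matter of adding another pair of subs to the package.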

Example 3-3 is a program that uses XML::Parser with Style set to Tree. In this mode, the parser reads the whole XML document while building a data structure. When finished, it hands our program a reference to the structure that we can play with.

Example 3-3. An XML tree builder

use strict;
use warnings;
use XML::Parser;
use Data::Dumper;

# initialize parser and read the file
my $parser = XML::Parser->new( Style => 'Tree' );
my $tree = $parser->parsefile( shift @ARGV );

# serialize the structure
print Dumper( $tree );

In Tree mode, the parsefile( ) method returns a reference to a data structure containing the document, encoded as lists and hashes. We use Data::Dumper, a handy module that serializes data structures, to view the result. Example 3-4 is the datafile.

Example 3-4. An XML datafile

<preferences>
  <font role="console">
    <fname>Courier</fname>
    <size>9</size>
  </font>
  <font role="default">
    <fname>Times New Roman</fname>
    <size>14</size>
  </font>
  <font role="titles">
    <fname>Helvetica</fname>
    <size>10</size>
  </font>
</preferences>

With this datafile, the program produces the following output (condensed and indented to be easier to read):

$tree = [
          'preferences', [
            {}, 0, '\n',
            'font', [
              { 'role' => 'console' }, 0, '\n',
              'fname', [ {}, 0, 'Courier' ], 0, '\n',
              'size', [ {}, 0, '9' ], 0, '\n'
            ], 0, '\n',
            'font', [
              { 'role' => 'default' }, 0, '\n',
              'fname', [ {}, 0, 'Times New Roman' ], 0, '\n',
              'size', [ {}, 0, '14' ], 0, '\n'
            ], 0, '\n',
            'font', [
              { 'role' => 'titles' }, 0, '\n',
              'fname', [ {}, 0, 'Helvetica' ], 0, '\n',
              'size', [ {}, 0, '10' ], 0, '\n'
            ], 0, '\n'
          ]
        ];

It's a lot easier to write code that dissects the above structure than to write a parser of your own. We know, because the parser returned a data structure instead of dying mid-parse, that the document was 100 percent well-formed XML. In Chapter 4, "Event Streams", we will use the Stream mode of XML::Parser, and in Chapter 6, "Tree Processing", we'll talk more about trees and objects.
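As a rough sketch of what dissecting such a structure might look like, the following program pulls the font settings out of a Tree-style result. The helpers children( ) and text_of( ) are hypothetical names of our own, and the document is supplied inline (a shortened version of the datafile) so the example runs by itself.

```perl
use strict;
use warnings;
use XML::Parser;

my $xml = <<'XML';
<preferences>
  <font role="console">
    <fname>Courier</fname>
    <size>9</size>
  </font>
  <font role="titles">
    <fname>Helvetica</fname>
    <size>10</size>
  </font>
</preferences>
XML

# children() is a hypothetical helper: given an element's content array,
# it returns [ name, content ] pairs for child elements, skipping text.
sub children {
    my ($content) = @_;
    my @kids;
    # Index 0 is the attribute hash; the rest are (name, content) pairs.
    for ( my $i = 1; $i < @$content; $i += 2 ) {
        my ( $name, $value ) = @{$content}[ $i, $i + 1 ];
        push @kids, [ $name, $value ] unless $name eq '0';
    }
    return @kids;
}

# text_of() is a hypothetical helper: it joins an element's direct text,
# which the Tree style stores under the pseudo-name 0.
sub text_of {
    my ($content) = @_;
    my $text = '';
    for ( my $i = 1; $i < @$content; $i += 2 ) {
        $text .= $content->[ $i + 1 ] if $content->[$i] eq '0';
    }
    return $text;
}

my $tree = XML::Parser->new( Style => 'Tree' )->parse($xml);
my ( $root_name, $root_content ) = @$tree;

for my $font ( grep { $_->[0] eq 'font' } children($root_content) ) {
    my $role  = $font->[1][0]{role};
    my %field = map { $_->[0] => text_of( $_->[1] ) } children( $font->[1] );
    print "$role: $field{fname} $field{size}pt\n";
}
```

With the inline document above, this prints one line per font, "console: Courier 9pt" and "titles: Helvetica 10pt". The index arithmetic is the price of the Tree style's compact layout, which is one reason later chapters look at friendlier tree interfaces.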




Copyright © 2002 O'Reilly & Associates. All rights reserved.