Perl & XML

5.4. Drivers for Non-XML Sources

The filter example used a file containing an XML document as an input source. This example shows just one of many ways to use SAX. Another popular use is to read data from a driver, which is a program that generates a stream of data from a non-XML source, such as a database. A SAX driver converts the data stream into a sequence of SAX events that we can process the way we did previously. What makes this so cool is that we can use the same code regardless of where the data came from. The SAX event stream abstracts the data and markup so we don't have to worry about it. Changing the program to work with files or other drivers would be trivial.
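As a rough illustration of the idea, the sketch below hand-rolls a toy driver that walks an in-memory list of rows and calls SAX-style handler methods for each one. Both Toy_Row_Driver and String_Handler are hypothetical names invented for this sketch, not CPAN modules, and the element names (records, record, columnN) are simply borrowed from the output shown later in this section. A real driver does more work, but the handler on the receiving end has no way to tell the difference.

```perl
use strict;
use warnings;

{
    # Hypothetical toy driver: turns an array of rows into SAX-style events.
    package Toy_Row_Driver;

    sub new {
        my( $class, %args ) = @_;
        return bless( { Handler => $args{Handler} }, $class );
    }

    sub parse {
        my( $self, $rows ) = @_;
        my $h = $self->{Handler};
        $h->start_document( {} );
        $h->start_element( { Name => 'records' } );
        for my $row ( @$rows ) {
            $h->start_element( { Name => 'record' } );
            my $n = 0;
            for my $cell ( @$row ) {
                my $name = 'column' . ++$n;
                $h->start_element( { Name => $name } );
                $h->characters( { Data => $cell } );
                $h->end_element( { Name => $name } );
            }
            $h->end_element( { Name => 'record' } );
        }
        $h->end_element( { Name => 'records' } );
        $h->end_document( {} );
    }
}

{
    # Hypothetical handler: collects the markup in a string instead of printing.
    # It assumes the data contains no characters that need XML escaping.
    package String_Handler;

    sub new { my $class = shift; return bless( { Out => '' }, $class ) }
    sub start_document {}
    sub end_document {}
    sub start_element { my( $self, $p ) = @_; $self->{Out} .= "<$p->{Name}>" }
    sub end_element   { my( $self, $p ) = @_; $self->{Out} .= "</$p->{Name}>" }
    sub characters {
        my( $self, $p ) = @_;
        $self->{Out} .= $p->{Data} if defined $p->{Data};
    }
    sub output { return $_[0]{Out} }
}

# Run the rows through the toy driver and return the generated markup.
sub rows_to_xml {
    my( $rows ) = @_;
    my $handler = String_Handler->new;
    Toy_Row_Driver->new( Handler => $handler )->parse( $rows );
    return $handler->output;
}

print rows_to_xml( [ [ 'baseballs', 55 ], [ 'footballs', 77 ] ] ), "\n";
```

Swapping the array of rows for a database cursor or a spreadsheet reader would change only the driver; the handler would stay exactly as it is.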

To see a driver in action, we will write a program that uses Ilya Sterin's module XML::SAXDriver::Excel to convert Microsoft Excel spreadsheets into XML documents. This example shows how a data stream can be processed in pipeline fashion until it arrives in the form we want. A Spreadsheet::ParseExcel object reads the file and generates a generic data stream, which an XML::SAXDriver::Excel object translates into a SAX event stream. Our program then outputs this stream as XML.

Here's a test Excel spreadsheet, represented as a table:

        A                 B
1       baseballs         55
2       tennisballs       33
3       pingpong balls    12
4       footballs         77

The SAX driver will create new elements for us, giving us the names in the form of arguments to handler method calls. We will just print them out as they come and see how the driver structures the document. Example 5-6 is a simple program that does this.

Example 5-6. Excel parsing program

use XML::SAXDriver::Excel;

# get the file name to process
die( "Must specify an input file" ) unless( @ARGV );
my $file = shift @ARGV;
print "Parsing $file...\n";

# initialize the parser
my $handler = Excel_SAX_Handler->new;
my %props = ( Source => { SystemId => $file },
              Handler => $handler );
my $driver = XML::SAXDriver::Excel->new( %props );

# start parsing
$driver->parse( %props );

# The handler package we define to print out the XML
# as we receive SAX events.
package Excel_SAX_Handler;

# initialize the package
sub new {
    my $type = shift;
    my $self = {@_};
    return bless( $self, $type );
}

# create the outermost element
sub start_document {
    print "<doc>\n";
}

# end the document element
sub end_document {
    print "</doc>\n";
}

# handle any character data

sub characters {
    my( $self, $properties ) = @_;
    my $data = $properties->{'Data'};
    print $data if defined($data);
}

# start a new element, outputting the start tag
sub start_element {
    my( $self, $properties ) = @_;
    my $name = $properties->{'Name'};
    print "<$name>";
}

# end the new element
sub end_element {
    my( $self, $properties ) = @_;
    my $name = $properties->{'Name'};
    print "</$name>";
}

As you can see, the handler methods look very similar to those used in the previous SAX example. All that has changed is what we do with the arguments. Now let's see what the output looks like when we run it on the test file:

<doc>

<records>
        <record>
                <column1>baseballs</column1>
                <column2>55</column2>
        </record>
        <record>
                <column1>tennisballs</column1>
                <column2>33</column2>
        </record>
        <record>
                <column1>pingpong balls</column1>
                <column2>12</column2>
        </record>
        <record>
                <column1>footballs</column1>
                <column2>77</column2>
        </record>
        <record>
Use of uninitialized value in print at conv line 39.
                <column1></column1>
Use of uninitialized value in print at conv line 39.
                <column2></column2>
        </record>
</records></doc>

The driver did most of the work in creating elements and formatting the data. All we did was output the pieces of data it handed us in the form of method calls. It wrapped the whole document in <records>, making our use of <doc> superfluous. (In the next revision of the code, we'll make the start_document( ) and end_document( ) methods output nothing.) Each row of the spreadsheet is encapsulated in a <record> element. Finally, the two columns are differentiated with <column1> and <column2> labels. All in all, not a bad job.

You can see that with a minimal amount of effort on our part, we have harnessed the power of SAX to do some complex work converting from one format to another. The driver actually automates the conversion, but it gives us enough flexibility in interpreting the events so that we can reject bad data (the empty row, for example) or rename elements. We can even perform complex processing, such as adding up values or sorting rows.
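The following is one possible sketch of such a handler, not the revision the book goes on to develop. It buffers each record, drops rows whose fields are all empty, renames the generic column elements, and keeps a running total. The package name Filtering_Handler, the replacement names item and quantity, and the feed_row helper (which replays events by hand so the sketch runs without the Excel driver) are all assumptions made for illustration.

```perl
use strict;
use warnings;

{
    package Filtering_Handler;

    # Hypothetical mapping from the driver's generic names to our own.
    my %rename = ( column1 => 'item', column2 => 'quantity' );

    sub new {
        my $class = shift;
        my $self = { Out => '', Total => 0, Record => undef, Field => undef };
        return bless( $self, $class );
    }

    sub start_document {}
    sub end_document {}

    sub start_element {
        my( $self, $p ) = @_;
        my $name = $p->{Name};
        if( $name eq 'record' ) {
            $self->{Record} = [];                 # begin buffering a row
        } elsif( defined $self->{Record} ) {
            $self->{Field} = [ $rename{$name} || $name, '' ];
            push @{ $self->{Record} }, $self->{Field};
        }
    }

    # Collect text only while inside a column element.
    sub characters {
        my( $self, $p ) = @_;
        return unless defined $self->{Field} and defined $p->{Data};
        $self->{Field}[1] .= $p->{Data};
    }

    sub end_element {
        my( $self, $p ) = @_;
        if( $p->{Name} ne 'record' ) {
            $self->{Field} = undef;               # leaving a column (or the root)
            return;
        }
        my @fields = @{ $self->{Record} };
        $self->{Record} = undef;
        return unless grep { $_->[1] =~ /\S/ } @fields;   # reject empty rows

        # Values are assumed to need no XML escaping in this sketch.
        $self->{Out} .= '<record>';
        for my $f ( @fields ) {
            my( $name, $value ) = @$f;
            $self->{Out} .= "<$name>$value</$name>";
            $self->{Total} += $value
                if $name eq 'quantity' and $value =~ /^\d+$/;
        }
        $self->{Out} .= "</record>\n";
    }
}

# Stand-in for the driver: replay the events for one spreadsheet row.
sub feed_row {
    my( $h, @cells ) = @_;
    $h->start_element( { Name => 'record' } );
    my $n = 0;
    for my $cell ( @cells ) {
        my $name = 'column' . ++$n;
        $h->start_element( { Name => $name } );
        $h->characters( { Data => $cell } );
        $h->end_element( { Name => $name } );
    }
    $h->end_element( { Name => 'record' } );
}

my $h = Filtering_Handler->new;
feed_row( $h, 'baseballs', 55 );
feed_row( $h, 'footballs', 77 );
feed_row( $h, undef, undef );                     # the empty trailing row
print $h->{Out}, "total: $h->{Total}\n";
```

Buffering a whole record before emitting anything is what makes it possible to reject the row after the fact; a handler that printed each tag as it arrived, like Example 5-6, would already have written the opening <record> by the time it learned the row was empty.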




Copyright © 2002 O'Reilly & Associates. All rights reserved.