Appendix C. Utilities for Processing OpenDocument Files

Using XSLT to Indent OpenDocument Files

As an application of the preceding script, we present an alternate method of indenting the unpacked files via a simple XSLT transformation. Example C.4, “XSLT Transformation for Indenting” shows this transformation, which simply copies the entire document tree while setting indent to yes in the <xsl:output> element.

Example C.4. XSLT Transformation for Indenting

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes"/>

<xsl:template match="/">
    <xsl:copy-of select="."/>
</xsl:template>

</xsl:stylesheet>
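
If you would rather experiment with this transformation from Java than from a shell script, the following minimal sketch applies the same stylesheet through the standard javax.xml.transform API. The class name IndentSketch and the sample input are our own assumptions, not part of the downloadable files, and the exact whitespace in the result depends on the XSLT processor that your Java runtime supplies.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class IndentSketch {
    // The stylesheet from Example C.4, inlined so the sketch is self-contained.
    static final String STYLESHEET =
        "<?xml version=\"1.0\"?>"
        + "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" indent=\"yes\"/>"
        + "<xsl:template match=\"/\"><xsl:copy-of select=\".\"/></xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the indenting stylesheet to an XML string and returns the result. */
    static String indent(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input; a real run would read content.xml instead.
        System.out.println(indent("<a><b>x</b><c/></a>"));
    }
}
```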

We now present a Perl program to invoke this transformation on all the XML files in an unpacked OpenDocument file. We will need to set two paths: one to the transformation script, and one to the location of the preceding XSLT transformation. Make sure you use absolute paths for setting variables $script_location and $transform_location, because find() changes directories as it traverses the directory tree. This is file od_indent.pl in the appc directory in the downloadable example files.

Example C.5. Program to Indent OpenDocument Files via XSLT

#!/usr/bin/perl

use File::Find;

#
#   This program indents XML files within a directory.
#   A simple XSLT transform is used to indent the XML.
#

#
#   Path where you have installed the OpenDocument transform script.
#
$script_location = "/your/path/to/odtransform.sh";

#
#   Path where you have installed the XSLT transformation.
#
$transform_location = "/your/path/to/od_indent.xsl";

if (scalar @ARGV != 1)
{
    print "Usage: $0 directory\n";
    exit;
}

if (!-e $script_location)
{
    print "Cannot find the transform script at $script_location\n";
    exit;
}

if (!-e $transform_location)
{
    print "Cannot find the XSLT transformation file at " ,
        "$transform_location\n";
    exit;
}   

$dir_name = $ARGV[0];

if (!-d $dir_name)
{
    print "The argument to $0 must be the name of a directory\n";
    print "containing XML files to be indented.\n";
    exit;
}

#
#   Indent all XML files.
#
find(\&indent, $dir_name);

#   Warning:
#   This subroutine creates a temporary file with the format
#   __tempnnnn.xml, where nnnn is the current time( ). This
#   will avoid name conflicts when used with OpenOffice.org documents,
#   even though the technique is not sufficiently robust for general use.
#
sub indent
{
    my $xmlfile = $_;
    my $command;
    my $result;
    if ($xmlfile =~ m/\.xml$/)
    {
        $time = time();
        print "Indenting $xmlfile\n";
        $command = "$script_location " .
            "-in $xmlfile -xsl $transform_location -out __temp$time.xml";
        $result = system( $command );
        if ($result == 0 && -e "__temp$time.xml")
        {
            unlink $xmlfile;
            rename "__temp$time.xml", $xmlfile;
        }   
        else
        {
            print "Error occurred while indenting $xmlfile\n";
        }   
    }
}
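
The warning in the program above points out that naming the temporary file after time() is not robust in general. The following Java sketch shows one possible alternative under our own assumptions: it asks the file system for a uniquely named temporary file in the same directory, writes the new content there, and then moves it over the original. The names RewriteXmlFiles, findXmlFiles, rewrite, and demo are hypothetical, and the lambda passed to rewrite is only a stand-in for a real transformation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RewriteXmlFiles {
    /** Returns every regular file under root whose name ends in ".xml". */
    static List<Path> findXmlFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".xml"))
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rewrites a file by way of a uniquely named temporary file beside it. */
    static void rewrite(Path file, UnaryOperator<String> transform) {
        try {
            String original =
                new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Path dir = file.toAbsolutePath().getParent();
            Path temp = Files.createTempFile(dir, "__temp", ".xml");
            try {
                Files.write(temp,
                    transform.apply(original).getBytes(StandardCharsets.UTF_8));
                Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
            } finally {
                Files.deleteIfExists(temp);   // no-op after a successful move
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a scratch directory, rewrites its XML files, returns their names. */
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("od_unpacked");
            Path metaInf = Files.createDirectory(root.resolve("META-INF"));
            Files.write(root.resolve("content.xml"),
                "<a/>".getBytes(StandardCharsets.UTF_8));
            Files.write(metaInf.resolve("manifest.xml"),
                "<m/>".getBytes(StandardCharsets.UTF_8));
            Files.write(root.resolve("mimetype"),
                "text".getBytes(StandardCharsets.UTF_8));

            List<String> names = new ArrayList<>();
            for (Path file : findXmlFiles(root)) {
                rewrite(file, xml -> xml.trim() + "\n");  // stand-in transformation
                names.add(file.getFileName().toString());
            }
            Collections.sort(names);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the list of files is collected before any file is rewritten, the temporary files never show up in the traversal. Note that on POSIX systems a file created by createTempFile is readable only by its owner, so you may want to restore the original permissions after the move.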

This process may insert newlines in text as well as between elements. Where elements contain only other elements, this is not a problem, because OpenDocument ignores whitespace between elements. Where elements contain text, though, the extra newlines could show up as extra spaces when the document is repacked. Thus, you should use this method to indent the XML only when you do not intend to repack the resulting files.

An XSLT Framework for OpenDocument Files

When using XSLT with OpenDocument files, you will want to make sure you have declared all the appropriate namespaces. Rather than selecting exactly the namespaces that your document uses, we provide all of the namespaces for OpenDocument in Example C.6, “XSLT Framework for Transforming OpenDocument”, which you may use as a framework for your transformations. This is file framework.xsl in directory appc in the downloadable example files.

Example C.6. XSLT Framework for Transforming OpenDocument

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
    xmlns:config="urn:oasis:names:tc:opendocument:xmlns:config:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
    xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0"
    xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0"
    xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0"
    xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0"
    xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0"
    xmlns:anim="urn:oasis:names:tc:opendocument:xmlns:animation:1.0"

    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:math="http://www.w3.org/1998/Math/MathML"
    xmlns:xforms="http://www.w3.org/2002/xforms"

    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
    xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
    xmlns:smil="urn:oasis:names:tc:opendocument:xmlns:smil-compatible:1.0"
    
    xmlns:ooo="http://openoffice.org/2004/office"
    xmlns:ooow="http://openoffice.org/2004/writer"
    xmlns:oooc="http://openoffice.org/2004/calc" 
>

<xsl:template match="/office:document-content">
    <xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>
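
To show how a transformation built on this framework might be filled in, here is a minimal Java sketch that runs a cut-down version of it against a tiny document and prints the text of each paragraph. Only the office and text namespaces are declared, since those are the only ones this example touches. The class name FrameworkSketch, the added text:p template, and the abbreviated content.xml are our own assumptions for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FrameworkSketch {
    static final String OFFICE_NS =
        "urn:oasis:names:tc:opendocument:xmlns:office:1.0";
    static final String TEXT_NS =
        "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // A cut-down framework: only the two namespaces this example needs.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:office=\"" + OFFICE_NS + "\""
        + " xmlns:text=\"" + TEXT_NS + "\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/office:document-content\">"
        + "<xsl:apply-templates select=\".//text:p\"/>"
        + "</xsl:template>"
        + "<xsl:template match=\"text:p\">"
        + "<xsl:value-of select=\".\"/><xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // A hypothetical, heavily abbreviated content.xml.
    static final String CONTENT =
        "<office:document-content xmlns:office=\"" + OFFICE_NS + "\""
        + " xmlns:text=\"" + TEXT_NS + "\">"
        + "<office:body><office:text>"
        + "<text:p>First</text:p><text:p>Second</text:p>"
        + "</office:text></office:body></office:document-content>";

    /** Returns the text of every paragraph, one per line. */
    static String paragraphs(String contentXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(contentXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(paragraphs(CONTENT));
    }
}
```

If the text namespace declaration were left out of the stylesheet, the processor would reject the text:p pattern because the prefix would be undeclared, which is why the full framework declares every namespace up front.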

OpenDocument White Space Representation

If you are creating an OpenDocument file from a file where white space has been preserved, you will have to convert runs of spaces into <text:s> elements, and convert tabs and line feeds into <text:tab> and <text:line-break> elements. This task is not easily done in native XSLT. Example C.7, “Transforming Whitespace to OpenDocument XML” is a Java extension for Xalan that does the conversion. Note that we create elements and attributes complete with namespace prefix. This is certainly not a recommended practice, but createElementNS() and setAttributeNS() create xmlns attributes rather than a prefixed name. You will find this Java code in file ODWhiteSpace.java in directory appc in the downloadable example files.

Example C.7. Transforming Whitespace to OpenDocument XML

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.apache.xpath.NodeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class ODWhiteSpace {

    public ODWhiteSpace () 
    {}

    public static NodeList compressString( String str )
    {
        ODWhiteSpace whiteSpace = new ODWhiteSpace();
        return whiteSpace.doCompress( str );
    }

    private Document tempDoc;       // necessary for creating elements
    private StringBuffer strBuf;    // where non-whitespace accumulates
    private NodeSet resultSet;      // the value to be returned
    private int pos;                // current position in string
    private int startPos;           // where blanks begin accumulating
    private int nSpaces;            // number of consecutive spaces
    private boolean inSpaces;       // handling spaces?
    private char ch;                // current character in buffer
    private char prevChar;          // previous character in buffer
    private Element element;        // element to be added to node list

    /**
     * Create OpenDocument elements for a string.
     * @param str the string to compress.
     * @return a NodeList for insertion into an OpenDocument file
    */
    public NodeList doCompress( String str )
    {  
        if (str.length() == 0)
        {
            return null;
        }
        tempDoc = null;
        strBuf = new StringBuffer( str.length() );

    
        try
        {
            tempDoc = DocumentBuilderFactory.newInstance().
                newDocumentBuilder().newDocument();
        }
        catch(ParserConfigurationException pce)
        {
            return null;
        }
 
        resultSet = new NodeSet();
        resultSet.setShouldCacheNodes(true);
        
        pos = 0;
        startPos = 0;
        nSpaces = 0;
        inSpaces = false;
        ch = '\u0000';

        while (pos < str.length())
        {
            prevChar = ch;
            ch = str.charAt( pos );
            if (ch == ' ')
            {
                if (inSpaces)
                {
                    nSpaces++;
                }
                else
                {
                    emitText( );
                    nSpaces = 1;
                    inSpaces = true;
                    startPos = pos;
                }
            }
            else if (ch == 0x000a || ch == 0x000d)
            {
                if (prevChar != 0x000d) // ignore LF or CR after CR.
                {
                    emitPending( );
                    element = tempDoc.createElement("text:line-break");
                    resultSet.addNode(element);
                }      
            }
            else if (ch == 0x09)
            {
                emitPending( );
                element = tempDoc.createElement("text:tab");
                resultSet.addNode(element);
            }
            else
            {
                if (inSpaces)
                {
                    emitSpaces( );
                }
                strBuf.append( ch );
            }
            pos++;
        }
        
        emitPending( );     // empty out anything that's accumulated
        return resultSet;
    }
    
    /**
     * Emit accumulated spaces or text
     */
    private void emitPending( )
    {
        if (inSpaces)
        {
            emitSpaces( );
        }
        else
        {
            emitText( );
        }
    }

    /**
     * Emit accumulated text.
     * Creates a text node with currently accumulated text.
     * Side effect: empties accumulated text buffer
     */
    private void emitText( )
    {
        if (strBuf.length() != 0)
        {
            Text textNode = tempDoc.createTextNode( strBuf.toString( ) );
            resultSet.addNode( textNode );
            strBuf = new StringBuffer( );
        }
    }
    
    /**
     * Emit accumulated spaces.
     * If these are leading blanks, emit only a
     * &lt;text:s&gt; element; otherwise a blank plus
     * a &lt;text:s&gt; element (if necessary)
     * Side effect: sets accumulated number of spaces to zero.
     * Side effect: sets "inSpaces" flag to false
     */
    private void emitSpaces( )
    {
        if (nSpaces != 0)
        {
            if (startPos != 0)
            {
                Text textNode = tempDoc.createTextNode( " " );
                resultSet.addNode( textNode );
                nSpaces--;
            }

            if (nSpaces >= 1 || startPos == 0)
            {
                element = tempDoc.createElement( "text:s" );
                element.setAttribute( "text:c", 
                    Integer.toString( nSpaces ) );
                resultSet.addNode( element );
            }

            inSpaces = false;
            nSpaces = 0;
        }
    }
}
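
To see what these rules produce without setting up Xalan, the following sketch applies the same decisions to a plain string and returns the markup as text. It is an illustration under our own assumptions rather than a replacement for the extension: the class WhiteSpaceSketch and its methods are hypothetical, it builds a string instead of DOM nodes, and it writes the tab element as <text:tab>, the name used in the OpenDocument specification.

```java
class WhiteSpaceSketch {
    /** Converts white space in str to OpenDocument markup, returned as a string. */
    static String toOpenDocument(String str) {
        StringBuilder out = new StringBuilder();
        int nSpaces = 0;        // length of the current run of spaces
        int startPos = 0;       // where the current run of spaces began
        char prev = '\u0000';   // previous character

        for (int pos = 0; pos < str.length(); pos++) {
            char ch = str.charAt(pos);
            if (ch == ' ') {
                if (nSpaces == 0) {
                    startPos = pos;
                }
                nSpaces++;
            } else {
                appendSpaces(out, nSpaces, startPos == 0);
                nSpaces = 0;
                if (ch == '\n' || ch == '\r') {
                    if (prev != '\r') {     // ignore LF or CR after CR
                        out.append("<text:line-break/>");
                    }
                } else if (ch == '\t') {
                    out.append("<text:tab/>");
                } else {
                    appendEscaped(out, ch);
                }
            }
            prev = ch;
        }
        appendSpaces(out, nSpaces, startPos == 0);  // flush a trailing run
        return out.toString();
    }

    // Leading runs become a single <text:s>; other runs keep one literal
    // blank and put the remainder (if any) into <text:s>.
    private static void appendSpaces(StringBuilder out, int n, boolean leading) {
        if (n == 0) {
            return;
        }
        if (!leading) {
            out.append(' ');
            n--;
        }
        if (n >= 1) {
            out.append("<text:s text:c=\"").append(n).append("\"/>");
        }
    }

    // Escape the characters that are special in XML text content.
    private static void appendEscaped(StringBuilder out, char ch) {
        switch (ch) {
            case '&': out.append("&amp;"); break;
            case '<': out.append("&lt;");  break;
            case '>': out.append("&gt;");  break;
            default:  out.append(ch);
        }
    }

    public static void main(String[] args) {
        System.out.println(toOpenDocument("a   b\tc"));
        // prints: a <text:s text:c="2"/>b<text:tab/>c
    }
}
```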

Showing Meta-information Using SAX

This is the same program as Example 2.3, “Program show_meta.pl”, except that it uses the XML::SAX module instead of XML::Simple. XML::SAX is a Perl module for the Simple API for XML, which interfaces to an event-driven parser. The parser issues many kinds of events as it parses a document; the ones we are interested in are the events that occur when an element starts, when it ends, and when we encounter the element’s text content. To use XML::SAX, you must specify a handler object, which is a Perl package containing subroutines that are called when the parser detects events. Each handler subroutine receives two parameters: a reference to the handler object, and a data hash with information about the event. Here are the subroutines that we will implement, the keys from the data hash that we are interested in, and how we will use their values.

start_element

This subroutine is called whenever the parser detects an opening tag for an element. The relevant keys are

Name: The name of the element (with namespace prefix)
Attributes: The value of this key is yet another hash, whose keys are the attribute names, preceded by their namespace URIs. The value for each of these keys is yet another hash, with keys Name and Value, whose values are the attribute name and value.

The program will store the element name in a scalar $element and the attributes in a global array @attributes. It sets a global scalar $text to the null string; this variable will be used to collect all the element’s text content.

characters

This subroutine is called whenever the parser detects a series of characters within an element. The relevant key is

Data: The characters that have been parsed.

The text is concatenated to the end of the $text variable. This is necessary because a single sequence of text may generate multiple calls to the character handler.

end_element

This subroutine is called whenever the parser detects a closing tag for an element. The relevant key is

Name: The name of the element (with namespace prefix).

Upon encountering the end of an element, the program will add the element name as a key in a hash named %element_info. The hash value will be an anonymous array consisting of the $text content followed by the @attributes array.

Here is the rewritten program, which you will find in file sax_show_meta.pl in the appc directory in the downloadable example files.

Example C.8. Program sax_show_meta.pl

#!/usr/bin/perl

#
#   Show meta-information in an OpenDocument file.
#
use XML::SAX;
use IO::File;
use Text::Wrap;
use Carp;
use strict 'vars';

my $suffix;     # file suffix

my $parser;     # instance of XML::SAX parser
my $handler;    # module that handles elements, etc.
my $filehandle; # file handle for piped input

my $info;       # the hash returned from the parser
my @attributes; # attributes from a returned element
my %attr_hash;  # hash of attribute names and values
#
#   Check for one argument: the name of the OpenDocument file
#
if (scalar @ARGV != 1)
{
    croak("Usage: $0 document");
}

#
#   Get file suffix for later reference
#
($suffix) = $ARGV[0] =~ m/\.(\w\w\w)$/;

#
#   Create an object containing handlers for relevant events.
#
$handler = MetaElementHandler->new();


#
#   Create a parser and tell it where to find the handlers.
#
$parser =
    XML::SAX::ParserFactory->parser( Handler => $handler);

#
#   Input to the parser comes from the output of member_read.pl
# 
$ARGV[0] =~ s/[;|'"]//g;  #eliminate dangerous shell metacharacters     
$filehandle = IO::File->new( "perl member_read.pl $ARGV[0] meta.xml |" ); 

#
#   Parse and collect information.
#
$parser->parse_file( $filehandle );

#
#   Retrieve the information collected by the parser
#
$info = $handler->get_info();  

#
#   Output phase
#
print "Title:       $info->{'dc:title'}[0]\n"
    if ($info->{'dc:title'}[0]);
print "Subject:     $info->{'dc:subject'}[0]\n"
    if ($info->{'dc:subject'}[0]);

if ($info->{'dc:description'}[0])
{
    print "Description:\n";
    $Text::Wrap::columns = 60;
    print wrap("\t", "\t", $info->{'dc:description'}[0]), "\n";
}

print "Created:     ";
print format_date($info->{'meta:creation-date'}[0]);
print " by $info->{'meta:initial-creator'}[0]"
    if ($info->{'meta:initial-creator'}[0]);
print "\n";

print "Last edit:   ";
print format_date($info->{"dc:date"}[0]);
print " by $info->{'dc:creator'}[0]"
    if ($info->{'dc:creator'}[0]);
print "\n";

#
#   Take attributes from the meta:document-statistic element
#   (if any) and put them into %attr_hash
#
@attributes = @{$info->{'meta:document-statistic'}};

if (scalar(@attributes) > 1)
{
    shift @attributes;
    %attr_hash = @attributes;

    if ($suffix eq "odt")
    {
        print "Pages:       $attr_hash{'meta:page-count'}\n";
        print "Words:       $attr_hash{'meta:word-count'}\n";
        print "Tables:      $attr_hash{'meta:table-count'}\n";
        print "Images:      $attr_hash{'meta:image-count'}\n";
    }
    elsif ($suffix eq "ods")
    {
        print "Sheets:      $attr_hash{'meta:table-count'}\n";
        print "Cells:       $attr_hash{'meta:cell-count'}\n"
            if ($attr_hash{'meta:cell-count'});
    }
}

#
#   A convenience subroutine to make dates look
#   prettier than ISO-8601 format.
#
sub format_date
{
    my $date = shift;
    my ($year, $month, $day, $hr, $min, $sec);
    my @monthlist = qw (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
    
    ($year, $month, $day, $hr, $min, $sec) =
        $date =~ m/(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})/;
    return "$hr:$min on $day $monthlist[$month-1] $year";
}


package MetaElementHandler; 

my %element_info;   # the data structure that we are creating
my $element;        # name of element being processed
my @attributes;     # attributes for this element
my $text;           # text content of the element


sub new { 
    my $class = shift;
    my %opts = @_;
    bless \%opts, $class;
}

sub reset {
    my $self = shift;
    %$self = ();
}

#
#   Store current element and its attributes.
#
sub start_element
{
    my ($self, $parser_data) = @_;
    
    my $hashref; 
    my $item;       # loop control variable

    $element = $parser_data->{"Name"};
    @attributes = ();   # discard attributes left over from earlier elements

    foreach $item (keys %{$parser_data->{"Attributes"}})
    {
        $hashref =  $parser_data->{"Attributes"}{$item};
        push @attributes, $hashref->{"Name"},  $hashref->{"Value"};
    }
    
    $text = ""; # no text content yet.
}

#
#   Create an entry into a hash for the element that is ending
#
sub end_element
{
    my ($self, $parser_data) = @_;

    $element = $parser_data->{"Name"};
    $element_info{$element} = [$text, @attributes];
}

#
#   Accumulate element's text content.
#
sub characters
{
    my ($self, $parser_data) = @_;
    $text .= $parser_data->{"Data"}; 
}

#
#   Return a reference to the %element_info hash
#
sub get_info 
{
    my $self = shift;
    return \%element_info;
}

	XML::SAX doesn’t read from file handles opened with the standard Perl `open()` function; you have to use IO::File to create the file handle.
	The handler object has accumulated all the information from the `meta.xml` file into a hash. We ask the handler to return a reference to that hash.
	XML::SAX wants its handler subroutines to be in a Perl object. The `package` statement serves to “encapsulate” the variables and subroutines. As good citizens, we don’t directly access any of the variables from the main program.
	The `new` subroutine completes the work of making this package into a Perl object. The `reset` subroutine is for XML::SAX’s internal use.
	The `$hashref` variable is here for convenience; if we didn’t use it, then the `push` statement would be even less readable than it already is.
	Note the `.=` operation; since the text inside an element can come from many calls to `characters`, we have to concatenate them all.
	This is not an XML::SAX routine; we are providing it so that we can hand a reference to our accumulated data back to the main program.
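
The same event-driven pattern carries over to other SAX implementations. As a rough comparison, here is a minimal sketch using the SAX parser that ships with Java; it collects text content and attributes in the same way, keyed by the prefixed element name. The class MetaSaxSketch, its fields, and the sample meta.xml fragment are our own assumptions. Like the Perl program, it records the attributes of the most recently started element, so it is best suited to leaf elements such as those found in meta.xml.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MetaSaxSketch extends DefaultHandler {
    // A hypothetical, heavily abbreviated meta.xml.
    static final String SAMPLE =
        "<office:document-meta"
        + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
        + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
        + " xmlns:meta=\"urn:oasis:names:tc:opendocument:xmlns:meta:1.0\">"
        + "<office:meta><dc:title>Report</dc:title>"
        + "<meta:document-statistic meta:page-count=\"3\""
        + " meta:word-count=\"120\"/>"
        + "</office:meta></office:document-meta>";

    // Element name -> text content, filled in as each element ends.
    final Map<String, String> text = new LinkedHashMap<>();
    // Element name -> attributes (prefixed attribute name -> value).
    final Map<String, Map<String, String>> attributes = new LinkedHashMap<>();

    private final StringBuilder currentText = new StringBuilder();
    private Map<String, String> currentAttributes = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        currentText.setLength(0);                   // no text content yet
        currentAttributes = new LinkedHashMap<>();  // attributes of this element only
        for (int i = 0; i < atts.getLength(); i++) {
            currentAttributes.put(atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // One run of text may arrive in several calls, so append each piece.
        currentText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.put(qName, currentText.toString());
        attributes.put(qName, currentAttributes);
    }

    /** Parses an XML string and returns the handler holding the results. */
    static MetaSaxSketch parse(String xml) {
        MetaSaxSketch handler = new MetaSaxSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        MetaSaxSketch info = parse(SAMPLE);
        System.out.println("Title: " + info.text.get("dc:title"));
        System.out.println("Pages: "
            + info.attributes.get("meta:document-statistic").get("meta:page-count"));
    }
}
```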

Creating Multiple Directory Levels

If you need to create multiple directory levels, but your system doesn’t have the equivalent of Linux’s mkdir --parents option, use the program shown in Example C.9, “Program to Create Directories”, which is in file make_directories.pl in the appc directory in the downloadable example files.

Example C.9. Program to Create Directories

#
#   Command line parameter: pathname of directory to create
#
#   Creates directory and all intervening levels.
#
use File::Path;

if (scalar @ARGV == 1)
{
    mkpath($ARGV[0], 1, 0755);
}
else
{
    print "Usage: $0 path_to_create\n";
}
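
If Java is more convenient on your system than Perl, a comparable sketch uses Files.createDirectories, which likewise creates the directory and all intervening levels and does not complain if the directory already exists. The class name MakeDirectories and its make method are our own assumptions; unlike the Perl version, this sketch does not set a permission mode.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class MakeDirectories {
    /** Creates the directory and any missing parents; returns the path. */
    static Path make(String pathname) {
        try {
            return Files.createDirectories(Paths.get(pathname));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 1) {
            System.out.println("Created " + make(args[0]));
        } else {
            System.out.println("Usage: java MakeDirectories path_to_create");
        }
    }
}
```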

Copyright (c) 2005 O’Reilly & Associates, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
