Chapter 2. The meta.xml, styles.xml, settings.xml, and content.xml Files

Though content.xml is king, monarchs rule better when surrounded by able assistants. In an OpenDocument JAR file, these assistants are the meta.xml, styles.xml, and settings.xml files. In this chapter, we will examine the assistant files, and then describe the general structure of the content.xml file.

The only files that are actually necessary are content.xml and the META-INF/manifest.xml file. If you create a content.xml file that contains word processor elements, add a manifest that points to that file, and zip the two together, OpenOffice.org will be able to open the result successfully. The result will be a plain text-only document with no styles. You won’t have any of the meta-information about who created the file or when it was last edited, and the printer settings, view area, and zoom factor will be set to the OpenOffice.org defaults.
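As a rough illustration of how little is required, here is a minimal sketch that builds the text of the two necessary files. The subroutine names are ours, invented for this example; the namespace URIs and the media type are the ones we believe OpenDocument 1.0 defines, so check them against the specification before relying on them. If you pass a directory name, the sketch writes both files there; zipping them is left to you.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(make_path);

# Namespace URIs and media type (assumed from the OpenDocument 1.0 spec)
my $OFFICE_NS   = 'urn:oasis:names:tc:opendocument:xmlns:office:1.0';
my $TEXT_NS     = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';
my $MANIFEST_NS = 'urn:oasis:names:tc:opendocument:xmlns:manifest:1.0';
my $MEDIA_TYPE  = 'application/vnd.oasis.opendocument.text';

# Escape the characters that are special in XML text content.
sub xml_escape
{
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build a content.xml holding a single unstyled paragraph.
sub minimal_content
{
    my ($paragraph) = @_;
    my $escaped = xml_escape($paragraph);
    return <<"XML";
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="$OFFICE_NS"
    xmlns:text="$TEXT_NS" office:version="1.0">
  <office:body>
    <office:text>
      <text:p>$escaped</text:p>
    </office:text>
  </office:body>
</office:document-content>
XML
}

# Build a manifest that points to content.xml.
sub minimal_manifest
{
    return <<"XML";
<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="$MANIFEST_NS">
  <manifest:file-entry manifest:media-type="$MEDIA_TYPE"
      manifest:full-path="/"/>
  <manifest:file-entry manifest:media-type="text/xml"
      manifest:full-path="content.xml"/>
</manifest:manifest>
XML
}

# Write a string to a file, dying if the file cannot be created.
sub write_file
{
    my ($path, $data) = @_;
    open(my $out, '>', $path) or die "Cannot write $path: $!\n";
    print {$out} $data;
    close($out);
}

# If a directory name is given, write both files there, ready to be zipped.
if (@ARGV == 1)
{
    my $dir = $ARGV[0];
    make_path("$dir/META-INF");
    write_file("$dir/content.xml", minimal_content('Hello, OpenDocument!'));
    write_file("$dir/META-INF/manifest.xml", minimal_manifest());
}
```

Run it with a directory name, then zip the contents of that directory (not the directory itself) so that content.xml sits at the top level of the archive.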

The meta.xml file contains information about the document itself. We’ll look at the elements found in this file in decreasing order of importance; at the end of this section, we will list them in the order in which they appear in a document. Most of these elements are reflected in the tabs on OpenOffice.org’s File/Properties dialog, which are shown in Figure 2.1, “General Document Properties”, Figure 2.2, “Document Description”, Figure 2.3, “User-defined Information”, and Figure 2.4, “Document Statistics”.

All elements borrowed from the Dublin Core namespace contain text and have no attributes. Table 2.1, “Dublin Core Elements in meta.xml” summarizes them.

Table 2.1. Dublin Core Elements in meta.xml

Element | Description | Sample from XML file

<dc:title>

The document title; this appears in the title bar.

<dc:title>An Introduction to Digital Cameras</dc:title>

<dc:subject>

The Dublin Core recommends that this element contain keywords or key phrases to describe the topic of the document; OpenOffice.org keeps keywords in a separate set of elements.

<dc:subject>Digital Photography</dc:subject>

<dc:description>

This element’s content is shown in the Comments field in the dialog box.

<dc:description>This introduction…</dc:description>

<dc:creator>

This element’s content is shown in the Modified field in Figure 2.1, “General Document Properties”; it names the last person to edit the file. This may appear odd, but the Dublin Core says that the creator is simply an “entity primarily responsible for making the content of the resource.” That is not necessarily the original creator, whose name is stored in a different element.

<dc:creator>J David Eisenberg</dc:creator>

<dc:date>

This element’s content is also shown in the Modified field in Figure 2.1, “General Document Properties”. It is stored in a form compatible with ISO-8601. The time is shown in local time. See the section called “Time and Duration Formats” for details about times and dates.

<dc:date>2005-05-30T20:30:30</dc:date>

<dc:language>

The document’s language, written as a two- or three-letter main language code followed by a two-letter sublanguage code. This field is not shown in the properties dialog, but is found in OpenOffice.org’s Tools/Options/Language Settings dialog.

<dc:language>en-US</dc:language>

The remaining elements in the meta.xml file come from OpenDocument’s meta namespace. Table 2.2, “OpenDocument Elements in meta.xml” describes these elements in the order in which they appear in the file.

Table 2.2. OpenDocument Elements in meta.xml

Element | Description | Sample from XML file

<meta:generator>

The program that created this document. According to the specification, you should not “fake” being OpenOffice.org if you are creating the document using a different program; you should use a unique identifier.

<meta:generator>OpenOffice.org/1.9.100$Linux OpenOffice.org_project/680m100$Build-8909</meta:generator>

<meta:initial-creator>

The user who created the document. This is shown in the "Created:" area in Figure 2.1, “General Document Properties”.

<meta:initial-creator>Steven Eisenberg</meta:initial-creator>

<meta:creation-date>

The date and time when the document was created. This is shown in the “Created:” area in Figure 2.1, “General Document Properties”. It is in the same format as described in the section called “Time and Duration Formats”.

<meta:creation-date>2005-05-30T20:29:42</meta:creation-date>

<meta:keyword>

A document can have multiple <meta:keyword> elements. These elements reflect the entries in the “Keywords:” area in Figure 2.2, “Document Description”.


<meta:keyword>photography</meta:keyword>
<meta:keyword>cameras</meta:keyword>
<meta:keyword>optics</meta:keyword>
<meta:keyword>digital cameras</meta:keyword>

<meta:editing-cycles>

This element tells how many times the file has been edited; this is the “Revision Number:” in Figure 2.1, “General Document Properties”.

<meta:editing-cycles>5</meta:editing-cycles>

<meta:editing-duration>

This element tells the total amount of time that has been spent editing the document in all editing sessions; this is the “Editing time:” in Figure 2.1, “General Document Properties”, and is represented as described in the section called “Time and Duration Formats”.

<meta:editing-duration>PT1H28M55S</meta:editing-duration>

<meta:user-defined>

OpenOffice.org allows you to define your own information, as shown in Figure 2.3, “User-defined Information”. This element has a meta:name attribute, giving the “title” of this information, and the content of the element is the information itself.

<meta:user-defined meta:name="Maximum Length">3 pages or
750 words</meta:user-defined>

<meta:document-statistic>

This is the information shown on the statistics tab of the properties dialog (see Figure 2.4, “Document Statistics”). This element has attributes whose names are largely self-explanatory, and are listed in Table 2.3, “Attributes of the <meta:document-statistic> Element”.

<meta:document-statistic meta:paragraph-count="4"…/>
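A value such as the one in <meta:editing-duration> is easier to work with once it is converted to a number. The following is a minimal sketch, assuming the duration always has the hours/minutes/seconds form shown in the sample above (PT…H…M…S); it does not attempt days, fractional seconds, or the other forms that ISO-8601 allows. The subroutine name is ours.

```perl
#!/usr/bin/perl
use strict;
use warnings;

#
#   Convert a duration such as PT1H28M55S to a number of seconds.
#   Returns undef if the string is not in the PT..H..M..S form.
#
sub duration_to_seconds
{
    my ($duration) = @_;
    my ($hr, $min, $sec) =
        $duration =~ m/^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$/
        or return;
    # Any part that was absent from the string counts as zero
    return ($hr // 0) * 3600 + ($min // 0) * 60 + ($sec // 0);
}

print duration_to_seconds('PT1H28M55S'), "\n";   # prints 5335
```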

Now that we know what the format of the meta file is, let’s construct a Perl program to extract that information. Again, rather than reinvent the wheel, we will use two existing modules from the Comprehensive Perl Archive Network, CPAN (http://www.cpan.org/). The first of these, Archive::Zip::MemberRead, will let us read the meta.xml file directly from a compressed OpenDocument document. We will use the XML::Simple module to do the main work of the extraction program.
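To give an idea of what reading a member of the compressed file involves, here is a minimal sketch using Archive::Zip::MemberRead. It is our own illustration, not the member_read.pl program that the extraction program pipes from, and the subroutine name read_member is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip;
use Archive::Zip::MemberRead;

#
#   Return the text of one member (for example, meta.xml)
#   of a zipped document.
#
sub read_member
{
    my ($zip_name, $member_name) = @_;
    my $zip = Archive::Zip->new($zip_name)
        or die "Cannot open $zip_name as a zip file\n";
    my $fh = Archive::Zip::MemberRead->new($zip, $member_name);
    my @lines;
    # getline() returns each line without its line ending
    while (defined(my $line = $fh->getline()))
    {
        push @lines, $line;
    }
    $fh->close();
    return join("\n", @lines);
}

# Usage: perl this_script.pl document [member]
print read_member($ARGV[0], $ARGV[1] || 'meta.xml'), "\n"
    if (@ARGV);
```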

The program that actually does the extraction, Example 2.3, “Program show_meta.pl”, takes one argument: the OpenDocument filename. The program receives its input from the piped output of member_read.pl.

After the file is parsed, the program prints the data. Information in the <meta:document-statistic> is selected depending upon the type of document being parsed. The program also uses the Text::Wrap module to format the description, which may be several lines long. [This is program show_meta.pl in directory ch02 in the downloadable example files.]

Example 2.3. Program show_meta.pl

#!/usr/bin/perl

#
#   Show meta-information from an OpenDocument meta.xml file.
#
use XML::Simple;
use IO::File;
use Text::Wrap;
use Carp;
use strict;

my $suffix;     # file suffix

#
#   Check for one argument: the name of the OpenOffice.org document
#
if (scalar @ARGV != 1)
{
    croak("Usage: $0 document");
}

#
#   Get file suffix for later reference
#
($suffix) = $ARGV[0] =~ m/\.(\w\w\w)$/;

#
#   Parse and collect information into the $meta hash reference
#
$ARGV[0] =~ s/[;|'"`\$&<>()\\]//g;  # eliminate dangerous shell metacharacters
my $fh = IO::File->new("perl member_read.pl $ARGV[0] meta.xml |");
my $xml= XMLin( $fh, forcearray => [ 'meta:keyword'] );
my $meta= $xml->{'office:meta'};

#
#   Output phase
#
print "Title:       $meta->{'dc:title'}\n"
    if ($meta->{'dc:title'});
print "Subject:     $meta->{'dc:subject'}\n"
    if ($meta->{'dc:subject'});

if ($meta->{'dc:description'})
{
    print "Description:\n";
    $Text::Wrap::columns = 60;
    print wrap("\t", "\t", $meta->{'dc:description'}), "\n";
}

print "Created:     ";
print format_date($meta->{'meta:creation-date'});
print " by $meta->{'meta:initial-creator'}"
    if ($meta->{'meta:initial-creator'});
print "\n";

print "Last edit:   ";
print format_date($meta->{"dc:date"});
print " by $meta->{'dc:creator'}"
    if ($meta->{'dc:creator'});
print "\n";

# Display keywords (each one is a separate <meta:keyword> element,
# collected into an array by the forcearray option above)
#
print "Keywords:    ", join( ' - ', @{$meta->{'meta:keyword'}}), "\n"
    if ($meta->{'meta:keyword'});

#
#   Take attributes from the meta:document-statistic element
#   (if any) and put them into the $statistics hash reference.
#   OpenDocument text files end in .odt and spreadsheets in .ods;
#   the older OpenOffice.org 1.x suffixes are accepted as well.
#
my $statistics= $meta->{'meta:document-statistic'};
if ($suffix eq "odt" || $suffix eq "sxw")
{
        print "Pages:       $statistics->{'meta:page-count'}\n";
        print "Words:       $statistics->{'meta:word-count'}\n";
        print "Tables:      $statistics->{'meta:table-count'}\n";
        print "Images:      $statistics->{'meta:image-count'}\n";
}
elsif ($suffix eq "ods" || $suffix eq "sxc")
{
        print "Sheets:      $statistics->{'meta:table-count'}\n";
        print "Cells:       $statistics->{'meta:cell-count'}\n"
                if ($statistics->{'meta:cell-count'});
}


#
#   A convenience subroutine to make dates look
#   prettier than ISO-8601 format.
#
sub format_date
{
    my $date = shift;
    my @monthlist = qw (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

    # Return a placeholder if the date is missing or not in the expected form
    return "(unknown)" unless defined $date;
    my ($year, $month, $day, $hr, $min, $sec) =
        $date =~ m/(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})/
        or return "(unknown)";
    return "$hr:$min on $day $monthlist[$month-1] $year";
}

These two lines from the preceding program are where all the parsing takes place:

my $fh = IO::File->new("perl member_read.pl $ARGV[0] meta.xml |");
my $xml= XMLin( $fh, forcearray => [ 'meta:keyword'] );

In the first line, we used IO::File->new, because our version of Perl wouldn’t read from a file handle opened with the standard Perl open(). In the second line, the forcearray parameter will force the content of the <meta:keyword> element to be an array type, even if there is only one element. This avoids scalar versus array problems.
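The scalar versus array problem is easy to see in isolation. The following sketch parses a made-up fragment containing a single keyword, once without and once with the forcearray option; the fragment and the meta namespace URI are our own assumptions for the demonstration, not taken from a real document.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# A made-up fragment with exactly one keyword
my $one_keyword = <<'XML';
<office:meta
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0">
  <meta:keyword>photography</meta:keyword>
</office:meta>
XML

my $plain  = XMLin($one_keyword);
my $forced = XMLin($one_keyword, forcearray => ['meta:keyword']);

# Without forcearray, a lone element collapses to a plain string
print ref($plain->{'meta:keyword'})  || 'scalar', "\n";   # prints scalar
# With forcearray, it is always an array reference
print ref($forced->{'meta:keyword'}) || 'scalar', "\n";   # prints ARRAY
```

Code that dereferences the keyword entry as an array would fail on the first form, which is why the option is worth setting even when most documents have several keywords.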

While XML::Simple is the easiest way to accomplish this task, it is not the most flexible way to parse XML. For more general XML parsing, you probably want to use the XML::SAX module. The section called “Showing Meta-information Using SAX” shows this same program written with the XML::SAX module.

The styles.xml file contains information about the styles that are used in the document. Some of this information is also duplicated in the content.xml document.

The styles.xml file begins with an <office:document-styles> element, which contains font declarations (<office:font-face-decls>), default and named styles (<office:styles>), “automatic,” or unnamed, styles (<office:automatic-styles>), and master styles (<office:master-styles>). All of these elements are optional.

The <office:styles> element is a container for (among other things) default styles and named styles. In OpenOffice.org, these are set with the Stylist tool. A spreadsheet’s <office:styles> element will also contain information about style for numbers, currency, percentage values, dates, times, and boolean data. A drawing will have information about default gradients, hatch patterns, fill images, markers, and dash patterns for drawing lines.

The most important elements that you will find within <office:styles> are <style:default-style> and <style:style>. Both elements contain a style:family attribute which tells what “level” the style applies to. The possible values of this required attribute are text (character level), paragraph, section, table, table-column, table-row, table-cell, table-page, chart, graphic, default, drawing-page, presentation, control, and ruby.[1]

Both <style:default-style> and <style:style> have a style:name attribute. Styles built in to OpenOffice.org’s stylist, or ones that you create there, will have names like Heading_20_1 or Custom_20_Citation. Non-alphanumeric characters in names are converted to hexadecimal; thus blanks are converted to _20_. A style named !wow?#@$ would be stored as _21_wow_3f__23__40__24_. Automatic styles will have names consisting of a one- or two-letter abbreviation followed by a number; a style name such as T1 is the first automatic style for style:family="text"; P3 would be the third style for paragraphs, ta2 would be the second style for a table, ro4 would be the fourth style for a table row, etc.
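The conversion between a style’s display name and its stored name can be sketched in a few lines of Perl. This follows the rule exactly as stated above (every non-alphanumeric character becomes its hexadecimal code between underscores); OpenOffice.org itself may leave some characters alone, so treat this as an approximation. Both subroutine names are ours.

```perl
#!/usr/bin/perl
use strict;
use warnings;

#
#   Convert a display name to the form stored in style:name.
#
sub encode_style_name
{
    my ($name) = @_;
    $name =~ s/([^A-Za-z0-9])/sprintf('_%x_', ord($1))/ge;
    return $name;
}

#
#   Convert a stored style:name back to its display form.
#
sub decode_style_name
{
    my ($name) = @_;
    $name =~ s/_([0-9a-f]+)_/chr(hex($1))/gei;
    return $name;
}

print encode_style_name('Heading 1'), "\n";                  # prints Heading_20_1
print decode_style_name('_21_wow_3f__23__40__24_'), "\n";    # prints !wow?#@$
```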

The other attribute of interest is the optional style:parent-style-name, which you will find in styles that have been derived from other styles. In a text document, OpenOffice.org will often create a temporary style whose parent is the style found in the styles.xml file.

Within each <style:style> or <style:default-style>, you will find a <style:family-properties> element, which describes the style in minute detail via an immense number of attributes. The family part of the element name corresponds to the style:family attribute; a style with style:family="table" will contain a <style:table-properties> element, one with style:family="paragraph" will contain a <style:paragraph-properties> element, and so forth.

A full discussion of styles is beyond the scope of this book, so we will simply give you an idea of the range of style specifications, and take up specific details of styles when they are relevant in other chapters. Example 2.4, “Style Definition in a Word Processing Document”, Example 2.5, “Style Definition in a Spreadsheet Document”, and Example 2.6, “Style Definition in a Drawing Document” are excerpts from the styles.xml files in a word processing, spreadsheet, and drawing document, respectively.

Although the details of the content.xml vary widely depending upon the type of document you are dealing with, there are elements which are common to all content.xml files. The root element is the <office:document-content> element. It defines all the namespaces that will be used throughout the document. The office:version attribute tells you which version of OpenDocument was used in the document.

The following elements are contained within the <office:document-content> element. The optional <office:scripts> element does appear in most documents and is always empty, even if your document contains macros. Go figure.

The <office:scripts> element is followed by elements that describe the document’s presentation. The optional <office:font-face-decls> element describes fonts used in your document, and duplicates the information found in styles.xml. If you have defined any styles “on the fly,” then these automatic styles are described in the optional <office:automatic-styles> element.

The last child element of <office:document-content> is the required, and all-important, <office:body> element. This is where all the action is, and we will spend much of the rest of this book examining its contents. Its first child element tells which kind of document we are dealing with; for example, <office:text> for a word processing document or <office:spreadsheet> for a spreadsheet.

Example 2.7, “Structure of the content.xml file” shows the skeleton for an OpenOffice.org document’s content.xml file.
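Because the first child of <office:body> identifies the kind of document, a program can decide how to proceed by looking at that one element. The sketch below does this with a regular expression on the text of content.xml. It assumes the conventional office: prefix is in use, and the table of element names is our reading of the OpenDocument specification; a robust program would use a namespace-aware parser instead. The subroutine name is ours.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Child elements of <office:body> (assumed from the OpenDocument spec)
my %KIND = (
    'office:text'         => 'word processing document',
    'office:spreadsheet'  => 'spreadsheet',
    'office:drawing'      => 'drawing',
    'office:presentation' => 'presentation',
    'office:chart'        => 'chart',
    'office:image'        => 'image',
);

#
#   Given the text of a content.xml file, describe the kind of
#   document. Returns undef if no <office:body> child is found.
#
sub document_kind
{
    my ($content_xml) = @_;
    my ($child) =
        $content_xml =~ m/<office:body\b[^>]*>\s*<(office:[\w-]+)/
        or return;
    return $KIND{$child} || "unknown ($child)";
}

my $sample = '<office:document-content><office:body>'
    . '<office:text><text:p>Hi</text:p></office:text>'
    . '</office:body></office:document-content>';
print document_kind($sample), "\n";   # prints word processing document
```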



[1] Ruby refers to “furigana,” which are small Japanese alphabetic characters placed near the Japanese ideograms to aid readers in determining their correct meaning.


Copyright (c) 2005 O’Reilly & Associates, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".