Though content.xml is king, monarchs rule better when surrounded by able assistants. In an OpenDocument JAR file, these assistants are the meta.xml, styles.xml, and settings.xml files. In this chapter, we will examine the assistant files, and then describe the general structure of the content.xml file.
The only files that are actually necessary are content.xml and the META-INF/manifest.xml file. If you create a content.xml file that contains word processor elements, add a manifest that points to that file, and zip the two up, OpenOffice.org will be able to open the result successfully. The result will be a plain text-only document with no styles. You won’t have any of the meta-information about who created the file or when it was last edited, and the printer settings, view area, and zoom factor will be set to the OpenOffice.org defaults.
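As a rough illustration of the two-file package described above, here is a minimal sketch that uses the Archive::Zip module. The output name minimal.odt and the paragraph text are made up for the example; the namespace URIs and media type are the ones defined by OpenDocument. Whether a particular application accepts a package with no other members is the claim made above, not something this sketch verifies.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# A content.xml with one paragraph and nothing else (illustrative text).
my $content = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    office:version="1.0">
  <office:body>
    <office:text>
      <text:p>Hello, OpenDocument.</text:p>
    </office:text>
  </office:body>
</office:document-content>
XML

# A manifest that names the package type and points to content.xml.
my $manifest = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest
    xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
  <manifest:file-entry
      manifest:media-type="application/vnd.oasis.opendocument.text"
      manifest:full-path="/"/>
  <manifest:file-entry
      manifest:media-type="text/xml"
      manifest:full-path="content.xml"/>
</manifest:manifest>
XML

# Add both members from memory and write the archive.
my $zip = Archive::Zip->new();
$zip->addString($content,  'content.xml');
$zip->addString($manifest, 'META-INF/manifest.xml');

my $status = $zip->writeToFileNamed('minimal.odt');   # hypothetical file name
die "could not write minimal.odt\n" unless $status == AZ_OK;
```

Building the members from strings avoids creating a temporary directory just to zip it up.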
The settings.xml file contains information intended for use exclusively by the application that created the file. From the viewpoint of an external application, there’s very little of use in this file, so we’ll just take a brief look at it before bidding it a fond farewell.
The root element, <office:document-settings>, contains an <office:settings> element, which in turn contains one or more <config:config-item-set> entries. Each of these contains one or more items, named item maps, indexed item maps, or other <config:config-item-set>s.
The <config:config-item> element has a config:name attribute that describes the item and a config:type attribute which can be one of boolean, short, int, long, double, string, datetime, or base64Binary. The content of the element gives the value of that item. Example 2.1, “Example of Configuration Items” shows some representative configuration items from a word processing document:
Example 2.1. Example of Configuration Items
    <config:config-item config:name="PrinterName" config:type="string">Generic Printer</config:config-item>
    <config:config-item config:name="ViewLeft" config:type="int">2043</config:config-item>
    <config:config-item config:name="ShowRedlineChanges" config:type="boolean">true</config:config-item>
The <config:config-item-map-named> element contains one or more <config:config-item-map-entry> sub-elements. Each of these map entries may contain one or more items, item sets, named item maps, or indexed item maps (yes, this is a very recursive data structure). Entries in a named item map are accessed by their unique name attribute. Spreadsheets use a named item map to store information about each of the sheets in the document.
A <config:config-item-map-indexed> element also contains one or more <config:config-item-map-entry> elements. Each of these map entries may contain one or more items, item sets, named item maps, or indexed item maps. The order of the individual map entries is important; entries are accessed by their position, not by their unique name attribute.
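Because the structure is recursive, a recursive subroutine is a natural way to walk it. The following is only a sketch: it uses XML::Simple with ForceArray and an empty KeyAttr so that every child element arrives as an array of hashes, and the sample fragment (including the set name view-settings, the map name Views, and the ZoomFactor item) is invented for illustration rather than copied from a real settings.xml file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Illustrative fragment: one plain item and one indexed map with one entry.
my $xml = <<'XML';
<config:config-item-set
    xmlns:config="urn:oasis:names:tc:opendocument:xmlns:config:1.0"
    config:name="view-settings">
  <config:config-item config:name="ViewLeft" config:type="int">2043</config:config-item>
  <config:config-item-map-indexed config:name="Views">
    <config:config-item-map-entry>
      <config:config-item config:name="ZoomFactor" config:type="short">100</config:config-item>
    </config:config-item-map-entry>
  </config:config-item-map-indexed>
</config:config-item-set>
XML

# Recursively collect "path = value" strings for every config item.
sub collect_items {
    my ($node, $path, $found) = @_;
    for my $tag (sort keys %$node) {
        my $children = $node->{$tag};
        next unless ref($children) eq 'ARRAY';    # skip attributes
        my $position = 0;
        for my $child (@$children) {
            $position++;
            next unless ref($child) eq 'HASH';
            # Named things carry config:name; entries of an indexed map
            # are identified only by their position.
            my $label = $child->{'config:name'} // "[$position]";
            if ($tag eq 'config:config-item') {
                push @$found, "$path/$label = " . ($child->{content} // '');
            }
            else {
                collect_items($child, "$path/$label", $found);
            }
        }
    }
    return $found;
}

my $settings = XMLin($xml, ForceArray => 1, KeyAttr => []);
my $items = collect_items($settings, $settings->{'config:name'}, []);
print "$_\n" for @$items;
```

The same subroutine handles item sets, named maps, and indexed maps, because each of them is just another element whose children are either items or further containers.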
The meta.xml file contains information about the document itself. We’ll look at the elements found in this file in decreasing order of importance; at the end of this section, we will list them in the order in which they appear in a document. Most of these elements are reflected in the tabs on OpenOffice.org’s File/Properties dialog, which are shown in Figure 2.1, “General Document Properties”, Figure 2.2, “Document Description”, Figure 2.3, “User-defined Information”, and Figure 2.4, “Document Statistics”.
All elements borrowed from the Dublin Core namespace contain text and have no attributes. Table 2.1, “Dublin Core Elements in meta.xml” summarizes them.
Table 2.1. Dublin Core Elements in meta.xml
Element | Description | Sample from XML file |
---|---|---|
<dc:title> | The document title; this appears in the title bar. | <dc:title>An Introduction to Digital Cameras</dc:title> |
<dc:subject> | The Dublin Core recommends that this element contain keywords or key phrases to describe the topic of the document; OpenOffice.org keeps keywords in a separate set of elements. | <dc:subject>Digital Photography</dc:subject> |
<dc:description> | This element’s content is shown in the Comments field in the dialog box. | <dc:description>This introduction…</dc:description> |
<dc:creator> | This element’s content is shown in the Modified field in Figure 2.1, “General Document Properties”; it names the last person to edit the file. This may appear odd, but the Dublin Core says that the creator is simply an “entity primarily responsible for making the content of the resource.” That is not necessarily the original creator, whose name is stored in a different element. | <dc:creator>J David Eisenberg</dc:creator> |
<dc:date> | This element’s content is also shown in the Modified field in Figure 2.1, “General Document Properties”. It is stored in a form compatible with ISO-8601. The time is shown in local time. See the section called “Time and Duration Formats” for details about times and dates. | <dc:date>2005-05-30T20:30:30</dc:date> |
<dc:language> | The document’s language, written as a two- or three-letter main language code followed by a two-letter sublanguage code. This field is not shown in the properties dialog, but is found in OpenOffice.org’s Tools/Options/Language Settings dialog. | <dc:language>en-US</dc:language> |
The remaining elements in the meta.xml file come from OpenDocument’s meta namespace. Table 2.2, “OpenDocument Elements in meta.xml” describes these elements in the order in which they appear in the file.
Table 2.2. OpenDocument Elements in meta.xml
Element | Description | Sample from XML file |
---|---|---|
<meta:generator> | The program that created this document. According to the specification, you should not “fake” being OpenOffice.org if you are creating the document using a different program; you should use a unique identifier. | <meta:generator>OpenOffice.org/1.9.100$Linux OpenOffice.org_project/680m100$Build-8909</meta:generator> |
<meta:initial-creator> | The user who created the document. This is shown in the “Created:” area in Figure 2.1, “General Document Properties”. | <meta:initial-creator>Steven Eisenberg</meta:initial-creator> |
<meta:creation-date> | The date and time when the document was created. This is shown in the “Created:” area in Figure 2.1, “General Document Properties”. It is in the same format as described in the section called “Time and Duration Formats”. | <meta:creation-date>2005-05-30T20:29:42</meta:creation-date> |
<meta:keyword> | A document can have multiple <meta:keyword> elements. These elements reflect the entries in the “Keywords:” area in Figure 2.2, “Document Description”. | <meta:keyword>photography</meta:keyword> <meta:keyword>cameras</meta:keyword> <meta:keyword>optics</meta:keyword> <meta:keyword>digital cameras</meta:keyword> |
<meta:editing-cycles> | This element tells how many times the file has been edited; this is the “Revision Number:” in Figure 2.1, “General Document Properties”. | <meta:editing-cycles>5</meta:editing-cycles> |
<meta:editing-duration> | This element tells the total amount of time that has been spent editing the document in all editing sessions; this is the “Editing time:” in Figure 2.1, “General Document Properties”, and is represented as described in the section called “Time and Duration Formats”. | <meta:editing-duration>PT1H28M55S</meta:editing-duration> |
<meta:user-defined> | OpenOffice.org allows you to define your own information, as shown in Figure 2.3, “User-defined Information”. This element has a meta:name attribute, giving the “title” of this information, and the content of the element is the information itself. | <meta:user-defined meta:name="Maximum Length">3 pages or 750 words</meta:user-defined> |
<meta:document-statistic> | This is the information shown on the statistics tab of the properties dialog (see Figure 2.4, “Document Statistics”). This element has attributes whose names are largely self-explanatory, and are listed in Table 2.3, “Attributes of the <meta:document-statistic> Element”. | <meta:document-statistic meta:paragraph-count="4"…/> |
Table 2.3. Attributes of the <meta:document-statistic> Element
Attribute | Description |
---|---|
meta:page-count | Number of pages in a word processing document. This must be greater than zero. This attribute is not used in spreadsheets. The “number of pages” shown in the statistics dialog for a spreadsheet is a calculated value that tells how many sheets have filled cells on them, and this can be zero for a totally empty spreadsheet. |
meta:paragraph-count | Number of paragraphs in a word processing document. |
meta:word-count | Number of words in a word processing document. |
meta:character-count | Number of characters in a word processing document. |
meta:image-count | Number of images in a word processing document. |
meta:table-count | Number of tables in a word processing document, or number of sheets in a spreadsheet document. |
meta:cell-count | Number of non-empty cells in a spreadsheet document. |
meta:object-count | Number of objects in a document. This is shown as “Number of OLE objects” in the dialog box of Figure 2.4, “Document Statistics”. This attribute is used in drawing and presentation documents, but it does not bear any simple relationship to the number of items you see on the screen. |
meta:ole-object-count | Apparently unused in OpenOffice.org 2.0. |
meta:row-count | Apparently unused in OpenOffice.org 2.0. |
meta:draw-count | Apparently unused in OpenOffice.org 2.0. |
The dates, times, and durations used in the metadata are patterned after the format described in the ISO 8601 standard. A date is written as a four-digit year, two-digit month, and two-digit day separated by hyphens. The capital letter T separates the date from the time, which is written in the form hh:mm:ss.
OpenOffice.org does not implement the full ISO 8601 standard. For example, you may not use a truncated form such as --06-20 for a date, nor may you add a time zone offset after the time.
When you insert a date or time field into a text document, the seconds field is followed by a comma and decimal fraction of a second. Thus, 2005-06-01T09:54:26,50 represents 9:54 and 26.5 seconds on the 1st of June, 2005.
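To make the format concrete, here is a small sketch of a subroutine that splits such a string into its fields. The name parse_datetime and the keys of the returned hash are our own choices, and the pattern accepts only the forms described above (an optional comma and decimal fraction after the seconds, and no time zone offset).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a date-time string such as 2005-06-01T09:54:26,50 into fields.
# Returns a hash reference, or nothing if the string does not match.
sub parse_datetime {
    my ($text) = @_;
    my ($year, $month, $day, $hour, $minute, $second, $fraction) =
        $text =~ m/^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:,(\d+))?$/
        or return;

    # Fold the optional decimal fraction into the seconds value.
    $second += "0.$fraction" if defined $fraction;

    return {
        year   => $year + 0,
        month  => $month + 0,
        day    => $day + 0,
        hour   => $hour + 0,
        minute => $minute + 0,
        second => $second + 0,
    };
}

my $when = parse_datetime('2005-06-01T09:54:26,50');
printf "%d-%02d-%02d at %02d:%02d and %s seconds\n",
    @{$when}{qw(year month day hour minute second)};
```

Anchoring the pattern at both ends means that a truncated date or a trailing time zone offset is rejected rather than silently half-parsed.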
Time durations, such as those in the <meta:editing-duration> element, describe a length of days, hours, minutes, and seconds, written in the form PdDThHmMsS. If the editing time is less than one day, the dD is omitted. Thus, PT12M34S describes a duration of twelve minutes and thirty-four seconds. A duration may not specify a number of years or months as described in the ISO 8601 standard.
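A duration in this form is easy to convert to a number of seconds. The subroutine below is a sketch under the assumptions stated in the text: the letter T is always present, only days, hours, minutes, and whole seconds appear, and years and months are not allowed. The name duration_to_seconds is ours.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a duration such as PT1H28M55S or P2DT3H into seconds.
# Returns nothing if the string is not in the PdDThHmMsS form.
sub duration_to_seconds {
    my ($text) = @_;
    my ($days, $hours, $minutes, $seconds) =
        $text =~ m/^P(?:(\d+)D)?T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$/
        or return;

    # Any part that was omitted counts as zero.
    return ((($days // 0) * 24 + ($hours // 0)) * 60 + ($minutes // 0)) * 60
        + ($seconds // 0);
}

for my $duration ('PT12M34S', 'PT1H28M55S') {
    print "$duration is ", duration_to_seconds($duration), " seconds\n";
}
```

For PT12M34S this gives 12 × 60 + 34 = 754 seconds, and for the PT1H28M55S sample from Table 2.2 it gives 3600 + 28 × 60 + 55 = 5335 seconds.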
Now that we know what the format of the meta file is, let’s construct a Perl program to extract that information. Again, rather than reinvent the wheel, we will use two existing modules from the Comprehensive Perl Archive Network, CPAN (http://www.cpan.org/). The first of these, Archive::Zip::MemberRead, will let us read the meta.xml file directly from a compressed OpenDocument document. We will use the XML::Simple module to do the main work of the extraction program.
It would be inefficient to unpack the document into a temporary directory, read the meta.xml file, and then remove the temporary directory. Archive::Zip::MemberRead avoids this problem by letting you read members of a ZIP archive without having to unpack them. Example 2.2, “Program member_read.pl” shows the program that uses this module. The program takes two arguments: the OpenDocument file name and the member of the packed document that you want to read. The program parcels out the data in 32 kilobyte chunks rather than reading it in as one huge string. It also sends its output to standard output so that it can be piped to other processes. [This is program member_read.pl in directory ch02 in the downloadable example files.]
Example 2.2. Program member_read.pl
    #!/usr/bin/perl
    use Archive::Zip;
    use Archive::Zip::MemberRead;
    use Carp;
    use strict 'vars';

    my $zip;        # the zip file
    my $fh;         # filehandle to the member being read
    my $buffer;     # 32 kilobyte buffer

    #
    # Extract a single XML file from an OpenOffice.org file
    # Output goes to standard output
    #
    if (scalar @ARGV != 2)
    {
        croak("Usage: $0 document xmlfilename");
    }

    $zip = Archive::Zip->new($ARGV[0]);
    if (!$zip)
    {
        croak("$ARGV[0] cannot be opened as a .ZIP file");
    }

    $fh = Archive::Zip::MemberRead->new($zip, $ARGV[1]);
    if (!$fh)
    {
        croak("$ARGV[0] does not contain a file named $ARGV[1]");
    }

    # Copy the member to standard output, 32 kilobytes at a time
    while ($fh->read($buffer, 32*1024))
    {
        print $buffer;
    }
The XML::Simple module does all the heavy work of parsing an XML file. You just give it the name of a file to parse, and it returns a reference to a hash which contains nested hashes. For example, given this document:
    <document>
        <h1>Information</h1>
        <para align="left" style="color:red;">More info</para>
    </document>
XML::Simple returns a data structure as if we had written this Perl statement:
    $result = {
        'h1' => 'Information',
        'para' => {
            'align' => 'left',
            'content' => 'More info',
            'style' => 'color:red;'
        }
    };
The program that actually does the extraction, Example 2.3, “Program show_meta.pl”, takes one argument: the OpenDocument filename. The program receives its input from the piped output of member_read.pl.
After the file is parsed, the program prints the data. Information in the <meta:document-statistic> is selected depending upon the type of document being parsed. The program also uses the Text::Wrap module to format the description, which may be several lines long. [This is program show_meta.pl in directory ch02 in the downloadable example files.]
Example 2.3. Program show_meta.pl
    #!/usr/bin/perl
    #
    # Show meta-information from an OpenDocument meta.xml file.
    #
    use XML::Simple;
    use IO::File;
    use Text::Wrap;
    use Carp;
    use strict;

    my $suffix;     # file suffix

    #
    # Check for one argument: the name of the OpenOffice.org document
    #
    if (scalar @ARGV != 1)
    {
        croak("Usage: $0 document");
    }

    #
    # Get file suffix for later reference
    #
    ($suffix) = $ARGV[0] =~ m/\.(\w\w\w)$/;

    #
    # Parse and collect information into the $meta hash reference
    #
    $ARGV[0] =~ s/[;|'"]//g;  #eliminate dangerous shell metacharacters
    my $fh = IO::File->new("perl member_read.pl $ARGV[0] meta.xml |");
    my $xml= XMLin( $fh, forcearray => [ 'meta:keyword'] );
    my $meta= $xml->{'office:meta'};

    #
    # Output phase
    #
    print "Title:       $meta->{'dc:title'}\n"
        if ($meta->{'dc:title'});
    print "Subject:     $meta->{'dc:subject'}\n"
        if ($meta->{'dc:subject'});

    if ($meta->{'dc:description'})
    {
        print "Description:\n";
        $Text::Wrap::columns = 60;
        print wrap("\t", "\t", $meta->{'dc:description'}), "\n";
    }

    print "Created:     ";
    print format_date($meta->{'meta:creation-date'});
    print " by $meta->{'meta:initial-creator'}"
        if ($meta->{'meta:initial-creator'});
    print "\n";

    print "Last edit:   ";
    print format_date($meta->{"dc:date"});
    print " by $meta->{'dc:creator'}"
        if ($meta->{'dc:creator'});
    print "\n";

    #
    # Display keywords; forcearray guarantees an array reference
    # even when there is only one <meta:keyword> element
    #
    print "Keywords:    ", join( ' - ', @{$meta->{'meta:keyword'}}), "\n"
        if ($meta->{'meta:keyword'});

    #
    # Take attributes from the meta:document-statistic element
    # (if any) and put them into the $statistics hash reference
    #
    my $statistics= $meta->{'meta:document-statistic'};

    if ($suffix eq "odt")       # OpenDocument text
    {
        print "Pages:       $statistics->{'meta:page-count'}\n";
        print "Words:       $statistics->{'meta:word-count'}\n";
        print "Tables:      $statistics->{'meta:table-count'}\n";
        print "Images:      $statistics->{'meta:image-count'}\n";
    }
    elsif ($suffix eq "ods")    # OpenDocument spreadsheet
    {
        print "Sheets:      $statistics->{'meta:table-count'}\n";
        print "Cells:       $statistics->{'meta:cell-count'}\n"
            if ($statistics->{'meta:cell-count'});
    }

    #
    # A convenience subroutine to make dates look
    # prettier than ISO-8601 format.
    #
    sub format_date
    {
        my $date = shift;
        my ($year, $month, $day, $hr, $min, $sec);
        my @monthlist = qw (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

        ($year, $month, $day, $hr, $min, $sec) =
            $date =~ m/(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})/;
        return "$hr:$min on $day $monthlist[$month-1] $year";
    }
These two lines from the preceding program are where all the parsing takes place:
    my $fh = IO::File->new("perl member_read.pl $ARGV[0] meta.xml |");
    my $xml= XMLin( $fh, forcearray => [ 'meta:keyword'] );
In the first line, we used IO::File->new, because our version of Perl wouldn’t read from a file handle opened with the standard Perl open(). In the second line, the forcearray parameter will force the content of the <meta:keyword> element to be an array type, even if there is only one element. This avoids scalar versus array problems.
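The scalar versus array problem is easy to see in a small experiment. This sketch parses the same one-keyword document twice, once with the default options and once with the array forced; the sample XML is invented for the demonstration and declares its namespaces so that any parser behind XML::Simple will accept it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# A meta section with exactly one keyword (illustrative content).
my $xml = <<'XML';
<office:meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0">
  <meta:keyword>optics</meta:keyword>
</office:meta>
XML

my $plain  = XMLin($xml);
my $forced = XMLin($xml, ForceArray => [ 'meta:keyword' ]);

# With the defaults, a single keyword is a plain string ...
print "default: ", (ref($plain->{'meta:keyword'}) || 'plain string'), "\n";

# ... but with ForceArray it is an array reference, so code that
# dereferences it works for one keyword or for many.
print "forced:  ", ref($forced->{'meta:keyword'}), "\n";
print "joined:  ", join(' - ', @{ $forced->{'meta:keyword'} }), "\n";
```

Without the option, the join in the last line would fail under use strict as soon as a document happened to contain only one keyword.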
While XML::Simple is the easiest way to accomplish this task, it is not the most flexible way to parse XML. For more general XML parsing, you probably want to use the XML::SAX module. The section called “Showing Meta-information Using SAX” shows this same program written with the XML::SAX module.
The styles.xml file contains information about the styles that are used in the document. Some of this information is also duplicated in the content.xml document.
File styles.xml begins with an <office:document-styles> element, which contains font declarations (<office:font-face-decls>), default and named styles (<office:styles>), “automatic,” or unnamed styles (<office:automatic-styles>), and master styles (<office:master-styles>). All of these elements are optional.
The <office:font-face-decls> element contains zero or more <style:font-face> elements. <style:font-face> is an empty element, some of whose attributes are described in Table 2.4, “Attributes of the <style:font-face> element”.
Table 2.4. Attributes of the <style:font-face> element
Attribute | Description |
---|---|
style:name | The name of the font (required). |
svg:font-family | The font family (optional). It is not necessarily the same as the font name. For example, a font named Courier Bold Oblique belongs to the Courier family, and its svg:font-family attribute would be Courier. If a font family name has blanks in it, such as Zapf Chancery, OpenOffice.org encloses the value in single quotes. |
style:font-family-generic | The generic class to which this font belongs. Valid values for this optional attribute are roman (serif), swiss (sans-serif), modern, decorative, script, and system. |
style:font-pitch | This optional attribute tells whether the font is fixed (fixed-width, as is the Courier font) or variable (proportional-width). |
style:font-charset | The encoding for this font; this attribute is optional. |
There is also a large number of attributes borrowed from SVG, such as svg:font-stretch, svg:units-per-em, svg:ascent, but current applications that create OpenDocument documents don’t appear to use them.
The <office:styles> element is a container for (among other things) default styles and named styles. In OpenOffice.org, these are set with the Stylist tool. A spreadsheet’s <office:styles> element will also contain information about style for numbers, currency, percentage values, dates, times, and boolean data. A drawing will have information about default gradients, hatch patterns, fill images, markers, and dash patterns for drawing lines.
The most important elements that you will find within <office:styles> are <style:default-style> and <style:style>. Both elements contain a style:family attribute which tells what “level” the style applies to. The possible values of this required attribute are text (character level), paragraph, section, table, table-column, table-row, table-cell, table-page, chart, graphic, default, drawing-page, presentation, control, and ruby[1].
Both <style:default-style> and <style:style> have a style:name attribute. Styles built in to OpenOffice.org’s stylist, or ones that you create there, will have names like Heading_20_1 or Custom_20_Citation. Non-alphanumeric characters in names are converted to hexadecimal; thus blanks are converted to _20_. A style named !wow?#@$ would be stored as _21_wow_3f__23__40__24_. Automatic styles will have names consisting of a one- or two-letter abbreviation followed by a number; a style name such as T1 is the first automatic style for style:family="text"; P3 would be the third style for paragraphs, ta2 would be the second style for a table, ro4 would be the fourth style for a table row, etc.
Internal names are stored in the style:name attribute, with non-alphanumeric characters translated to their hexadecimal equivalents. If there are any non-alphanumeric characters, OpenDocument also provides a style:display-name attribute that gives the unencoded version of the name, suitable for display to a user in an application. Thus, the encoded style:name="_21_wow_3f__23__40__24_" has the display form style:display-name="!wow?#@$".
You will see the same pairing of name and display name in graphics, as the draw:name and draw:display-name attributes.
The other attribute of interest is the optional style:parent-style-name, which you will find in styles that have been derived from other styles. In a text document, OpenOffice.org will often create a temporary style whose parent is the style found in the styles.xml file.
Within each <style:style> or <style:default-style>, you will find a <style:family-properties> element, where family corresponds to the value of the style:family attribute. This element describes the style in minute detail via an immense number of attributes. If a style has style:family="table", then it will contain a <style:table-properties> element; a style with style:family="paragraph" will contain a <style:paragraph-properties> element, and so forth.
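The encoding rule can be sketched in a few lines of Perl. These two subroutines (the names encode_style_name and decode_style_name are ours) follow the description given here and assume that only ASCII letters and digits are left unencoded; an application may treat some other characters differently, so consider this an approximation rather than a definitive implementation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every character that is not an ASCII letter or digit
# with its hexadecimal code surrounded by underscores.
sub encode_style_name {
    my ($display) = @_;
    (my $encoded = $display) =~ s/([^A-Za-z0-9])/sprintf('_%x_', ord($1))/ge;
    return $encoded;
}

# Reverse the process: turn each _hex_ group back into its character.
sub decode_style_name {
    my ($name) = @_;
    (my $decoded = $name) =~ s/_([0-9a-f]+)_/chr(hex($1))/ge;
    return $decoded;
}

print encode_style_name('Heading 1'), "\n";     # Heading_20_1
print encode_style_name('!wow?#@$'), "\n";      # _21_wow_3f__23__40__24_
print decode_style_name('Custom_20_Citation'), "\n";    # Custom Citation
```

When a style:display-name attribute is present, reading it is simpler and safer than decoding style:name yourself.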
A full discussion of styles is beyond the scope of this book, so we will simply give you an idea of the range of style specifications, and take up specific details of styles when they are relevant in other chapters. Example 2.4, “Style Definition in a Word Processing Document”, Example 2.5, “Style Definition in a Spreadsheet Document”, and Example 2.6, “Style Definition in a Drawing Document” are excerpts from the styles.xml files in a word processing, spreadsheet, and drawing document.
Example 2.4. Style Definition in a Word Processing Document
    <style:style style:name="Text_20_body" style:display-name="Text body"
        style:family="paragraph"
        style:parent-style-name="Standard" style:class="text">
        <style:paragraph-properties fo:margin-top="0in"
            fo:margin-bottom="0.0835in"/>
    </style:style>
Example 2.5. Style Definition in a Spreadsheet Document
    <number:currency-style style:name="N104">
        <style:text-properties fo:color="#ff0000"/>
        <number:text>-</number:text>
        <number:currency-symbol number:language="en"
            number:country="US">$</number:currency-symbol>
        <number:number number:decimal-places="2"
            number:min-integer-digits="1" number:grouping="true"/>
        <style:map style:condition="value()>=0"
            style:apply-style-name="N104P0"/>
    </number:currency-style>

    <style:style style:name="Result2" style:family="table-cell"
        style:parent-style-name="Result" style:data-style-name="N104" />
Example 2.6. Style Definition in a Drawing Document
    <style:style style:name="objectwitharrow" style:family="graphic"
        style:parent-style-name="standard">
        <style:graphic-properties draw:stroke="solid"
            svg:stroke-width="0.15cm" svg:stroke-color="#000000"
            draw:marker-start="Arrow" draw:marker-start-width="0.7cm"
            draw:marker-start-center="true" draw:marker-end-width="0.3cm"/>
    </style:style>
Although the details of the content.xml file vary widely depending upon the type of document you are dealing with, there are elements which are common to all content.xml files. The root element is the <office:document-content> element. It defines all the namespaces that will be used throughout the document. The office:version attribute tells you which version of OpenDocument was used in the document.
The following elements are contained within the <office:document-content> element. The optional <office:scripts> element does appear in most documents and is always empty, even if your document contains macros. Go figure.
The <office:scripts> is followed by elements that describe the document’s presentation. The optional <office:font-face-decls> element describes fonts used in your document, and duplicates the information found in styles.xml. If you have defined any styles “on the fly,” then these automatic styles are described in the optional <office:automatic-styles> element.
The last child element of <office:document-content> is the required, and all-important, <office:body> element. This is where all the action is, and we will spend much of the rest of this book examining its contents. Its first child element tells which kind of document we are dealing with:
Example 2.7, “Structure of the content.xml file” shows the skeleton for an OpenOffice.org document’s content.xml file.
Example 2.7. Structure of the content.xml file
    <office:document-content namespace declarations
        office:version="1.0" office:class="document type">
        <office:scripts/>
        <office:font-face-decls>
            <!-- font specifications -->
        </office:font-face-decls>
        <office:automatic-styles>
            <!-- style information -->
        </office:automatic-styles>
        <office:body>
            <office:documentType>
                <!-- actual content here -->
            </office:documentType>
        </office:body>
    </office:document-content>
[1] Ruby refers to “furigana,” which are small Japanese alphabetic characters placed near the Japanese ideograms to aid readers in determining their correct meaning.
Copyright (c) 2005 O’Reilly & Associates, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".