Though content.xml is king, monarchs rule better when surrounded by able assistants. In an OpenDocument JAR file, these assistants are the meta.xml, styles.xml, and settings.xml files. In this chapter, we will examine the assistant files, and then describe the general structure of the content.xml file.
The only files that are actually necessary are content.xml and the META-INF/manifest.xml file. If you create a content.xml file that contains word processor elements, and zip it up together with a manifest that points to that file, OpenOffice.org will be able to open it successfully. The result will be a plain text-only document with no styles. You won’t have any of the meta-information about who created the file or when it was last edited, and the printer settings, view area, and zoom factor will be set to the OpenOffice.org defaults.
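The two-file package described above can be sketched in a few lines. This is a minimal sketch in Python (not the Perl used for this chapter's programs); the paragraph text, the function name build_minimal_package, and the choice to build the archive in memory are our own assumptions, and we have not verified here which office applications accept the result.

```python
import io
import zipfile

# A content.xml with a single paragraph and no styles at all.
CONTENT_XML = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    office:version="1.0">
  <office:body>
    <office:text>
      <text:p>Hello, OpenDocument.</text:p>
    </office:text>
  </office:body>
</office:document-content>
"""

# The manifest names the package's media type and points at content.xml.
MANIFEST_XML = """<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest
    xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
  <manifest:file-entry manifest:full-path="/"
      manifest:media-type="application/vnd.oasis.opendocument.text"/>
  <manifest:file-entry manifest:full-path="content.xml"
      manifest:media-type="text/xml"/>
</manifest:manifest>
"""


def build_minimal_package() -> bytes:
    """Return the bytes of a ZIP archive holding only the two required files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("content.xml", CONTENT_XML)
        package.writestr("META-INF/manifest.xml", MANIFEST_XML)
    return buffer.getvalue()
```

Writing the returned bytes to a file with an .odt extension gives you a package to experiment with; everything that meta.xml, styles.xml, and settings.xml would normally supply is simply absent.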
The <config:config-item> element has a config:name attribute that describes the item and a config:type attribute which can be one of boolean, short, int, long, double, string, datetime, or base64Binary. The content of the element gives the value of that item. Example 2.1, “Example of Configuration Items” shows some representative configuration items from a word processing document:
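To make the role of config:type concrete, here is a minimal Python sketch (not the Perl used for this chapter's programs) that turns configuration items into native values. The item names and values in SAMPLE are illustrative and are not copied from a real settings.xml; the decision to leave datetime values as strings is also our own simplification.

```python
import base64
import xml.etree.ElementTree as ET

CONFIG_NS = "urn:oasis:names:tc:opendocument:xmlns:config:1.0"

# Illustrative items only; a real settings.xml holds many more.
SAMPLE = f"""<config:config-item-set xmlns:config="{CONFIG_NS}" config:name="example">
  <config:config-item config:name="ShowRedlineChanges" config:type="boolean">true</config:config-item>
  <config:config-item config:name="ZoomFactor" config:type="short">100</config:config-item>
  <config:config-item config:name="PrinterName" config:type="string"/>
</config:config-item-set>"""

# One converter per value that config:type may take.
CONVERTERS = {
    "boolean": lambda text: text == "true",
    "short": int,
    "int": int,
    "long": int,
    "double": float,
    "string": str,
    "datetime": str,  # assumption: kept as text rather than parsed
    "base64Binary": base64.b64decode,
}


def read_config_items(xml_text):
    """Map each config:name to its value, converted according to config:type."""
    root = ET.fromstring(xml_text)
    items = {}
    for item in root.iter(f"{{{CONFIG_NS}}}config-item"):
        name = item.get(f"{{{CONFIG_NS}}}name")
        kind = item.get(f"{{{CONFIG_NS}}}type")
        items[name] = CONVERTERS[kind](item.text or "")
    return items
```

An empty string item, such as PrinterName in the sample, comes back as an empty string rather than as a missing key.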
The meta.xml file contains information about the document itself. We’ll look at the elements found in this file in decreasing order of importance; at the end of this section, we will list them in the order in which they appear in a document. Most of these elements are reflected in the tabs on OpenOffice.org’s File/Properties dialog, which are shown in Figure 2.1, “General Document Properties”, Figure 2.2, “Document Description”, Figure 2.3, “User-defined Information”, and Figure 2.4, “Document Statistics”.
All elements borrowed from the Dublin Core namespace contain text and have no attributes. Table 2.1, “Dublin Core Elements in meta.xml” summarizes them.
Table 2.1. Dublin Core Elements in meta.xml
Element | Description | Sample from XML file |
---|---|---|
<dc:title> | The document title; this appears in the title bar. | <dc:title>An Introduction to Digital Cameras</dc:title> |
<dc:subject> | The Dublin Core recommends that this element contain keywords or key phrases to describe the topic of the document; OpenOffice.org keeps keywords in a separate set of elements. | <dc:subject>Digital Photography</dc:subject> |
<dc:description> | This element’s content is shown in the Comments field in the dialog box. | <dc:description>This introduction…</dc:description> |
<dc:creator> | This element’s content is shown in the Modified field in Figure 2.1, “General Document Properties”; it names the last person to edit the file. This may appear odd, but the Dublin Core says that the creator is simply an “entity primarily responsible for making the content of the resource.” That is not necessarily the original creator, whose name is stored in a different element. | <dc:creator>J David Eisenberg</dc:creator> |
<dc:date> | This element’s content is also shown in the Modified field in Figure 2.1, “General Document Properties”. It is stored in a form compatible with ISO-8601. The time is shown in local time. See the section called “Time and Duration Formats” for details about times and dates. | <dc:date>2005-05-30T20:30:30</dc:date> |
<dc:language> | The document’s language, written as a two- or three-letter main language code followed by a two-letter sublanguage code. This field is not shown in the properties dialog, but is found in OpenOffice.org’s Tools/Options/Language Settings dialog. | <dc:language>en-US</dc:language> |
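Because the Dublin Core elements contain only text, collecting them takes very little code. The following is a minimal Python sketch (not the Perl used for this chapter's programs); the sample reuses values from Table 2.1, and the function name dublin_core is our own.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
DC_NS = "http://purl.org/dc/elements/1.1/"

# A cut-down meta.xml using the sample values from Table 2.1.
SAMPLE_META = f"""<office:document-meta xmlns:office="{OFFICE_NS}" xmlns:dc="{DC_NS}">
  <office:meta>
    <dc:title>An Introduction to Digital Cameras</dc:title>
    <dc:subject>Digital Photography</dc:subject>
    <dc:creator>J David Eisenberg</dc:creator>
    <dc:date>2005-05-30T20:30:30</dc:date>
    <dc:language>en-US</dc:language>
  </office:meta>
</office:document-meta>"""


def dublin_core(xml_text):
    """Return a dict of every Dublin Core element, keyed by local name."""
    prefix = "{" + DC_NS + "}"
    found = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(prefix):
            found[element.tag[len(prefix):]] = element.text or ""
    return found


info = dublin_core(SAMPLE_META)
# dc:date has no time zone, so this yields a naive (local-time) datetime.
modified = datetime.fromisoformat(info["date"])
```

Remember that info["creator"] names the last person to edit the file, not the original author.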
The remaining elements in the meta.xml file come from OpenDocument’s meta namespace. Table 2.2, “OpenDocument Elements in meta.xml” describes these elements in the order in which they appear in the file.
Table 2.2. OpenDocument Elements in meta.xml
Element | Description | Sample from XML file |
---|---|---|
<meta:generator> | The program that created this document. According to the specification, you should not “fake” being OpenOffice.org if you are creating the document using a different program; you should use a unique identifier. | <meta:generator>OpenOffice.org/1.9.100$Linux OpenOffice.org_project/680m100$Build-8909</meta:generator> |
<meta:initial-creator> | The user who created the document. This is shown in the “Created:” area in Figure 2.1, “General Document Properties”. | <meta:initial-creator>Steven Eisenberg</meta:initial-creator> |
<meta:creation-date> | The date and time when the document was created. This is shown in the “Created:” area in Figure 2.1, “General Document Properties”. It is in the same format as described in the section called “Time and Duration Formats”. | <meta:creation-date>2005-05-30T20:29:42</meta:creation-date> |
<meta:keyword> | A document can have multiple <meta:keyword> elements. These elements reflect the entries in the “Keywords:” area in Figure 2.2, “Document Description”. | <meta:keyword>photography</meta:keyword> <meta:keyword>cameras</meta:keyword> <meta:keyword>optics</meta:keyword> <meta:keyword>digital cameras</meta:keyword> |
<meta:editing-cycles> | This element tells how many times the file has been edited; this is the “Revision Number:” in Figure 2.1, “General Document Properties”. | <meta:editing-cycles>5</meta:editing-cycles> |
<meta:editing-duration> | This element tells the total amount of time that has been spent editing the document in all editing sessions; this is the “Editing time:” in Figure 2.1, “General Document Properties”, and is represented as described in the section called “Time and Duration Formats”. | <meta:editing-duration>PT1H28M55S</meta:editing-duration> |
<meta:user-defined> | OpenOffice.org allows you to define your own information, as shown in Figure 2.3, “User-defined Information”. This element has a meta:name attribute, giving the “title” of this information, and the content of the element is the information itself. | <meta:user-defined meta:name="Maximum Length">3 pages or 750 words</meta:user-defined> |
<meta:document-statistic> | This is the information shown on the statistics tab of the properties dialog (see Figure 2.4, “Document Statistics”). This element has attributes whose names are largely self-explanatory, and are listed in Table 2.3, “Attributes of the <meta:document-statistic> Element”. | <meta:document-statistic meta:paragraph-count="4"…/> |
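The value of <meta:editing-duration> is not directly usable as a number, so a program that reports editing time has to decode it first. Here is a minimal Python sketch (not the Perl used for this chapter's programs). It assumes the ISO 8601 duration form with only days, hours, minutes, and seconds; years and months are deliberately not handled, and the function name parse_duration is our own.

```python
import re
from datetime import timedelta

# P[nD][T[nH][nM][nS]] -- a deliberately restricted subset of ISO 8601.
DURATION = re.compile(
    r"^P(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"
)


def parse_duration(text):
    """Convert a duration such as PT1H28M55S into a timedelta."""
    match = DURATION.match(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes, seconds = match.groups()
    return timedelta(
        days=int(days or 0),
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds or 0),
    )


# The sample value from Table 2.2: one hour, 28 minutes, 55 seconds.
editing_time = parse_duration("PT1H28M55S")
```

Calling editing_time.total_seconds() then gives a plain number that you can sum or compare across documents.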
Table 2.3. Attributes of the <meta:document-statistic> Element
Attribute | Description |
---|---|
meta:page-count | Number of pages in a word processing document. This must be greater than zero. This attribute is not used in spreadsheets. The “number of pages” shown in the statistics dialog for a spreadsheet is a calculated value that tells how many sheets have filled cells on them, and this can be zero for a totally empty spreadsheet. |
meta:paragraph-count | Number of paragraphs in a word processing document. |
meta:word-count | Number of words in a word processing document. |
meta:character-count | Number of characters in a word processing document. |
meta:image-count | Number of images in a word processing document. |
meta:table-count | Number of tables in a word processing document, or number of sheets in a spreadsheet document. |
meta:cell-count | Number of non-empty cells in a spreadsheet document. |
meta:object-count | Number of objects in a document. This is shown as “Number of OLE objects” in the dialog box of Figure 2.4, “Document Statistics”. This attribute is used in drawing and presentation documents, but it does not bear any simple relationship to the number of items you see on the screen. |
meta:ole-object-count | Apparently unused in OpenOffice.org 2.0. |
meta:row-count | Apparently unused in OpenOffice.org 2.0. |
meta:draw-count | Apparently unused in OpenOffice.org 2.0. |
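Since different document types use different attributes, a program that reports statistics needs to pick the relevant ones. The following is a minimal Python sketch (not the Perl used for this chapter's programs). The RELEVANT mapping is our reading of Table 2.3, and every count in SAMPLE except the paragraph count is a made-up value.

```python
import xml.etree.ElementTree as ET

META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"

# Which attributes to report for which kind of document (our reading
# of Table 2.3; the keys are labels chosen for this sketch).
RELEVANT = {
    "text": ["page-count", "paragraph-count", "word-count",
             "character-count", "image-count", "table-count"],
    "spreadsheet": ["table-count", "cell-count"],
    "drawing": ["object-count"],
    "presentation": ["object-count"],
}

# Hypothetical statistics for a short word processing document.
SAMPLE = (
    f'<meta:document-statistic xmlns:meta="{META_NS}" '
    'meta:page-count="1" meta:paragraph-count="4" '
    'meta:word-count="120" meta:character-count="700" '
    'meta:image-count="2" meta:table-count="0"/>'
)


def statistics(xml_text, kind):
    """Return the counts that apply to the given kind of document."""
    element = ET.fromstring(xml_text)
    result = {}
    for name in RELEVANT[kind]:
        value = element.get(f"{{{META_NS}}}{name}")
        if value is not None:  # skip attributes the file does not carry
            result[name] = int(value)
    return result
```

Attributes that are missing from the element are skipped rather than reported as zero, so the caller can tell “no tables” apart from “not recorded”.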
It would be inefficient to unpack the document into a temporary directory, read the meta.xml file, and then remove the temporary directory. Archive::Zip::MemberRead avoids this problem by letting you read members of a ZIP archive without having to unpack them. Example 2.2, “Program member_read.pl” shows the program that uses this module. The program takes two arguments: the OpenDocument file name and the member of the packed document that you want to read. The program parcels out the data in 32 kilobyte chunks rather than reading it in as one huge string. It also sends its output to standard output so that it can be piped to other processes. [This is program member_read.pl in directory ch02 in the downloadable example files.]
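The same idea, reading a member straight out of the archive in 32 kilobyte chunks, can be sketched with Python's standard zipfile module. This is not a translation of member_read.pl, which is written in Perl; the function name copy_member and its parameters are our own.

```python
import zipfile

CHUNK_SIZE = 32 * 1024  # 32 kilobytes, as in the program described above


def copy_member(package, member_name, out, chunk_size=CHUNK_SIZE):
    """Stream one member of a ZIP package to a binary output stream.

    `package` may be a file name or a file-like object. Nothing is
    unpacked to disk, and at most `chunk_size` bytes are held at once.
    Returns the number of bytes written.
    """
    total = 0
    with zipfile.ZipFile(package) as archive:
        with archive.open(member_name) as member:
            while True:
                chunk = member.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
                total += len(chunk)
    return total
```

To mimic the command-line behavior, you could call copy_member(sys.argv[1], sys.argv[2], sys.stdout.buffer) after importing sys; the output can then be piped to another process.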
<document> <h1>Information</h1> <para align="left" style="color:red;">More info</para> </document>
XML::Simple returns a data structure as if we had written this Perl statement:
$result = { 'h1' => 'Information', 'para' => { 'align' => 'left', 'content' => 'More info', 'style' => 'color:red;' } };
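To see how such a structure arises, here is a rough Python imitation of the conversion (not XML::Simple itself, and not Perl). It covers only the behavior described above plus a force_list parameter, our own name, that plays the part of the forcearray option; XML::Simple has many more options that are ignored here.

```python
import xml.etree.ElementTree as ET


def simplify(element, force_list=()):
    """Turn an element into nested dicts, lists, and strings.

    Attributes and child elements become dict entries, text that sits
    beside them is stored under 'content', and an element holding only
    text collapses to a plain string. The root element's own name is
    discarded.
    """
    result = dict(element.attrib)
    for child in element:
        value = simplify(child, force_list)
        if child.tag in result:
            existing = result[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:  # second occurrence: promote to a list
                result[child.tag] = [existing, value]
        elif child.tag in force_list:
            result[child.tag] = [value]  # a list even for one item
        else:
            result[child.tag] = value
    text = (element.text or "").strip()
    if text:
        if not result:
            return text
        result["content"] = text
    return result


SAMPLE = ('<document> <h1>Information</h1> '
          '<para align="left" style="color:red;">More info</para> </document>')
structure = simplify(ET.fromstring(SAMPLE))
```

Forcing a list matters for elements that may occur once or many times: without it, a single occurrence yields a string and several occurrences yield a list, so the calling code would have to handle both shapes.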
The program that actually does the extraction, Example 2.3, “Program show_meta.pl”, takes one argument: the OpenDocument filename. The program receives its input from the piped output of member_read.pl.
After the file is parsed, the program prints the data. Information in the <meta:document-statistic> is selected depending upon the type of document being parsed. The program also uses the Text::Wrap module to format the description, which may be several lines long. [This is program show_meta.pl in directory ch02 in the downloadable example files.]
These two lines from the preceding program are where all the parsing takes place:
my $fh = IO::File->new("perl member_read.pl $ARGV[0] meta.xml |");
my $xml = XMLin( $fh, forcearray => [ 'meta:keyword'] );
While XML::Simple is the easiest way to accomplish this task, it is not the most flexible way to parse XML. For more general XML parsing, you probably want to use the XML::SAX module. The section called “Showing Meta-information Using SAX” shows this same program written with the XML::SAX module.
The <office:font-face-decls> element contains zero or more <style:font-face> elements. <style:font-face> is an empty element, some of whose attributes are described in Table 2.4, “Attributes of the <style:font-face> element”.
Table 2.4. Attributes of the <style:font-face> element
Attribute | Description |
---|---|
style:name | The name of the font (required). |
svg:font-family | The font family (optional). It is not necessarily the same as the font name. For example, a font named Courier Bold Oblique belongs to the Courier family, and its svg:font-family attribute would be Courier. If a font family name has blanks in it, such as Zapf Chancery, OpenOffice.org encloses the value in single quotes. |
style:font-family-generic | The generic class to which this font belongs. Valid values for this optional attribute are roman (serif), swiss (sans-serif), modern, decorative, script, and system. |
style:font-pitch | This optional attribute tells whether the font is fixed (fixed-width, as is the Courier font) or variable (proportional-width). |
style:font-charset | The encoding for this font; this attribute is optional. |
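As an illustration of these attributes, the following minimal Python sketch (not the Perl used for this chapter's programs) lists the fonts declared in an <office:font-face-decls> element. The two declarations in SAMPLE are hypothetical ones built from the fonts mentioned in Table 2.4, and the function name font_faces is our own.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
SVG_NS = "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"

# Hypothetical declarations; note the single quotes around a family
# name that contains a blank.
SAMPLE = f"""<office:font-face-decls xmlns:office="{OFFICE_NS}"
    xmlns:style="{STYLE_NS}" xmlns:svg="{SVG_NS}">
  <style:font-face style:name="Courier Bold Oblique" svg:font-family="Courier"
      style:font-family-generic="modern" style:font-pitch="fixed"/>
  <style:font-face style:name="Zapf Chancery" svg:font-family="'Zapf Chancery'"
      style:font-family-generic="script" style:font-pitch="variable"/>
</office:font-face-decls>"""


def font_faces(xml_text):
    """Return one dict per <style:font-face>, with quotes stripped."""
    faces = []
    for face in ET.fromstring(xml_text).iter(f"{{{STYLE_NS}}}font-face"):
        family = face.get(f"{{{SVG_NS}}}font-family", "")
        faces.append({
            "name": face.get(f"{{{STYLE_NS}}}name"),
            "family": family.strip("'"),
            "generic": face.get(f"{{{STYLE_NS}}}font-family-generic"),
            "pitch": face.get(f"{{{STYLE_NS}}}font-pitch"),
        })
    return faces
```

The optional attributes come back as None when a declaration omits them, except for the family, which defaults to an empty string so that the quotes can be stripped safely.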
The most important elements that you will find within <office:styles> are <style:default-style> and <style:style>. Both elements contain a style:family attribute which tells what “level” the style applies to. The possible values of this required attribute are text (character level), paragraph, section, table, table-column, table-row, table-cell, table-page, chart, graphics, default, drawing-page, presentation, control, and ruby.[1]
Both <style:default-style> and <style:style> have a style:name attribute. Styles built in to OpenOffice.org’s stylist, or ones that you create there, will have names like Heading_20_1 or Custom_20_Citation. Non-alphanumeric characters in names are converted to hexadecimal; thus blanks are converted to _20_. A style named !wow?#@$ would be stored as _21_wow_3f__23__40__24_. Automatic styles will have names consisting of a one- or two-letter abbreviation followed by a number; a style name such as T1 is the first automatic style for style:family="text"; P3 would be the third style for paragraphs, ta2 would be the second style for a table, ro4 would be the fourth style for a table row, etc.
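The name conversion is easy to reproduce. Here is a minimal Python sketch (not the Perl used for this chapter's programs) that follows the rule exactly as stated above: every character that is not an ASCII letter or digit becomes its hexadecimal code between underscores. Real applications may treat some characters differently, and the function names are our own.

```python
import re


def encode_style_name(display_name):
    """Convert a display name such as 'Heading 1' to its stored form."""
    pieces = []
    for ch in display_name:
        if ch.isascii() and ch.isalnum():
            pieces.append(ch)
        else:
            # Hexadecimal code point, lowercase, wrapped in underscores.
            pieces.append("_" + format(ord(ch), "x") + "_")
    return "".join(pieces)


def decode_style_name(stored_name):
    """Reverse the conversion, turning _20_ back into a blank, and so on."""
    return re.sub(
        r"_([0-9a-f]+)_",
        lambda match: chr(int(match.group(1), 16)),
        stored_name,
    )
```

With these definitions, encode_style_name("Heading 1") gives Heading_20_1 and encode_style_name("!wow?#@$") gives _21_wow_3f__23__40__24_, matching the examples in the text. Because underscores themselves are encoded (as _5f_), decoding is unambiguous for names produced by this sketch.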
A full discussion of styles is beyond the scope of this book, so we will simply give you an idea of the range of style specifications, and take up specific details of styles when they are relevant in other chapters. Example 2.4, “Style Definition in a Word Processing Document”, Example 2.5, “Style Definition in a Spreadsheet Document”, and Example 2.6, “Style Definition in a Drawing Document” are excerpts from the styles.xml files in a word processing, spreadsheet, and drawing document.
Example 2.7, “Structure of the content.xml file” shows the skeleton for an OpenOffice.org document’s content.xml file.
Copyright (c) 2005 O’Reilly & Associates, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".