Appendix B. The XSLT You Need for OpenDocument

The purpose of this appendix is to introduce you to XSLT. A knowledge of XSLT is essential if you wish to easily transform OpenDocument files to text, XHTML, or other XML formats; or to transform XML and XHTML documents to OpenDocument.

The idea behind XSLT is to transform an XML source document to an output document which may be XML or just plain text. The transformation is accomplished by feeding the source document to an XSLT transformation stylesheet.

Before we can get into the details of XSLT, we need to talk about how the transformation program views your document and refers to its elements, using a notation called XPath. Take a look at the document in Example B.1, “Sample XML Document” (with line numbers for reference only). XPath (conceptually) represents it as a tree like the one shown in Figure B.1, “Tree Representation of Sample Document”.
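Example B.1 itself is not reproduced in this text. As a stand-in, the following sketch is one document that is consistent with the line numbers cited throughout this appendix, counting the XML declaration as line 1. The processing instruction, the comment, and all text content other than the word Centered are assumptions made purely for illustration.

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="sample.xsl"?>
<!-- A sample document (the wording of this comment is hypothetical) -->
<document>
    <para align="center">Centered</para>
    <para>Second paragraph</para>
    <index>
        <item>
            <para>First index entry</para>
        </item>
        <item>
            <para>Second index entry</para>
        </item>
    </index>
    <endnote>Closing note</endnote>
</document>
```

In this stand-in, the <document> start tag falls on line 4, the two top-level <para> elements on lines 5 and 6, the <index> on line 7, the <item> start tags on lines 8 and 11, and the <endnote> on line 15.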

This looks a lot like a file directory listing, and we will begin talking about XPath using this analogy. The highlighted item at the top of the diagram is called the root node of the tree, which corresponds to the root of a UNIX file system. The root node is not the same as the root element of the document. As you see, there is a processing instruction node and a comment node that precede the document’s root element. They are part of the document as a whole, and must be represented in the tree.

Just as you may select a file from a directory tree by specifying its absolute pathname, you may select any node from the document tree by its absolute path. To select the <index> element, the path would be /document/index. In a file system, names must be unique within a directory. Thus, a path will select only one file. In an XML document, however, a parent element may have many child elements with the same name. Thus, the absolute XPath expression /document/para selects the nodes on lines 5 and 6 of the sample document, and /document/index/item selects the elements whose start tags are on lines 8 and 11.

Just as you can create relative path names in a file system (relative to the “current working directory”), XPath allows you to specify nodes by using path names relative to the “context node” in the document tree. If your context node is the <document> element on line 4, the relative path to the index nodes is simply index. If your context node is the <index> on line 7, then the path item/para will select the nodes on lines 9 and 12. In these cases, as with file systems, every time you take a new step on the path, you look at the children of the context node.

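Just as a filesystem path uses .. (dot dot) to allow you to move up to a parent directory in the directory tree, XPath uses .. to move to the parent of a node. Thus, if your context node is the <para> element on line 12, the relative path name to the <endnote> element would be ../../../endnote: the first step up reaches the enclosing <item>, the second reaches the <index>, and the third reaches the <document>, whose child the <endnote> is.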

At this point, we must abandon the filesystem analogy, since XPath gives you many more ways to move around the document tree than just the parent and child relationships. With XPath, you may specify an axis, which is the direction in which you want to look as you move to the next step in the path. We have actually been using the child axis as a default; the non-abbreviated syntax for item/para is child::item/child::para. The following descriptions of XPath axes are adapted from XML in a Nutshell, by Elliotte Rusty Harold and W. Scott Means, ISBN 1-56592-580-7.

In an XPath expression, you select nodes by giving the axis and the name of the node(s) you want. You may also use these constructions: axis::* to mean “all element nodes along an axis,” @* to mean “all attribute nodes of the context node,” and axis::node() to mean “all nodes of any type, including text and comment nodes, along an axis.” Table B.1, “Examples of XPath Axes” gives some examples in terms of line numbers in Example B.1, “Sample XML Document”.

But wait, there’s more! You can also select nodes depending on a condition. This condition is known as a predicate (since the selection is “predicated upon” the condition). The predicate is enclosed in square brackets, and may also contain an XPath expression. Thus, to find all <para> elements that have an align attribute with the value center, you would say //para[@align="center"], or, without abbreviations, /descendant-or-self::node()/child::para[attribute::align="center"]. If you were at line 15 of Example B.1, “Sample XML Document” and wanted to find all the preceding <para> elements that had <item> parents, you would say preceding::para[parent::item]. A predicate does not require a relational operation like = or > or <=; if the node exists, the predicate is true. If it doesn’t, the predicate is false.

Note

The expression preceding::para[parent::item] selects the <para> nodes, not the <item> node. The predicate looks for the parent::item, but does not change the context node.

If the predicate consists of a number n, then the predicate matches the nth element in the set of selected nodes. The first node is numbered one, not zero.
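As a minimal sketch of the three kinds of predicate described above, the following stylesheet evaluates one expression of each kind against the sample document. It borrows the count() function and the text output method, both of which are introduced later in this appendix; the layout of the output is an arbitrary choice made for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

<xsl:output method="text"/>

<xsl:template match="/">
    <!-- comparison predicate: how many <para> elements are centered -->
    <xsl:value-of select="count(//para[@align='center'])"/>
    <xsl:text>
</xsl:text>
    <!-- existence predicate: how many <para> elements have an <item> parent -->
    <xsl:value-of select="count(//para[parent::item])"/>
    <xsl:text>
</xsl:text>
    <!-- numeric predicate: the first <para> child of <document> -->
    <xsl:value-of select="/document/para[1]"/>
    <xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>
```

Because numbering starts at one, /document/para[1] picks the <para> on line 5, so the last line of output should be the word Centered.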

XPath’s ability to reach any part of a document from any other part of the document allows XSLT to perform powerful and radical transformations on documents.

An XSLT stylesheet consists of a series of templates, which tell the transformation engine what to do when it encounters certain items. For example, a template might express the idea of “whenever you find a <para> element in the source document, put a <p> element into the output document.” We’ll see these in action in a while, but first, let’s look at the simplest possible transformation.

Here is the simplest possible XSLT stylesheet, consisting of just a beginning and ending tag:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
</xsl:stylesheet>

Although you may not see anything between those tags, XSLT has inserted some default templates for you. Here are the two most important ones:


<xsl:template match="*|/">
    <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()|@*">
    <xsl:value-of select="."/>
</xsl:template>

The first template says, “if the current node is an element node (*) or (|) the root node (/), then visit all its children and see if you have any templates that apply to them.” The second template says, “if the current node is a text node (text()) or an attribute node (@*), output the value of that node (.).” Other default templates tell XSLT to ignore processing instructions, comments, and namespace nodes.

If we use this transformation on Example B.1, “Sample XML Document”, XSLT will start at the root node and visit all its children. The first child is a processing instruction, which is ignored. The second child is a comment, which is also ignored. The third child is the <document> node. Since we have no template of our own, the first default template (*|/) is the only one that matches, so we visit the <document>’s children. The first child is a <para>, and, since we haven’t provided a template to match it, the first default matches again, and we visit <para>’s children. Since attributes are not considered as children, we skip the align attribute, and find the text node Centered. The second default template matches that text node, so the value Centered is written to the output file.

Proceeding in this manner, we see that the empty transformation will create an output document consisting of only the text contained in the source document.

Note

If you run this transformation, you will see that the output contains many blank lines. This is because the newline and tabs between, say, lines 4 and 5, actually create a text node, and those text nodes go into the output document via the default templates. We conveniently ignored these “whitespace-only” text nodes when producing the tree diagram in Figure B.1, “Tree Representation of Sample Document”, but now we have to face their cold reality.
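One way to keep those whitespace-only text nodes out of the output is the standard top-level <xsl:strip-space> element. The sketch below is not part of the original example; it simply adds that element to the empty stylesheet.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
    <!-- discard whitespace-only text nodes in every source element
         before the templates ever see them -->
    <xsl:strip-space elements="*"/>
</xsl:stylesheet>
```

With the whitespace-only nodes gone, the blank lines disappear, but so do the line breaks between pieces of text, so the remaining text runs together.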

Now let’s add some templates to handle things ourselves and create an XHTML file as the output from our sample file.

Let’s examine the first template in detail. This says: “Whenever you find a <document> element, I want you to emit these items: [25] <html>, <head>, <title>, the text My Document, </title>, </head>, and <body>.

“Now go visit all the children of this node and apply any templates you have for them. Once you have finished that, …

“Emit a </body> and </html> into the output. That is all that needs to be done to completely handle a <document> element.”

The <document>’s first child is a <para>, and it will be handled by the second template that we have added. It says, “Whenever you encounter a <para> node, emit a <p> into the output, visit all of the children of this node and process them as their templates demand, then emit a </p> into the output.” Since the paragraph’s child is a text node, the default template will put the text into the output between the <p> and </p> tags.

The <document>’s next child is an <index>, so the appropriate <xsl:template> will be applied. It, in turn, uses <xsl:apply-templates> to visit all its <item> children, which will be handled by the <xsl:template match="item">.

After the <index> and its descendants are handled, the last child of <document>, the <endnote>, will be handled by the last <xsl:template>.
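Example B.2 is not reproduced in this text. The following is a sketch of a stylesheet that behaves as the walkthrough above describes. The first two templates follow the description directly; the output elements chosen for <index>, <item>, and <endnote> (<ul>, <li>, and <div>) are assumptions made for illustration, and the XHTML namespace declaration is left out for brevity.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

<xsl:template match="document">
    <html>
        <head>
            <title>My Document</title>
        </head>
        <body>
            <xsl:apply-templates/>
        </body>
    </html>
</xsl:template>

<xsl:template match="para">
    <p><xsl:apply-templates/></p>
</xsl:template>

<!-- assumption: the original may emit different elements
     for the three templates below -->
<xsl:template match="index">
    <ul><xsl:apply-templates/></ul>
</xsl:template>

<xsl:template match="item">
    <li><xsl:apply-templates/></li>
</xsl:template>

<xsl:template match="endnote">
    <div><xsl:apply-templates/></div>
</xsl:template>

</xsl:stylesheet>
```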

So far, we have been processing all the child nodes indiscriminately with <xsl:apply-templates/>. If we wish to visit only specific children, we use a select attribute whose value is an XPath expression that selects the children we wish to visit. Thus, if we only wanted the index portion of the document to be converted to XHTML, we would change the first template of Example B.2, “Simple Templates” to read:

<xsl:template match="document">
    <html>
        <head>
            <title>My Document</title>
        </head>
        <body>
            <xsl:apply-templates select="index"/>
        </body>
    </html>
</xsl:template>

The important thing to remember about the select expression is that the context node is the one that was matched by the <xsl:template>. In the preceding example, the context node for the select="index" is the <document> element currently being processed.

To show a more complicated example of a selection, we need a more complex XML file. Example B.3, “Gradebook Data in XML” shows a student gradebook file represented in XML:

Our goal is to list the results for all the students who have active email addresses and have taken Quiz 1. The following select will choose the appropriate <result> elements:

gradebook/student[email!='']/result[@ref='Q01']

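When you get a complex XPath expression like this, it is probably best read from right to left. It will select all <result> elements whose ref attribute has the value Q01, that are children of a <student> element with a non-empty <email> child, which is in turn a child of the <gradebook> element.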

Example B.4, “Complex XPath Selector” shows the XSLT stylesheet.

1

We don’t want an XML file as output; we just want plain text.

2

The content of <xsl:text> is output verbatim into the output file, including the newline before the closing tag. If you are producing a plain text file, <xsl:text> gives you control over whitespace. When you put text into a template that is not surrounded by <xsl:text>, then its whitespace will be determined by the underlying XML/XSLT engine’s whitespace processing.

3

<xsl:value-of> outputs the text content of the selected node. If the select refers to more than one node, it outputs the text of the first node only.

In this case, the XPath selection reaches up to the parent <student> node via .. and then selects its child <first> element.
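Example B.4 is not reproduced in this text. The sketch below is one stylesheet that matches the three numbered notes above. The element and attribute names come from the expressions quoted in this appendix, while the placement of the notes and the exact layout of each output line are assumptions.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

<!-- note 1: produce plain text rather than XML -->
<xsl:output method="text"/>

<xsl:template match="/">
    <xsl:apply-templates
        select="gradebook/student[email!='']/result[@ref='Q01']"/>
</xsl:template>

<xsl:template match="result">
    <!-- note 3: reach up to the parent <student> for the name -->
    <xsl:value-of select="../first"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="../last"/>
    <xsl:text>: </xsl:text>
    <xsl:value-of select="@score"/>
    <!-- note 2: the newline inside <xsl:text> is copied verbatim -->
    <xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>
```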

Although the ability to use predicates gives us some flexibility with transformations, there are times when we would like to take entirely different actions depending upon some condition. That is why XSLT provides the <xsl:if> and <xsl:choose> elements. We will now modify Example B.4, “Complex XPath Selector” to choose all students, regardless of their email status, and print a message next to the scores of those students who have no email. The relevant changes are in Example B.5, “Using <xsl:if>”.

(We could have also used preceding-sibling::email as the value of the test attribute.)
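Example B.5 is not reproduced in this text. As a hedged sketch of the idea, the stylesheet below selects the Quiz 1 results of every student and uses <xsl:if> to add a note when the student has no email. The test expression and the wording of the note are assumptions; the original may phrase the test differently.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

<xsl:output method="text"/>

<xsl:template match="/">
    <!-- no email predicate here: every student is selected -->
    <xsl:apply-templates select="gradebook/student/result[@ref='Q01']"/>
</xsl:template>

<xsl:template match="result">
    <xsl:value-of select="../first"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="../last"/>
    <xsl:text>: </xsl:text>
    <xsl:value-of select="@score"/>
    <!-- true when the parent <student> has no <email> child,
         or has one that is empty -->
    <xsl:if test="not(../email) or ../email = ''">
        <xsl:text> (no email on file)</xsl:text>
    </xsl:if>
    <xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>
```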

If you are a programmer, your next instinct is to ask “where is the <xsl:else> element?” Sorry, but there is none. If you need to do a multi-way test, you need to use <xsl:choose>, which contains one or more <xsl:when> elements, each of which gives a condition. <xsl:otherwise> is the branch taken if none of the other conditions matches. Example B.6, “Example of <xsl:choose>” shows the relevant portion of an XSLT stylesheet that assigns an evaluation to the quiz scores. Because this is XML, we must encode a less than sign as &lt;. For symmetry, we encode the greater than sign as &gt;.
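Example B.6 is not reproduced in this text. The following sketch shows the general shape of a multi-way test with <xsl:choose>; the score thresholds and the evaluation wording are hypothetical values chosen only to illustrate the syntax.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

<xsl:output method="text"/>

<xsl:template match="/">
    <xsl:apply-templates select="gradebook/student/result"/>
</xsl:template>

<xsl:template match="result">
    <xsl:value-of select="@score"/>
    <xsl:text>: </xsl:text>
    <xsl:choose>
        <!-- the first <xsl:when> whose test is true wins -->
        <xsl:when test="@score &gt;= 90">
            <xsl:text>excellent</xsl:text>
        </xsl:when>
        <xsl:when test="@score &lt; 60">
            <xsl:text>needs improvement</xsl:text>
        </xsl:when>
        <!-- taken only if no <xsl:when> matched -->
        <xsl:otherwise>
            <xsl:text>satisfactory</xsl:text>
        </xsl:otherwise>
    </xsl:choose>
    <xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>
```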

It is possible to do arithmetic calculations in XSLT. For example, if you wanted to output the average of the quiz and program scores in the gradebook example, you could use this expression:

<xsl:template match="student">
    <xsl:value-of select="(result[1]/@score + result[2]/@score) div 2"/>
</xsl:template>

Notice that we need to use div for division, since the forward slash is already used to separate steps in a path. If you are doing subtraction, make sure you surround the minus sign with whitespace, because a hyphen could legitimately be part of an element name.

In addition to the normal arithmetic operators, XSLT is provided with a panoply of mathematical and string functions. (You have no idea of how long I have wanted to use “panoply” in a document.) Here are some of the more important ones.

concat()

Takes a variable number of arguments, and returns the result of concatenating them. In the preceding template, we could have output the student’s name from the gradebook with this:

<xsl:value-of select="concat(../first, ' ', ../last)"/>

count()

Takes an XPath expression as its argument and returns the number of items in that node set. Thus, to print the number of tasks in the gradebook, we could say:

<xsl:value-of select="count(/gradebook/task-list/task)"/>

last()

Returns the position number of the last node in the node set currently being processed, which is the same as the number of nodes in that set. Note that count() requires an argument; last() does not.

name()

Returns the qualified name of the current node, including any namespace prefix. If the current node doesn’t have a name, this returns the empty string. If you don’t want the prefix, use the local-name() function instead.

normalize-space()

Strips leading and trailing whitespace from its argument, and replaces any run of internal whitespace with a single space.

position()

Tells which node in the nodeset is currently being processed. This example will print a list of student names, each followed by the phrase “student n of total”:

<xsl:template match="student">
    <xsl:value-of select="concat(first, ' ', last)"/>
    <xsl:text> student </xsl:text>
    <xsl:value-of select="position()"/>
    <xsl:text> of </xsl:text>
    <xsl:value-of select="last()"/><xsl:text>
</xsl:text>
</xsl:template>

substring()

Takes two or three arguments. The first argument is a string to be “sliced.” The second argument is the starting character of the substring. The first character in a string is character number one, not zero! The third argument is the number of characters in the substring. If no third argument is given, then the substring extends to the end of the original string.

sum()

Takes a node set as its argument, converts the string value of each node in the set to a number, and returns the sum of those numbers. If any of the values is non-numeric, then the result will be NaN (Not a Number).
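To show several of these functions working together, here is a small sketch against the gradebook data. It assumes, as the earlier expressions suggest, that each <student> has <first> and <last> children and <result> children with a score attribute; the layout of the output line is an arbitrary choice.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

<xsl:output method="text"/>

<xsl:template match="/">
    <xsl:apply-templates select="gradebook/student"/>
</xsl:template>

<xsl:template match="student">
    <!-- first initial: one character, starting at character 1 -->
    <xsl:value-of select="substring(first, 1, 1)"/>
    <xsl:text>. </xsl:text>
    <!-- trim stray whitespace around the last name -->
    <xsl:value-of select="normalize-space(last)"/>
    <xsl:text>: total </xsl:text>
    <!-- add up the score attributes of all <result> children -->
    <xsl:value-of select="sum(result/@score)"/>
    <xsl:text> over </xsl:text>
    <xsl:value-of select="count(result)"/>
    <xsl:text> results
</xsl:text>
</xsl:template>

</xsl:stylesheet>
```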

Well, if you can perform arithmetic operations and use functions, you must have variables. Indeed, XSLT has variables, but they aren’t variables in the traditional C/C++/Java sense.

You may declare a variable with the <xsl:variable> element. The variable name is given in the name attribute. You set the variable with either a select attribute or the content of the element. When you use a variable, you precede its name with a dollar sign.

A variable can contain more than just a simple string or number; it can contain a whole set of nodes, as shown in Example B.7, “Setting Variables”.

The scope of the variables in the preceding example is the enclosing <xsl:template> element. A variable declared at the top level of the XSLT stylesheet is global and available to all the templates.

XSLT variables are “variable” only in the sense that their values may change each time the template that uses them is invoked. For the duration of the template, the value that is first established cannot be changed.
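Example B.7 is not reproduced in this text. The sketch below illustrates the points made above: a global variable, a local variable holding a node set, and a local variable set from element content. The variable names task-count, results, and label are hypothetical.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

<xsl:output method="text"/>

<!-- top level: global, visible in every template -->
<xsl:variable name="task-count"
    select="count(/gradebook/task-list/task)"/>

<xsl:template match="/">
    <xsl:apply-templates select="gradebook/student"/>
</xsl:template>

<xsl:template match="student">
    <!-- set with select: holds the set of this student's <result> nodes -->
    <xsl:variable name="results" select="result"/>
    <!-- set with element content: holds literal text -->
    <xsl:variable name="label">Average: </xsl:variable>

    <xsl:value-of select="$label"/>
    <xsl:value-of select="sum($results/@score) div count($results)"/>
    <xsl:text> (</xsl:text>
    <xsl:value-of select="$task-count"/>
    <xsl:text> tasks assigned)
</xsl:text>
</xsl:template>

</xsl:stylesheet>
```

Each time the student template is invoked, results and label are established afresh for that student; within one invocation they cannot be reassigned.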

Sometimes you may need to emit the same output at several different stages of a transformation. Rather than duplicate the markup, you may create a named template that contains that markup and then call upon it, much as you would use a traditional subroutine. Example B.8, “Named XSLT Template” shows a template that inserts “boilerplate” markup into an output XHTML document. The starting <xsl:template> element has a name attribute rather than a match attribute.

To invoke this template from anywhere else in the transformation, you need but say <xsl:call-template name="back-to-top"/>.
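Example B.8 is not reproduced in this text. As a sketch, a named template and two calls to it might look like the following; the name back-to-top comes from the call shown above, while the markup the template emits and the elements chosen for the callers are assumptions.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

<!-- a name attribute instead of a match attribute -->
<xsl:template name="back-to-top">
    <p><a href="#top">Back to top</a></p>
</xsl:template>

<xsl:template match="index">
    <ul><xsl:apply-templates/></ul>
    <xsl:call-template name="back-to-top"/>
</xsl:template>

<xsl:template match="endnote">
    <div><xsl:apply-templates/></div>
    <xsl:call-template name="back-to-top"/>
</xsl:template>

</xsl:stylesheet>
```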

Just as subroutines have arguments, templates can have parameters that modify the way they work. Parameters to a template are declared with the <xsl:param> element. The content of this element is the default value for the parameter, in case the call does not pass one. Consider Example B.9, “XSLT Template with Parameters”, a template which displays a money amount in red if it is negative, black if zero or greater. If no parameter is passed to this template, it will presume a value of zero.

The parameter is passed using the <xsl:with-param> element. The parameter value is either the content of the element or the result of its select attribute. Example B.10, “Calling an XSLT Template with a Parameter” shows three calls to this template; two with absolute amounts and one from a selected element’s content.
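Examples B.9 and B.10 are not reproduced in this text. The following sketch follows their description: a template that colors an amount according to its sign, with a default of zero, and three calls to it. The template name show-amount, the parameter name amount, the <balance> element, the literal amounts, and the use of <span> with a style attribute are all hypothetical.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

<xsl:template name="show-amount">
    <!-- the element content is the default value -->
    <xsl:param name="amount">0</xsl:param>
    <xsl:choose>
        <xsl:when test="$amount &lt; 0">
            <span style="color: red"><xsl:value-of select="$amount"/></span>
        </xsl:when>
        <xsl:otherwise>
            <span style="color: black"><xsl:value-of select="$amount"/></span>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

<xsl:template match="balance">
    <!-- value given as element content -->
    <xsl:call-template name="show-amount">
        <xsl:with-param name="amount">-35.50</xsl:with-param>
    </xsl:call-template>
    <!-- value given as a select expression -->
    <xsl:call-template name="show-amount">
        <xsl:with-param name="amount" select="127.25"/>
    </xsl:call-template>
    <!-- value taken from the content of the current element -->
    <xsl:call-template name="show-amount">
        <xsl:with-param name="amount" select="."/>
    </xsl:call-template>
</xsl:template>

</xsl:stylesheet>
```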

This brief summary has only touched the surface of XSLT. XSLT by Doug Tidwell, ISBN 0-596-00053-7, is an excellent resource for learning more.



[25] It will also emit newlines and tabs into the output, but let’s ignore that whitespace to make our lives easier.


Copyright (c) 2005 O’Reilly & Associates, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".