You may want to split the output for a large document into several HTML files. That process is known in DocBook as chunking, and the individual output files are called chunks. The results are a coherent set of linked files, with a title page containing a table of contents as the starting point for browsing the set.
You get chunked output by processing your XML input file with the html/chunk.xsl stylesheet file instead of the standard html/docbook.xsl file. For example:
xsltproc /usr/share/docbook-xsl/html/chunk.xsl myfile.xml
The default behavior in chunking includes:
The name of the main title page/table of contents file is index.html.
Each of the following elements starts a new chunk:
appendix
article
bibliography (in article or book)
book
chapter
colophon
glossary (in article or book)
index (in article or book)
part
preface
refentry
reference
sect1 (except first)
section (if equivalent to sect1)
set
setindex
Each chunk filename is generated with an algorithm. It can instead be named after the id attribute value of its starting element, if it has one (see the section “Generated filename”).
A message is displayed for each chunk filename that is generated. If you prefer not to see those messages, then set the chunk.quietly parameter to 1.
Each chunk has to have a filename. The filename (before adding .html) can come from four sources, selected in this order:
A dbhtml filename processing instruction embedded in the element.
If it is the root element of the document, then the chunk is named using the value of the parameter root.filename, which is index by default.
The chunk element's id attribute value (but only if the use.id.as.filename parameter is set).
A unique name generated by the stylesheet.
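The selection order above can be modeled in a few lines of Python. This is only a sketch of the precedence as described here, not the stylesheet's actual XSLT. The function name chunk_filename and its arguments are hypothetical, and the generated name is simply passed in by the caller.

```python
def chunk_filename(pi_filename, is_root, element_id, generated_name,
                   root_filename="index", use_id_as_filename=False,
                   html_ext=".html"):
    """Pick a chunk filename using the precedence described in the text."""
    # 1. A dbhtml filename PI wins and is used verbatim (no extension added).
    if pi_filename:
        return pi_filename
    # 2. The root element uses root.filename, unless that parameter is blank.
    if is_root and root_filename:
        return root_filename + html_ext
    # 3. The id attribute is used only when use.id.as.filename is turned on.
    if use_id_as_filename and element_id:
        return element_id + html_ext
    # 4. Otherwise fall back to the name generated by the stylesheet.
    return generated_name + html_ext


print(chunk_filename(None, False, "intro", "ch01"))
print(chunk_filename(None, False, "intro", "ch01", use_id_as_filename=True))
```

With the defaults, the first call falls through to the generated name, while the second call uses the id because the (modeled) use.id.as.filename setting is on.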
You can embed processing instructions (PI) in your DocBook XML files that instruct the XSL stylesheets what filename to use for a chunk. Here is an example:
<chapter><?dbhtml filename="intro.html" ?>
<title>Introduction</title>
...
The dbhtml name indicates that this processing instruction is intended for DocBook HTML processing. This dbhtml filename processing instruction says that the HTML chunk file for this chapter should be named intro.html. The stylesheet does not add a filename extension when dbhtml filename is used. The processing instruction needs to be an immediate child of the element you are naming, not inside one of its children. For example, it won't work if you put it inside the title element of a chapter. If there is more than one such PI in an element, then the first one is used.
If the element that starts a new chunk has an id attribute, then that value can be used as the start of the chunk filename. The stylesheet parameter use.id.as.filename controls that behavior. If that parameter is set to a non-zero value, then your chunk filenames will use the element's id attribute. By default, the parameter is set to zero, so you have to turn that behavior on if you want it. For example:
<chapter id="intro">
<title>Introduction</title>
...
This will work for all elements that have an id value and that start a chunk, except for the main index file. By default, that file is named using the value of the root.filename parameter, whose value is index by default. To use your document root element's id as that filename, set the root.filename parameter to blank.
When the id value is used, the .html filename extension is automatically added. You can change the default extension by setting the html.ext parameter to some other extension, including the dot.
If not specified by a PI or id attribute, then the XSL stylesheet will generate a filename. The names are abbreviations of the element name and a count. For example, the first chapter element would be ch01.html, the second chapter would be ch02.html, and so on. The first sect1 in a chapter might be s01.html. But that filename would not be unique if each chapter had a sect1. To make each sect1 name unique, the stylesheet prepends the chapter part. So the first sect1 in the second chapter would be chunked into ch02s01.html. In general, the stylesheet keeps adding parent prefixes to make sure each name is unique. If a document is a set with multiple books, then the stylesheet would also add a book prefix to make a name like bk01ch02s01.html.
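The naming pattern can be illustrated with a small Python sketch. It is not the stylesheet's algorithm. It assumes the caller already knows which ancestors contribute a prefix, and it uses only the abbreviations that appear in the examples above (bk, ch, s) with two-digit counts. The function name generated_filename is hypothetical.

```python
def generated_filename(path, html_ext=".html"):
    """Build a name from (abbreviation, count) pairs, outermost element first."""
    # Each contributing element adds its abbreviation plus a zero-padded count.
    return "".join(f"{abbrev}{count:02d}" for abbrev, count in path) + html_ext


# First sect1 of the second chapter in a single-book document.
print(generated_filename([("ch", 2), ("s", 1)]))

# The same section when the document is a set and a book prefix is needed.
print(generated_filename([("bk", 1), ("ch", 2), ("s", 1)]))
```

Because the counts are positional, inserting a new chapter before the second one would turn ch02s01.html into ch03s01.html, which is why generated names are poor targets for external links.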
The names are not pretty, but they do have a recognizable logic. They are also somewhat stable, as opposed to random number names that might have been used instead. But the filenames may change if the document is edited, because when you insert a chapter, subsequent chapters are bumped up in number. If you are creating a website in which other files refer to these chunk filenames, then they are moving targets unless the document never changes. If you want to point to your generated files, it's best not to use generated filenames, and instead to use one of the other methods to name them. Using the id attribute is the easiest.
The first thing you will notice when you chunk a document is that it can produce a lot of HTML files! Suddenly your directory is very crowded with new HTML files. When chunking, most people choose to place the chunked files into a separate directory.
One method that does not work is to use the processor's --output option. That option is used to redirect the standard output of the processor to a file. During chunking, the stylesheet creates the filenames and files, and also needs to handle the directory location.
You inform the stylesheet of the desired directory location using the base.dir parameter. For example, to output the chunked files to the /usr/apache/htdocs directory:
xsltproc --stringparam base.dir /usr/apache/htdocs/ chunk.xsl myfile.xml
Things to watch out for:
Be sure to include that trailing '/' because the stylesheet simply appends the filename to this string. If you forget the trailing slash, you'll end up with all your filenames beginning with that name.
The stylesheets can create files, but cannot create directories. So create any directories before running the XSL processor.
Be aware that the base.dir parameter only works with the chunk stylesheet, not the regular docbook.xsl stylesheet. It does work with the onechunk.xsl stylesheet, though.
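The first two cautions can be made concrete with a short Python sketch. It assumes, as the text says, that the output path is the base.dir string with the filename appended and nothing in between. The helper names chunk_output_path and ensure_output_dir are hypothetical and are not part of the stylesheets.

```python
import os


def chunk_output_path(base_dir, filename):
    """Model the plain string append: no separator is inserted."""
    return base_dir + filename


def ensure_output_dir(base_dir):
    """Create the output directory up front, since the stylesheets will not."""
    os.makedirs(base_dir, exist_ok=True)


# With the trailing slash the file lands inside the directory.
print(chunk_output_path("/usr/apache/htdocs/", "index.html"))

# Without it, the directory name becomes a filename prefix instead.
print(chunk_output_path("/usr/apache/htdocs", "index.html"))
```

Calling ensure_output_dir (or running mkdir -p by hand) before the XSL processor avoids a failed run when the target directory is missing.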
You can also use a dbhtml dir processing instruction to modify where the chunked output goes. For example:
<book><?dbhtml dir="UserGuide" ?>
<title>User Guide</title>
...
<chapter id="intro">
...
This sets the output directory to be UserGuide for the root element chunk and all of its children and descendants (unless otherwise specified). Since this is a relative pathname, the output will be relative to the current directory. So in this example the root element chunk will be UserGuide/index.html, and the first chapter chunk will be in UserGuide/intro.html since it is a child of the book element. Note that the dbhtml dir value does not have a trailing slash because the stylesheet inserts one.
If the base.dir parameter is set, then that value is prepended to the dir value. For example, you could process the above file using:
xsltproc --stringparam base.dir /usr/apache/htdocs/ chunk.xsl myfile.xml
Then the root element chunk will be in /usr/apache/htdocs/UserGuide/index.html.
Remember that base.dir does need a trailing slash.
If any of the descendants of the root element also have a dbhtml dir processing instruction, then that value is appended to the ancestor value. That means it is relative to its ancestor element's directory. This allows you to build up a longer pathname to divide the output into several subdirectories of the main directory. For example:
<book><?dbhtml dir="UserGuide" ?>
<title>User Guide</title>
...
<chapter id="intro"><?dbhtml dir="FrontMatter" ?>
...
<chapter id="installing">
...
<appendix id="reference"><?dbhtml dir="BackMatter" ?>
...
Now the output chunks will be:
UserGuide/index.html
UserGuide/FrontMatter/intro.html
UserGuide/installing.html
UserGuide/BackMatter/reference.html
Note that the second chapter is not a child of the first chapter, so its directory reverts to that of the book-level PI. Again, if the base.dir parameter is set, then all of these become relative to that value. Remember that you need to create any directories you specify, because the stylesheets won't.
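How the directory pieces combine can be sketched in Python. This models only the path arithmetic described above: base.dir first, then each dbhtml dir value from the root element down to the chunk element with a slash after each, then the filename. The function name chunk_path is hypothetical, and the filenames assume that id-based naming is in effect, as in the example.

```python
def chunk_path(dir_chain, filename, base_dir=""):
    """Join base.dir, the inherited dbhtml dir values, and the chunk filename.

    dir_chain lists the dbhtml dir value of each element from the root down
    to the chunk element, with None where an element has no such PI.
    """
    # The stylesheet supplies the slash after each dir value; base.dir must
    # already end with one because it is used as given.
    dirs = "".join(d + "/" for d in dir_chain if d)
    return base_dir + dirs + filename


print(chunk_path(["UserGuide"], "index.html"))
print(chunk_path(["UserGuide", "FrontMatter"], "intro.html"))
print(chunk_path(["UserGuide", None], "installing.html"))
print(chunk_path(["UserGuide", "BackMatter"], "reference.html",
                 base_dir="/usr/apache/htdocs/"))
```

The second chapter has no PI of its own, so only the book-level directory applies to it, matching the list of output chunks above.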
The dbhtml dir processing instruction can be used to specify a full pathname if you don't use a base.dir parameter, but that's not a good idea. That hard codes the path into your file, which means you have to edit the file to put the output elsewhere. Generally this PI is used to create directories relative to some base output directory that you specify on the command line with a parameter. That gives you the flexibility to put the output where you want, yet maintains the relative structure of the subdirectories specified by the PIs.
In all cases, cross references between your chunked files should still resolve, regardless of what the relative locations are.
If you are chunking large documents, then there is a stylesheet option you can turn on that will speed up the processing. The caveat is that the XSL processor you are using must support the EXSLT node-set() function. That includes Saxon, Xalan, and xsltproc. It does not include MSXSL, however.
To speed up chunking, set the chunk.fast parameter to a non-zero value. Without this parameter being set, the calculation of the Next and Previous elements for each chunk is performed each time a chunk is output. That calculation requires searching the document using XPath, which can take some time for large documents. When this parameter is set, those calculations are all done ahead of time so that output can proceed without delay.
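The idea behind doing the calculations ahead of time can be shown with a generic Python sketch. It is not how the stylesheet implements chunk.fast. It only illustrates building a Previous/Next table in one pass over an ordered list of chunks, so that each chunk can look up its neighbors instead of searching the document again. The function name and the chunk names are hypothetical.

```python
def precompute_navigation(chunk_ids):
    """Map each chunk to its (previous, next) neighbors in a single pass."""
    nav = {}
    for i, chunk_id in enumerate(chunk_ids):
        prev_id = chunk_ids[i - 1] if i > 0 else None
        next_id = chunk_ids[i + 1] if i + 1 < len(chunk_ids) else None
        nav[chunk_id] = (prev_id, next_id)
    return nav


nav = precompute_navigation(["index", "ch01", "ch02"])
print(nav["ch01"])
```

Each lookup afterward is a constant-time dictionary access, whereas a per-chunk search grows with the size of the document.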
When chunking a book, the DocBook XSL stylesheets normally put the table of contents (TOC) in the same chunk as the book's title page. The stylesheets provide options for generating separate chunks for the table of contents, and for any lists of titles such as List of Tables.
If you set the stylesheet parameter chunk.tocs.and.lots to 1, then the stylesheet will generate a separate chunk that contains the table of contents and all the lists of titles. The title page chunk will then contain a link to the new chunk. If you also set the parameter chunk.separate.lots to 1, then each of the lists of titles will get a separate chunk as well. If you set only chunk.separate.lots to 1, then your table of contents will appear in the title page chunk, and only the lists of titles will get separate chunks. The chunk.separate.lots parameter was added in version 1.66.1 of the stylesheets.
The chunk.toc parameter does not generate a separate table of contents chunk. Rather, it is used to manually designate chunking boundaries. See the section “Manually control chunking” for more information.
There are three options in the DocBook XSL stylesheets for controlling what gets chunked:
Set the parameters chunk.section.depth and/or chunk.first.sections.
Chunk based on a manually edited table of contents file.
Modify the chunk template.
If you only want to control what section levels get put into separate HTML files, then you should set the chunk.section.depth parameter. By default it is set to 1. So if you want sect1 and sect2 elements to be chunked into individual files, set the parameter to 2.
The chunk stylesheet by default includes the first sect1 of a chapter (or article) with the content that precedes it in the chapter. If you want those also to be chunked to separate files, then set the chunk.first.sections parameter to 1.
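The interaction of the two parameters can be sketched as a small decision function in Python. It is a model of the description above, not the stylesheet's chunk template. As a labeled assumption, it applies the first-section rule at every chunked level, although the text states it only for sect1. The function and argument names are hypothetical.

```python
def section_starts_chunk(level, is_first, chunk_section_depth=1,
                         chunk_first_sections=0):
    """Decide whether a section at the given nesting level gets its own file."""
    # Sections nested deeper than chunk.section.depth stay with their parent.
    if level > chunk_section_depth:
        return False
    # Assumption: a first section stays with the preceding content unless
    # chunk.first.sections is turned on.
    if is_first and not chunk_first_sections:
        return False
    return True


print(section_starts_chunk(1, is_first=True))
print(section_starts_chunk(2, is_first=False, chunk_section_depth=2))
```

With the defaults, the first sect1 of a chapter stays in the chapter's file, and a sect2 is chunked only after the depth is raised to 2.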
If the standard chunking process doesn't meet your needs, and you are willing to manually intervene, then you can completely control how content gets chunked. This might be useful if some sections are very short and you would rather keep them together. But since it requires hand editing of a generated table of contents file, it is only useful if done infrequently or with documents that have stable structure.
Here are the steps for manually chunking HTML output:
Process your document with the special maketoc.xsl stylesheet, which generates an XML table of contents file. Using xsltproc for example:
xsltproc --output mytoc.xml html/maketoc.xsl myfile.xml
Edit the generated mytoc.xml file to remove any tocentry elements that you don't want chunked, or add entries that you do want chunked.
Process your document with the special chunktoc.xsl stylesheet instead of the regular chunk.xsl stylesheet, and pass it the generated TOC filename in the chunk.toc parameter. For example:
xsltproc --output output/ \
    --stringparam chunk.toc mytoc.xml \
    html/chunktoc.xsl myfile.xml
This will chunk your document based on the entries in the generated TOC file. You can still use any of the chunking parameters to modify the chunking behavior.
If you also want the HTML TOC that is produced during chunking to match your XML TOC file, then set the parameter manual.toc to that same filename.
When you use this process, you must have an id attribute on every element that you want to start a new chunk. This includes the document element, which generates the title page and table of contents. You can see which elements don't have an id by examining the generated TOC file and looking for empty id attributes in the tocentry elements. Any such entries will be merged with their parent elements during chunking.
If you want to control what elements produce chunks, beyond just the section level choice, then you must modify the templates that do chunk processing. See the section “Chunking customization” for more information.
You may need to change the output encoding for your chunked HTML files. The chunker.output.encoding parameter lets you change the HTML character encoding from its default value of ISO-8859-1. For example, if you want your HTML files to use UTF-8 encoding instead, you could process your document with the following:
xsltproc --output output/ \
    --stringparam chunker.output.encoding UTF-8 \
    html/chunk.xsl myfile.xml
This will produce the following line in each chunked HTML file:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
It will also encode the HTML content itself using UTF-8 encoding. When a browser opens the file, the meta tag informs it that the file is encoded in UTF-8, so it will decode and display the text as UTF-8. This feature is only available with Saxon and XSL processors that support EXSLT extensions (such as xsltproc). It does not work in Xalan, however.
By default, chunked HTML output from Saxon will not contain any non-ASCII characters, regardless of the encoding you specify. Any non-ASCII characters will be represented as named or numerical entities. This behavior is controlled by the saxon.character.representation stylesheet parameter. See the section “Saxon output character representation” for more information.
The default output encoding for XHTML is UTF-8, as described in the section “XHTML”.
You may want to specify a particular DOCTYPE at the top of your chunked HTML files. This is most useful for XHTML output where you may want to validate the chunked files against the DTD.
There are two stylesheet parameters for the chunking stylesheet that affect the DOCTYPE:
chunker.output.doctype-public
Specifies the PUBLIC identifier of the DTD in the DOCTYPE.
chunker.output.doctype-system
Specifies the SYSTEM identifier of the DTD in the DOCTYPE.
See the section “Generating XHTML” for an example of using these parameters.
Unfortunately, there is no way to add an internal subset to the output DTD using XSLT. If you don't know what an internal DTD subset is, then you probably don't need it. See a good XML reference for more information.
If you use a text editor to open an HTML file produced by DocBook XSL, you will notice that by default it produces long text lines that contain many elements. If you would prefer your HTML elements to start on a new line and have nested indents to show the HTML element structure, you can do that by setting the chunker.output.indent parameter to yes. Note that this feature is only available with XSL processors that support EXSLT extensions, but that includes most of the major ones.
There are limits to which HTML elements can start an indented line. In general, any element that permits #PCDATA (plain text) as part of its content model will not allow the extra line breaks inside it. That is because white space must be respected inside such elements, and that respect includes not adding extra white space.
To add indentation with the non-chunking docbook.xsl stylesheet, you need to use a customization layer with an xsl:output element similar to the example in the section “Output encoding”. Use the indent="yes" attribute value to turn on indentation. The other approach for single-file output is to use the onechunk.xsl stylesheet and its extra parameters, as described in the section “Single file options with onechunk”.
DocBook XSL: The Complete Guide - 3rd Edition | Copyright © 2002-2005 Sagehill Enterprises