Java and XML, 2nd Edition

5.2. Serialization

One of the most common questions about using DOM is, "I have a DOM tree; how do I write it out to a file?" This question is asked so often because DOM Levels 1 and 2 do not provide a standard means of serialization for DOM trees. While this is a bit of a shortcoming of the API, it provides a great exercise in using DOM (and as you'll see in the next chapter, DOM Level 3 seeks to correct this problem). In this section, to familiarize you with the DOM, I'm going to walk you through a class that takes a DOM tree as input and serializes that tree to a supplied output.

5.2.1. Getting a DOM Parser

Before I talk about outputting a DOM tree, I will give you information on getting a DOM tree in the first place. For the sake of example, all that the code in this chapter does is read in a file, create a DOM tree, and then write that DOM tree back out to another file. However, this still gives you a good start on DOM and prepares you for some more advanced topics in the next chapter.

As a result, there are two Java source files of interest in this chapter. The first is the serializer itself, which is called (not surprisingly) DOMSerializer.java. The second, which I'll start on now, is SerializerTest.java. This class takes in a filename for the XML document to read and a filename for the document to serialize out to. Additionally, it demonstrates how to take in a file, parse it, and obtain the resultant DOM tree object, represented by the org.w3c.dom.Document interface. Go ahead and download this class from the book's web site, or enter the code for the SerializerTest class as shown in Example 5-1.

Example 5-1. The SerializerTest class

package javaxml2;

import java.io.File;
import org.w3c.dom.Document;

// Parser import
import org.apache.xerces.parsers.DOMParser;

public class SerializerTest {

    public void test(String xmlDocument, String outputFilename) 
        throws Exception {

        File outputFile = new File(outputFilename);
        DOMParser parser = new DOMParser( );

        // Get the DOM tree as a Document object

        // Serialize
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println(
                "Usage: java javaxml2.SerializerTest " +
                "[XML document to read] " +
                "[filename to write out to]");
            // Exit with a nonzero status to signal incorrect usage
            System.exit(1);
        }

        try {
            SerializerTest tester = new SerializerTest( );
            tester.test(args[0], args[1]);
        } catch (Exception e) {
            e.printStackTrace( );
        }
    }
}

This example obviously has a couple of pieces missing, represented by the two comments in the test( ) method. I'll supply those in the next two sections, first explaining how to get a DOM tree object, and then detailing the DOMSerializer class itself.

5.2.2. DOM Parser Output

Remember that in SAX, the focus of interest in the parser was the lifecycle of the process, as all the callback methods provided us "hooks" into the data as it was being parsed. In the DOM, the focus of interest lies in the output from the parsing process. Until the entire document is parsed and added into the output tree structure, the data is not in a usable state. The output of a parse intended for use with the DOM interface is an org.w3c.dom.Document object. This object acts as a "handle" to the tree your XML data is in, and in terms of the element hierarchy I've discussed, it is equivalent to one level above the root element in your XML document. In other words, it "owns" each and every element in the XML document input.

Because the DOM standard focuses on manipulating data, there is a variety of mechanisms used to obtain the Document object after a parse. In many implementations, such as older versions of the IBM XML4J parser, the parse( ) method returned the Document object. The code to use such an implementation of a DOM parser would look like this:

File outputFile = new File(outputFilename);
DOMParser parser = new DOMParser( );
Document doc = parser.parse(xmlDocument);

Most newer parsers, such as Apache Xerces, do not follow this methodology. In order to maintain a standard interface across both SAX and DOM parsers, the parse( ) method in these parsers returns void, just as it did in the SAX examples. This change allows an application to use a DOM parser class and a SAX parser class interchangeably; however, it requires an additional method to obtain the Document object that results from the XML parsing. In Apache Xerces, this method is named getDocument( ). Using this type of parser (as I do in the example), you can add the following lines to your test( ) method to obtain the resulting DOM tree from parsing the supplied input file:

    public void test(String xmlDocument, String outputFilename) 
        throws Exception {

        File outputFile = new File(outputFilename);
        DOMParser parser = new DOMParser( );

        // Get the DOM tree as a Document object
        parser.parse(xmlDocument);
        Document doc = parser.getDocument( );

        // Serialize
    }

This of course assumes you are using Xerces, as the import statement at the beginning of the source file indicates:

import org.apache.xerces.parsers.DOMParser;

If you are using a different parser, you'll need to change this import to your vendor's DOM parser class. Then consult your vendor's documentation to determine which of the parse( ) mechanisms you need to employ to get the DOM result of your parse. In Chapter 7, "JDOM", I'll look at Sun's JAXP API and other ways to standardize a means of accessing a DOM tree from any parser implementation. Although there is some variance in getting this result, all the uses of this result that we look at are standard across the DOM specification, so you should not have to worry about any other implementation curveballs in the rest of this chapter.
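As a hedged illustration of the parser-independent route mentioned above, the following sketch obtains a Document through JAXP's DocumentBuilderFactory and DocumentBuilder (standard javax.xml.parsers classes) rather than through a vendor's DOMParser class. The class name JaxpParseSketch and its helper methods are hypothetical and are not part of the book's example code.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the book's code: obtains a DOM
// Document through JAXP instead of a vendor-specific parser class.
final class JaxpParseSketch {

    // Parse the XML file with the given name.
    static Document parseFile(String filename) {
        String systemId = new File(filename).toURI().toASCIIString();
        return parse(new InputSource(systemId));
    }

    // Parse XML held in a String (handy for quick experiments).
    static Document parseString(String xml) {
        return parse(new InputSource(new StringReader(xml)));
    }

    private static Document parse(InputSource source) {
        try {
            // JAXP locates whichever parser implementation is configured
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            // Here parse() hands back the Document directly
            return builder.parse(source);
        } catch (Exception e) {
            // Sketch only: wrap the checked parser and I/O exceptions
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

With a helper like this, the two parser-specific lines in test( ) could be replaced by a single call such as JaxpParseSketch.parseFile(xmlDocument), and no vendor import would be needed.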

5.2.3. DOMSerializer

I've been throwing the term serialization around quite a bit, and should probably make sure you know what I mean. When I say serialization, I simply mean outputting the XML. This could be a file (using a Java File), an OutputStream, or a Writer. There are certainly more output forms available in Java, but these three cover most of the bases (in fact, the latter two do, as a File can be easily converted to a Writer, but accepting a File is a nice convenience feature). In this case, the serialization taking place is in an XML format; the DOM tree is converted back to a well-formed XML document in a textual format. It's important to note that the XML format is used, as you could easily code serializers to write HTML, WML, XHTML, or any other format. In fact, Apache Xerces provides these various classes, and I'll touch on them briefly at the end of this chapter.

5.2.3.1. Getting started

To get you past the preliminaries, Example 5-2 is the skeleton for the DOMSerializer class. It imports all the needed classes to get the code going, and defines the different entry points (for a File, OutputStream, and Writer) to the class. Two of these three methods simply defer to the third (with a little I/O magic). The example also sets up some member variables for the indentation to use, the line separator, and methods to modify those properties.

Example 5-2. The DOMSerializer skeleton

package javaxml2;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DOMSerializer {

    /** Indentation to use */
    private String indent;

    /** Line separator to use */
    private String lineSeparator;

    public DOMSerializer( ) {
        indent = "";
        lineSeparator = "\n";
    }

    public void setIndent(String indent) {
        this.indent = indent;
    }

    public void setLineSeparator(String lineSeparator) {
        this.lineSeparator = lineSeparator;
    }

    public void serialize(Document doc, OutputStream out)
        throws IOException {
        
        Writer writer = new OutputStreamWriter(out);
        serialize(doc, writer);
    }

    public void serialize(Document doc, File file)
        throws IOException {

        Writer writer = new FileWriter(file);
        try {
            serialize(doc, writer);
        } finally {
            // This method opened the writer, so it closes it as well
            writer.close( );
        }
    }

    public void serialize(Document doc, Writer writer)
        throws IOException {

        // Serialize document
    }
}

Save this code in a DOMSerializer.java source file. Whichever entry point is called, control ends up in the version of the serialize( ) method that takes a Writer. Nice and tidy.

5.2.3.2. Launching serialization

With the setup in place for starting serialization, it's time to define the process of working through the DOM tree. One nice facet of DOM already mentioned is that all of the specific DOM structures that represent XML (including the Document object) extend the DOM Node interface. This enables the coding of a single method that handles serialization of all DOM node types. Within that method, you can differentiate between node types, but by accepting a Node as input, it enables a very simple way of handling all DOM types. Additionally, it sets up a methodology that allows for recursion, any programmer's best friend. Add the serializeNode( ) method shown here, as well as the initial invocation of that method in the serialize( ) method (the common code point just discussed):

    public void serialize(Document doc, Writer writer)
        throws IOException {

        // Start serialization recursion with no indenting
        serializeNode(doc, writer, "");
        writer.flush( );
    }
 
    public void serializeNode(Node node, Writer writer, 
                              String indentLevel)
        throws IOException {
    }

Additionally, an indentLevel variable is put in place; this sets us up for recursion. In other words, the serializeNode( ) method can indicate how much the node being worked with should be indented, and when recursion takes place, can add another level of indentation (using the indent member variable). Starting out (within the serialize( ) method), there is an empty String for indentation; if the indent member variable holds two spaces, for example, the next level is indented two spaces, the level after that four spaces, and so on. Of course, as recursive calls unravel, things head back up to no indentation. All that's left now is to handle the various node types.
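To make the way indentation accumulates during recursion concrete, here is a minimal sketch, separate from DOMSerializer, that prints only element names, one per line. The class IndentSketch and its methods are hypothetical; the indent unit is passed in as a parameter instead of being held in a member variable, and the document is parsed through JAXP purely for convenience.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: shows only how indentLevel grows with depth.
final class IndentSketch {

    // Returns one line per element, indented according to its depth.
    static String outline(String xml, String indent) {
        StringBuilder out = new StringBuilder();
        walk(parse(xml), "", indent, out);
        return out.toString();
    }

    private static void walk(Node node, String indentLevel,
                             String indent, StringBuilder out) {
        String childLevel = indentLevel;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indentLevel).append(node.getNodeName()).append('\n');
            // Children of an element sit one indent unit deeper
            childLevel = indentLevel + indent;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), childLevel, indent, out);
        }
        // On return, the caller's shorter indentLevel is back in effect
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Sketch only: wrap the checked parser and I/O exceptions
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Because each call receives its own indentLevel String, nothing has to be undone when a recursive call returns; the shorter indentation simply belongs to the caller's stack frame.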

5.2.3.3. Working with nodes

Once within the serializeNode( ) method, the first task is to determine what type of node has been passed in. Although you could approach this with a Java methodology, using the instanceof keyword and Java reflection, the DOM language bindings for Java make this task much simpler. The Node interface defines a helper method, getNodeType( ), which returns an integer value. This value can be compared against a set of constants (also defined within the Node interface), and the type of Node being examined can be quickly and easily determined. This also fits very naturally into the Java switch construct, which can be used to break up serialization into logical sections. The code here covers almost all DOM node types; although there are some additional node types defined (see Figure 5-2), these are the most common, and the concepts here can be applied to the less common node types as well:

    public void serializeNode(Node node, Writer writer, 
                              String indentLevel)
        throws IOException {

        // Determine action based on node type
        switch (node.getNodeType( )) {
            case Node.DOCUMENT_NODE:
                break;
            
            case Node.ELEMENT_NODE:
                break;
            
            case Node.TEXT_NODE:
                break;

            case Node.CDATA_SECTION_NODE:
                break;

            case Node.COMMENT_NODE:
                break;
            
            case Node.PROCESSING_INSTRUCTION_NODE:
                break;
            
            case Node.ENTITY_REFERENCE_NODE:
                break;
                
            case Node.DOCUMENT_TYPE_NODE: 
                break;                
        }
    }

This code is fairly useless; however, it helps to see all of the DOM node types laid out here in a line, rather than mixed in with all of the code needed to perform actual serialization. I want to get to that now, though, starting with the first node passed into this method, an instance of the Document interface.
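If you want to watch getNodeType( ) at work before filling in the cases, a small stand-alone sketch like the following may help. It maps each node type constant to a label; the class NodeTypeSketch, the labels, and the newDocument( ) helper (which uses JAXP to create an empty Document) are all hypothetical and exist only for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical sketch: labels a node using the constants on Node.
final class NodeTypeSketch {

    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:               return "document";
            case Node.ELEMENT_NODE:                return "element";
            case Node.TEXT_NODE:                   return "text";
            case Node.CDATA_SECTION_NODE:          return "cdata";
            case Node.COMMENT_NODE:                return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "pi";
            case Node.ENTITY_REFERENCE_NODE:       return "entity-ref";
            case Node.DOCUMENT_TYPE_NODE:          return "doctype";
            default:                               return "other";
        }
    }

    // Creates an empty Document that can manufacture nodes to test with.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }
}
```

For example, describe(doc.createComment("hi")) yields "comment", while passing the Document itself yields "document".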

Because the Document interface is an extension of the Node interface, it can be used interchangeably with the other node types. However, it is a special case, as it contains the root element as well as the XML document's DTD and some other special information not within the XML element hierarchy. As a result, you need to extract the root element and pass that back to the serialization method (starting recursion). Additionally, the XML declaration itself is printed out:

            case Node.DOCUMENT_NODE:	
                writer.write("<?xml version=\"1.0\"?>");
                writer.write(lineSeparator);

                Document doc = (Document)node;
                serializeNode(doc.getDocumentElement( ), writer, "");
                break;

WARNING: DOM Level 2 (as well as SAX 2.0) does not expose the XML declaration. This may not seem like a big deal, until you consider that the encoding of the document is included in this declaration. DOM Level 3 is expected to address this deficiency, and I'll cover that in the next chapter. Be careful not to write DOM applications that depend on this information until this feature is in place.

Since the code needs to access a Document-specific method (as opposed to one defined in the generic Node interface), the Node implementation must be cast to the Document interface. Then invoke the object's getDocumentElement( ) method to obtain the root element of the XML input document, and in turn pass that on to the serializeNode( ) method, starting the recursion and traversal of the DOM tree.

Of course, the most common task in serialization is to take a DOM Element and print out its name, attributes, and value, and then print its children. As you would suspect, all of these can be easily accomplished with DOM method calls. First you need to get the name of the XML element, which is available through the getNodeName( ) method within the Node interface. The code then needs to get the children of the current element and serialize these as well. A Node's children can be accessed through the getChildNodes( ) method, which returns an instance of a DOM NodeList. It is trivial to obtain the length of this list, and then iterate through the children calling the serialization method on each, continuing the recursion. There's also quite a bit of logic that ensures correct indentation and line feeds; these are really just formatting issues, and I won't spend time on them here. Finally, the closing tag of the element can be output:

            case Node.ELEMENT_NODE:
                String name = node.getNodeName( );
                writer.write(indentLevel + "<" + name);
                writer.write(">");
                
                // recurse on each child
                NodeList children = node.getChildNodes( );
                if (children != null) {
                    if ((children.item(0) != null) &&
                        (children.item(0).getNodeType( ) == 
                        Node.ELEMENT_NODE)) {
                            
                        writer.write(lineSeparator);
                    }
                    for (int i=0; i<children.getLength( ); i++) {  
                        serializeNode(children.item(i), writer,
                            indentLevel + indent);
                    }
                    if ((children.item(0) != null) &&
                        (children.item(children.getLength( )-1)
                                .getNodeType( ) ==
                        Node.ELEMENT_NODE)) {
                     
                        writer.write(indentLevel);       
                    }
                }
                
                writer.write("</" + name + ">");
                writer.write(lineSeparator);
                break;

Of course, astute readers (or DOM experts) will notice that I left out something important: the element's attributes! These are the only pseudo-exception to the strict tree that DOM builds. They should be an exception, though, since an attribute is not really a child of an element; it's (sort of) lateral to it. Basically the relationship is a little muddy. In any case, the attributes of an element are available through the getAttributes( ) method on the Node interface. This method returns a NamedNodeMap, and that too can be iterated through. Each Node within this list can be polled for its name and value, and suddenly the attributes are handled! Enter the code as shown here to take care of this:

            case Node.ELEMENT_NODE:
                String name = node.getNodeName( );
                writer.write(indentLevel + "<" + name);
                NamedNodeMap attributes = node.getAttributes( );
                for (int i=0; i<attributes.getLength( ); i++) {
                    Node current = attributes.item(i);
                    writer.write(" " + current.getNodeName( ) +
                                 "=\"" + current.getNodeValue( ) +
                                 "\"");
                }
                writer.write(">");
                
                // recurse on each child
                NodeList children = node.getChildNodes( );
                if (children != null) {
                    if ((children.item(0) != null) &&
                        (children.item(0).getNodeType( ) == 
                        Node.ELEMENT_NODE)) {
                            
                        writer.write(lineSeparator);
                    }
                    for (int i=0; i<children.getLength( ); i++) {
                      serializeNode(children.item(i), writer,
                            indentLevel + indent);
                    }
                    if ((children.item(0) != null) &&
                        (children.item(children.getLength( )-1)
                                .getNodeType( ) ==
                        Node.ELEMENT_NODE)) {
                     
                        writer.write(indentLevel);       
                    }
                }
                
                writer.write("</" + name + ">");
                writer.write(lineSeparator);
                break;
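The attribute loop can also be tried in isolation. The sketch below applies the same NamedNodeMap iteration to a single Element and returns the result as a String; AttributeSketch and its methods are hypothetical names. One point to keep in mind is that the DOM specification does not promise any particular order for the items in a NamedNodeMap, so attributes may not come back in the order they appeared in the source document.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical sketch: formats an element's attributes as name="value".
final class AttributeSketch {

    static String attributeString(Element element) {
        StringBuilder out = new StringBuilder();
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            // Each item is an Attr, handled here through the Node interface
            Node current = attributes.item(i);
            out.append(' ')
               .append(current.getNodeName())
               .append("=\"")
               .append(current.getNodeValue())
               .append('"');
        }
        return out.toString();
    }

    // Creates an empty Document that can manufacture elements to test with.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }
}
```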

Next on the list of node types is Text nodes. Output is quite simple, as you only need to use the now-familiar getNodeValue( ) method of the DOM Node interface to get the textual data and print it out; the same is true for CDATA nodes, except that the data within a CDATA section should be enclosed within the CDATA XML semantics (surrounded by <![CDATA[ and ]]>). You can add the logic within those two cases now:

            case Node.TEXT_NODE:
                writer.write(node.getNodeValue( ));
                break;

            case Node.CDATA_SECTION_NODE:
                writer.write("<![CDATA[" +
                             node.getNodeValue( ) + "]]>");
                break;
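One caveat about the cases above: by the time text reaches the DOM tree, the parser has already turned references such as &amp;amp; and &amp;lt; into literal characters, so writing getNodeValue( ) back out verbatim can produce a document that is no longer well-formed. The following sketch shows one way the values might be escaped first. EscapeSketch, escapeText( ), and escapeAttribute( ) are hypothetical helpers, not part of DOMSerializer, and they cover only the basic markup characters.

```java
// Hypothetical helpers: escape markup characters before writing
// Text node values or attribute values.
final class EscapeSketch {

    // Escapes the characters that are significant in element content.
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Attribute values written inside double quotes must also
    // escape the double quote character itself.
    static String escapeAttribute(String value) {
        return escapeText(value).replace("\"", "&quot;");
    }
}
```

In the serializer, this would mean writing escapeText(node.getNodeValue( )) in the Text case and escapeAttribute(current.getNodeValue( )) in the attribute loop. CDATA sections are left alone, since their whole purpose is to carry unescaped character data.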

Dealing with comments in DOM is about as simple as it gets. The getNodeValue( ) method returns the text within the <!-- and --> XML constructs. That's really all there is to it; see this code addition:

            case Node.COMMENT_NODE:
                writer.write(indentLevel + "<!-- " +
                             node.getNodeValue( ) + " -->");
                writer.write(lineSeparator);
                break;

Moving on to the next DOM node type: the DOM bindings for Java define an interface to handle processing instructions that are within the input XML document, rather obviously called ProcessingInstruction. This is useful, as these instructions do not follow the same markup model as XML elements and attributes, but are still important for applications to know about. In the table of contents XML document, there aren't any PIs present (although you could easily add some for testing).

The PI node in the DOM is a little bit of a break from what you have seen so far: to fit the syntax into the Node interface model, the getNodeValue( ) method returns all data instructions within a PI in one String. This allows quick output of the PI; however, you still need to use getNodeName( ) to get the name of the PI. If you were writing an application that received PIs from an XML document, you might prefer to use the actual ProcessingInstruction interface; although it exposes the same data, the method names (getTarget( ) and getData( )) are more in line with a PI's format. With this understanding, you can add in the code to print out any PIs in supplied XML documents:

            case Node.PROCESSING_INSTRUCTION_NODE:
                writer.write("<?" + node.getNodeName( ) +
                             " " + node.getNodeValue( ) +
                             "?>");                
                writer.write(lineSeparator);
                break;
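For comparison, here is a short sketch of the same output produced through the ProcessingInstruction interface instead of the generic Node methods. PISketch and its helpers are hypothetical; getTarget( ) and getData( ) are the standard DOM methods mentioned above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.ProcessingInstruction;

// Hypothetical sketch: formats a PI using the PI-specific accessors.
final class PISketch {

    static String format(ProcessingInstruction pi) {
        // getTarget() matches getNodeName(); getData() matches getNodeValue()
        return "<?" + pi.getTarget() + " " + pi.getData() + "?>";
    }

    // Creates an empty Document that can manufacture PIs to test with.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }
}
```

Inside serializeNode( ), using this form would require casting the Node to ProcessingInstruction first, which is why the generic methods are slightly more convenient there.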

While the code to deal with PIs is perfectly workable, there is a problem. In the case that handled document nodes, all the serializer did was pull out the document element and recurse. The problem is that this approach ignores any other child nodes of the Document object, such as top-level PIs and any DOCTYPE declarations. Those node types are actually lateral to the document element (root element), and are ignored. Instead of just pulling out the document element, then, the following code serializes all child nodes on the supplied Document object:

            case Node.DOCUMENT_NODE:
                writer.write("<?xml version=\"1.0\"?>");
                writer.write(lineSeparator);

                // recurse on each child
                NodeList nodes = node.getChildNodes( );
                if (nodes != null) {
                    for (int i=0; i<nodes.getLength( ); i++) {
                        serializeNode(nodes.item(i), writer, "");
                    }
                }
                /*
                Document doc = (Document)node;
                serializeNode(doc.getDocumentElement( ), writer, "");
                */
                break;
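To see for yourself that the Document has children other than the root element, a quick sketch like the one below lists the node names of everything directly under the Document. DocumentChildrenSketch is a hypothetical class, and the document is parsed through JAXP only to keep the example self-contained. Note that the XML declaration never shows up in the list, since it is not a node at all.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: lists the direct children of the Document node.
final class DocumentChildrenSketch {

    static List<String> topLevelNames(String xml) {
        Document doc = parse(xml);
        List<String> names = new ArrayList<>();
        NodeList nodes = doc.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            // Comments report "#comment"; PIs report their target
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Sketch only: wrap the checked parser and I/O exceptions
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

A serializer that asks only for getDocumentElement( ) would see just the last of these three kinds of nodes.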

With this in place, the code can deal with DocumentType nodes, which represent a DOCTYPE declaration. Like PIs, a DTD declaration can be helpful in exposing external information that might be needed in processing an XML document. However, since there can be public and system IDs as well as other DTD-specific data, the code needs to cast the Node instance to the DocumentType interface to access this additional data. Then, use the helper methods to get the name of the Node, which returns the name of the element in the document that is being constrained, the public ID (if it exists), and the system ID of the DTD referenced. Using this information, the original DTD can be serialized:

            case Node.DOCUMENT_TYPE_NODE: 
                DocumentType docType = (DocumentType)node;
                writer.write("<!DOCTYPE " + docType.getName( ));
                if (docType.getPublicId( ) != null)  {
                    writer.write(" PUBLIC \"" + 
                        docType.getPublicId( ) + "\" ");              
                } else {
                    writer.write(" SYSTEM ");
                }
                writer.write("\"" + docType.getSystemId( ) + "\">");
                writer.write(lineSeparator);
                break;
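The DOCTYPE logic is easy to experiment with on its own. The sketch below builds the declaration as a String, and it also guards against a missing system ID, which is an assumption on my part about how you might want to handle a DOCTYPE that has only an internal subset. DoctypeSketch and its methods are hypothetical; the DocumentType is created through the standard DOMImplementation.createDocumentType( ) method so that no DTD has to be loaded.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.DocumentType;

// Hypothetical sketch: rebuilds a DOCTYPE declaration from its parts.
final class DoctypeSketch {

    static String format(DocumentType docType) {
        StringBuilder out =
            new StringBuilder("<!DOCTYPE ").append(docType.getName());
        String publicId = docType.getPublicId();
        String systemId = docType.getSystemId();
        if (publicId != null) {
            out.append(" PUBLIC \"").append(publicId).append('"');
            if (systemId != null) {
                out.append(" \"").append(systemId).append('"');
            }
        } else if (systemId != null) {
            out.append(" SYSTEM \"").append(systemId).append('"');
        }
        // The internal subset, if any, is not reproduced in this sketch
        return out.append('>').toString();
    }

    // Creates a stand-alone DocumentType without parsing any document.
    static DocumentType create(String name, String publicId,
                               String systemId) {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().getDOMImplementation();
            return impl.createDocumentType(name, publicId, systemId);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }
}
```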

All that's left at this point is handling entities and entity references. In this chapter, I will skim over entities and focus on entity references; more details on entities and notations are in the next chapter. For now, a reference can simply be output with the & and ; characters surrounding it:

            case Node.ENTITY_REFERENCE_NODE:
                writer.write("&" + node.getNodeName( ) + ";");    
                break;

There are a few surprises that may trip you up when it comes to the output from a node such as this. The definition of how entity references should be processed within DOM allows a lot of latitude, and also relies heavily on the underlying parser's behavior. In fact, most XML parsers have expanded and processed entity references before the XML document's data ever makes its way into the DOM tree. Often, when expecting to see an entity reference within your DOM structure, you will find the text or values referenced rather than the entity reference itself. To test this for your parser, you'll want to run the SerializerTest class on the contents.xml document (which I'll cover in the next section) and see what it does with the OReillyCopyright entity reference. In Apache Xerces, by the way, this comes across as an entity reference.
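If you would rather probe this behavior without the full serializer, the following sketch counts the EntityReference nodes that end up in a tree. It goes through JAXP, where DocumentBuilderFactory.setExpandEntityReferences( ) is the standard switch for this behavior; how faithfully the setting is honored is up to the underlying parser. EntityRefSketch is a hypothetical class, and in the usage shown afterward the entity is declared in an internal DTD subset, which is a simplification made here so the example needs no external files.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: reports how many entity reference nodes
// survive parsing under a given expansion setting.
final class EntityRefSketch {

    static int countEntityRefs(String xml, boolean expand) {
        try {
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            // true asks the parser to replace references with their content
            factory.setExpandEntityReferences(expand);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return count(doc);
        } catch (Exception e) {
            // Sketch only: wrap the checked parser and I/O exceptions
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    private static int count(Node node) {
        int total =
            (node.getNodeType() == Node.ENTITY_REFERENCE_NODE) ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            total += count(children.item(i));
        }
        return total;
    }
}
```

Calling countEntityRefs( ) on a document such as <!DOCTYPE book [<!ENTITY OReillyCopyright "Copyright O'Reilly">]><book>&OReillyCopyright;</book> with expand set to true should report zero, since the reference has been replaced by its text. With expand set to false, the result depends on your parser, which is exactly the variability described above.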

And that's it! As I mentioned, there are a few other node types, but covering them isn't worth the trouble at this point; you get the idea about how DOM works. In the next chapter, I'll take you deeper than you probably ever wanted to go. For now, let's put the pieces together and see some results.

5.2.4. The Results

With the DOMSerializer class complete, all that's left is to invoke the serializer's serialize( ) method in the test class. To do this, add the following lines to the SerializerTest class:

    public void test(String xmlDocument, String outputFilename) 
        throws Exception {

        File outputFile = new File(outputFilename);
        DOMParser parser = new DOMParser( );

        // Get the DOM tree as a Document object
        parser.parse(xmlDocument);
        Document doc = parser.getDocument( );

        // Serialize
        DOMSerializer serializer = new DOMSerializer( );
        serializer.serialize(doc, outputFile);
    }

This fairly simple addition completes the classes, and you can run the example on the contents.xml file from Chapter 2, "Nuts and Bolts", as shown:

C:\javaxml2\build>java javaxml2.SerializerTest 
    c:\javaxml2\ch05\xml\contents.xml
    output.xml

While you don't get any exciting output here, you can open up the newly created output.xml file and check it over for accuracy. It should contain all the information in the original XML document, with only the differences already discussed in previous sections. A portion of my output.xml is shown in Example 5-3.

Example 5-3. A portion of the output.xml serialized DOM tree

<?xml version="1.0"?>
<!DOCTYPE book SYSTEM "DTD/JavaXML.dtd">
<!--  Java and XML Contents  -->
<book xmlns="http://www.oreilly.com/javaxml2" 
      xmlns:ora="http://www.oreilly.com">
  <title ora:series="Java">Java and XML</title>


  <!--  Chapter List  -->

  <contents>
    <chapter number="2" title="Nuts and Bolts">
      <topic name="The Basics"></topic>

      <topic name="Constraints"></topic>

      <topic name="Transformations"></topic>

      <topic name="And More..."></topic>

      <topic name="What's Next?"></topic>

    </chapter>

You may notice that there is quite a bit of extra whitespace in the output; that's because the serializer adds some line feeds every time writer.write(lineSeparator) appears in the code. Of course, the underlying DOM tree has some line feeds in it as well, which are reported as Text nodes. The end result in many of these cases is the double line breaks, as seen in the output.
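You can confirm that the line feeds really are in the tree with a small sketch like this one, which counts the whitespace-only Text nodes directly under the root element. WhitespaceSketch is a hypothetical class, and it assumes a non-validating parse through JAXP, where the parser has no DTD to tell it which whitespace is ignorable.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: counts whitespace-only Text children of the root.
final class WhitespaceSketch {

    static int countWhitespaceOnlyText(String xml) {
        Node root = parse(xml).getDocumentElement();
        NodeList children = root.getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Sketch only: wrap the checked parser and I/O exceptions
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

A contents element with one chapter child on its own indented line reports two such nodes: the line break and indentation before the child, and the line break after it.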

WARNING: Let me be very clear that the DOMSerializer class shown in this chapter is for example purposes, and is not a good production solution. While you are welcome to use the class in your own applications, realize that several important options are left out, like encoding and setting advanced options for indentation, line feeds, and line wrapping. Additionally, entities are handled only in passing (complete treatment would be twice as long as this chapter already is!). Your parser probably has its own serializer class, if not multiple classes, that performs this task at least as well as, if not better than, the example in this chapter. However, you now should understand what's going on under the hood in those classes. As a matter of reference, if you are using Apache Xerces, the classes to look at are in the org.apache.xml.serialize package. Some particularly useful ones are the XMLSerializer, XHTMLSerializer, and HTMLSerializer. Check them out -- they offer a good solution, until DOM Level 3 comes out with a standardized one.
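For readers working with a DOM implementation that supports the DOM Level 3 Load and Save module, the standardized serializer anticipated above is the LSSerializer interface in the org.w3c.dom.ls package. The sketch below is one minimal way to use it, assuming the implementation reports the "LS" feature (the parser bundled with current JDKs does). LsSerializerSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical sketch: serializes a Document with DOM Level 3 Load and Save.
final class LsSerializerSketch {

    static String toXml(Document doc) {
        // Ask the DOM implementation for its Load and Save support
        Object feature = doc.getImplementation().getFeature("LS", "3.0");
        if (feature == null) {
            throw new IllegalStateException(
                "This DOM implementation does not support Load and Save");
        }
        LSSerializer serializer =
            ((DOMImplementationLS) feature).createLSSerializer();
        return serializer.writeToString(doc);
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Sketch only: wrap the checked parser and I/O exceptions
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

The returned String normally begins with an XML declaration chosen by the serializer, so compare the element content rather than the whole String when checking results.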




Copyright © 2002 O'Reilly & Associates. All rights reserved.