Advanced SAX (Java & XML, 2nd Edition)

The last chapter was a good introduction to SAX. However, there are several more topics that will round out your knowledge of SAX. While I've called this chapter "Advanced SAX," don't be intimidated. It could just as easily be called "Less-Used Portions of SAX that are Still Important." In writing these two chapters, I followed the 80/20 principle: 80% of you will probably never need to use the material in this chapter, and Chapter 3, "SAX", will completely cover your needs. However, for those power users out there working in XML day in and day out, this chapter covers some of the finer points of SAX that you'll need.

I'll start with a look at setting parser properties and features, and discuss configuring your parser to do whatever you need it to. From there, I'll move on to some more handlers: the EntityResolver and DTDHandler left over from the last chapter. At that point, you should have a comprehensive understanding of the standard SAX 2.0 distribution. However, we'll push on to look at some SAX extensions, beginning with the writers that can be coupled with SAX, as well as some filtering mechanisms. Finally, I'll introduce some new handlers to you, the LexicalHandler and DeclHandler, and show you how they are used. When all is said and done (including another "Gotcha!" section), you should be ready to take on the world with just your parser and the SAX classes. So slip into your shiny spacesuit and grab the flightstick -- ahem. Well, I got carried away with the taking on the world. In any case, let's get down to it.

4.1. Properties and Features

With the wealth of XML-related specifications and technologies emerging from the World Wide Web Consortium (W3C), adding support for any new feature or property of an XML parser has become difficult. Many parser implementations have added proprietary extensions or methods at the cost of code portability. While these software packages may implement the SAX XMLReader interface, the methods for setting document and schema validation, namespace support, and other core features are not standard across parser implementations. To address this, SAX 2.0 defines a standard mechanism for setting important properties and features of a parser that allows the addition of new properties and features as they are accepted by the W3C without the use of proprietary extensions or methods.

4.1.1. Setting Properties and Features

Lucky for you and me, SAX 2.0 includes the methods needed for setting properties and features in the XMLReader interface. This means you have to change little of your existing code to request validation, set the namespace separator, and handle other feature and property requests. The methods used for these purposes are outlined in Table 4-1.

Table 4-1. Property and feature methods

Method	Returns	Parameters	Syntax
`setProperty( )`	`void`	`String propertyID`, `Object value`	`parser.setProperty("[Property URI]", propertyValue);`
`setFeature( )`	`void`	`String featureID`, `boolean state`	`parser.setFeature("[Feature URI]", featureState);`
`getProperty( )`	`Object`	`String propertyID`	`Object propertyValue = parser.getProperty("[Property URI]");`
`getFeature( )`	`boolean`	`String featureID`	`boolean featureState = parser.getFeature("[Feature URI]");`

For these methods, the ID of a specific property or feature is a URI. The core set of features and properties is listed in Appendix B, "SAX 2.0 Features and Properties". Additional documentation on features and properties supported by your vendor's XML parser should also be available. These URIs are similar to namespace URIs; they are only used as associations for particular features. Good parsers ensure that you do not need network access to resolve these features; think of them as simple constants that happen to be in URI form. When you invoke these methods, the parser resolves the URI locally, typically treating it as a constant that identifies which action the parser needs to take.

WARNING: Don't type these property and feature URIs into a browser to "check for their existence." Often, this results in a 404 Not Found error. I've had many readers report this to me, insisting that the URIs are invalid. However, this is not the case; the URI is just an identifier, and as I pointed out, it is usually resolved locally. Trust me: just use the URI, and trust the parser to do the right thing.

In the parser configuration context, a property requires some object value to be usable. For example, for lexical handling, a LexicalHandler implementation would be supplied as the value for the appropriate property. In contrast, a feature is a flag used by the parser to indicate whether a certain type of processing should occur. Common features are validation, namespace support, and including external parameter entities.

The most convenient aspect of these methods is that they allow simple addition and modification of features. Although new or updated features will require a parser implementation to add supporting code, the method by which features and properties are accessed remains standard and simple; only a new URI need be defined. Regardless of the complexity (or obscurity) of new XML-related ideas, this robust set of four methods should be sufficient to allow parsers to implement the new ideas.
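To make the four methods concrete, here is a minimal sketch that sets a feature and a property and then reads each one back. It is not part of the SAXTreeViewer example: the class and helper names are made up for illustration, and obtaining the reader through JAXP's SAXParserFactory is an assumption (any SAX 2.0 XMLReader should behave the same way, although support for individual properties varies by parser).

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; none of these names come from the chapter's examples.
class FeatureAndPropertyRoundTrip {

    static final String VALIDATION =
        "http://xml.org/sax/features/validation";
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    // Assumption: the reader comes from JAXP rather than XMLReaderFactory.
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    // A feature is a boolean flag: set it, then ask the parser what it holds.
    static boolean setAndReadFeature(XMLReader reader, String uri, boolean state)
        throws Exception {
        reader.setFeature(uri, state);
        return reader.getFeature(uri);
    }

    // A property carries an object value: set it, then read it back.
    static Object setAndReadProperty(XMLReader reader, String uri, Object value)
        throws Exception {
        reader.setProperty(uri, value);
        return reader.getProperty(uri);
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();

        System.out.println("validation feature is now: "
            + setAndReadFeature(reader, VALIDATION, true));

        // DefaultHandler2 implements LexicalHandler, so it is a legal value here.
        DefaultHandler2 handler = new DefaultHandler2();
        Object stored = setAndReadProperty(reader, LEXICAL_HANDLER, handler);
        System.out.println("same handler returned: " + (stored == handler));
    }
}
```

Notice that the only things that change from one call to the next are the URI and the value; the method signatures stay the same no matter what is being configured.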

4.1.2. SAX Properties and Features

More often than not, the features and properties you deal with are the standard SAX-defined ones. These are features and properties that should be available with any SAX distribution, and that any SAX-compliant parser should support. Additionally, this preserves vendor-independence in your code, so I recommend that you use SAX-defined properties and features whenever possible.

4.1.2.1. Validation

The most common feature you'll use is the validation feature. The URI for this guy is http://xml.org/sax/features/validation, and not surprisingly, it turns validation on or off in the parser. For example, if you want to turn on validation in the parsing example from the last chapter (remember the Swing viewer?), make this change in the SAXTreeViewer.java source file:

    public void buildTree(DefaultTreeModel treeModel, 
                          DefaultMutableTreeNode base, String xmlURI) 
        throws IOException, SAXException {

        // Create instances needed for parsing
        XMLReader reader = 
            XMLReaderFactory.createXMLReader(vendorParserClass);
        ContentHandler jTreeContentHandler = 
            new JTreeContentHandler(treeModel, base);
        ErrorHandler jTreeErrorHandler = new JTreeErrorHandler( );

        // Register content handler
        reader.setContentHandler(jTreeContentHandler);

        // Register error handler
        reader.setErrorHandler(jTreeErrorHandler);
 
        // Request validation
        reader.setFeature("http://xml.org/sax/features/validation", true);

        // Parse
        InputSource inputSource = 
            new InputSource(xmlURI);
        reader.parse(inputSource);
    }

Compile these changes, and run the example program. Nothing happens, right? Not surprising; the XML we've looked at so far is all valid with respect to the DTD supplied. However, it's easy enough to fix that. Make the following change to your XML file (notice that the element in the DOCTYPE declaration no longer matches the actual root element, since XML is case-sensitive):

<?xml version="1.0"?>
<!DOCTYPE Book SYSTEM "DTD/JavaXML.dtd">

<!-- Java and XML Contents -->
<book xmlns="http://www.oreilly.com/javaxml2"
      xmlns:ora="http://www.oreilly.com"
>

Now run your program on this modified document. Because validation is turned on, you should get an ugly stack trace reporting the error. Of course, because that's all that our error handler methods do, this is precisely what we want:

C:\javaxml2\build>java javaxml2.SAXTreeViewer 
    c:\javaxml2\ch04\xml\contents.xml
**Parsing Error**
  Line:    7
  URI:     file:///c:/javaxml2/ch04/xml/contents.xml
  Message: Document root element "book", must match DOCTYPE root "Book".
org.xml.sax.SAXException: Error encountered
        at javaxml2.JTreeErrorHandler.error(SAXTreeViewer.java:445)
[Nasty Stack Trace to Follow...]

Remember, turning validation on or off does not affect DTD processing; I talked about this in the last chapter, and wanted to remind you of this subtle fact. To get a better sense of this, turn off validation (comment out the feature setting, or supply it the "false" value), and run the program on the modified XML. Even though the DTD is processed, as seen by the resolved OReillyCopyright entity reference, no errors occur. That's the difference between processing a DTD and validating an XML document against that DTD. Memorize, understand, and recite this to yourself; it will save you hours of confusion in the long run.
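If you want to watch this difference without firing up the Swing viewer, the following self-contained sketch may help. It is only an illustration under a few assumptions: the class name, the tiny inline document, and the use of JAXP's SAXParserFactory to obtain the reader are mine, not part of the SAXTreeViewer code. The document declares a DOCTYPE root of Book but uses book as the actual root element, and it defines the entity in an internal DTD subset so that no external files are needed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the chapter's example code.
class ValidationVersusDtdProcessing {

    // The DOCTYPE names "Book", but the root element is "book".
    static final String XML =
        "<?xml version=\"1.0\"?>\n" +
        "<!DOCTYPE Book [\n" +
        "  <!ELEMENT book (#PCDATA)>\n" +
        "  <!ENTITY OReillyCopyright \"Copyright O'Reilly\">\n" +
        "]>\n" +
        "<book>&OReillyCopyright;</book>";

    // Collects character data and counts recoverable (validity) errors.
    static class Recorder extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int validityErrors = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void error(SAXParseException e) {
            validityErrors++;   // count the error instead of aborting the parse
        }
    }

    static Recorder parse(boolean validate) throws Exception {
        // Assumption: the reader is obtained through JAXP.
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setFeature("http://xml.org/sax/features/validation", validate);

        Recorder recorder = new Recorder();
        reader.setContentHandler(recorder);
        reader.setErrorHandler(recorder);
        reader.parse(new InputSource(new StringReader(XML)));
        return recorder;
    }

    public static void main(String[] args) throws Exception {
        for (boolean validate : new boolean[] { false, true }) {
            Recorder r = parse(validate);
            System.out.println("validation=" + validate
                + "  text=\"" + r.text + "\""
                + "  validity errors=" + r.validityErrors);
        }
    }
}
```

In both runs the entity reference is expanded, because the DTD is processed either way; only the run with validation turned on should report the mismatched root element.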

4.1.2.2. Namespaces

Next to validation, you'll most commonly deal with namespaces. There are two features related to namespaces: one that turns namespace processing on or off, and one that indicates whether namespace prefixes should be reported as attributes. The two are essentially tied together, and you should always "toggle" both, as shown in Table 4-2.

Table 4-2. Toggle values for namespace-related features

Value for namespace processing	Value for namespace prefix reporting
True	False
False	True

This should make sense: if namespace processing is on, the xmlns-style declarations on elements should not be exposed to your application as attributes, as they are only useful for namespace handling. If you do not want namespace processing to occur (or want to handle it on your own), you will want these xmlns declarations reported as attributes so you can use them just as you would use other attributes. However, if these two fall out of sync (both are true, or both are false), you can end up with quite a mess!

Consider writing a small utility method to ensure these two features stay in sync with each other. I often use the method shown here for this purpose:

private void setNamespaceProcessing(XMLReader reader, boolean state) 
    throws SAXNotSupportedException, SAXNotRecognizedException {

    reader.setFeature(
        "http://xml.org/sax/features/namespaces", state);
    reader.setFeature(
        "http://xml.org/sax/features/namespace-prefixes", !state);
}

This maintains the correct setting for both features, and you can now simply call this method instead of two setFeature( ) invocations in your own code. Personally, I've used this feature less than ten times in about two years; the default values (processing namespaces as well as not reporting prefixes as attributes) almost always work for me. Unless you are writing low-level applications that either don't need namespaces or can use the speed increase obtained from not processing namespaces, or you need to handle namespaces on your own, I wouldn't worry too much about either of these features.
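To see what the two settings actually change in your callbacks, here is a small sketch that parses the same one-element document both ways and records what startElement( ) receives. The class name, the inline document, and the JAXP-based reader are assumptions for illustration only; the two setFeature( ) calls mirror the utility method above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the chapter's example code.
class NamespaceToggleDemo {

    static final String XML =
        "<ora:book xmlns:ora=\"http://www.oreilly.com\"/>";

    // Remembers what startElement() was told about the root element.
    static class RootInfo extends DefaultHandler {
        String uri;
        String qName;
        int attributeCount;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            this.uri = uri;
            this.qName = qName;
            this.attributeCount = atts.getLength();
        }
    }

    static RootInfo parse(boolean namespaceProcessing) throws Exception {
        // Assumption: the reader is obtained through JAXP.
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // Keep the two features in opposite states, as in Table 4-2.
        reader.setFeature(
            "http://xml.org/sax/features/namespaces", namespaceProcessing);
        reader.setFeature(
            "http://xml.org/sax/features/namespace-prefixes", !namespaceProcessing);

        RootInfo info = new RootInfo();
        reader.setContentHandler(info);
        reader.parse(new InputSource(new StringReader(XML)));
        return info;
    }

    public static void main(String[] args) throws Exception {
        for (boolean on : new boolean[] { true, false }) {
            RootInfo info = parse(on);
            System.out.println("namespaces=" + on
                + "  uri=\"" + info.uri + "\""
                + "  qName=\"" + info.qName + "\""
                + "  attributes=" + info.attributeCount);
        }
    }
}
```

With namespace processing on, the element should arrive with its namespace URI filled in and no attributes; with it off, the xmlns:ora declaration should show up as an ordinary attribute for you to deal with yourself.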

This code brings up a rather important aspect of features and properties, though: invoking the feature and property methods can result in SAXNotSupportedExceptions and SAXNotRecognizedExceptions. These are both in the org.xml.sax package, and need to be imported in any SAX code that uses them. The first indicates that the parser knows about the feature or property but doesn't support it. You won't run into this much in even average quality parsers, but it is commonly used when a standard property or feature is not yet coded in. So invoking setFeature( ) on the namespace processing feature on a parser in development might result in a SAXNotSupportedException. The parser recognizes the feature, but doesn't have the ability to perform the requested processing. The second exception most commonly occurs when using vendor-specific features and properties (covered in the next section), and then switching parser implementations. The new implementation won't know anything about the other vendor's features or properties, and will throw a SAXNotRecognizedException.

You should always explicitly catch these exceptions so you can deal with them. Otherwise, you end up losing valuable information about what happened in your code. For example, let me show you a modified version of the code from the last chapter that tries to set up various features, and how that changes the exception-handling architecture:

    public void buildTree(DefaultTreeModel treeModel, 
                          DefaultMutableTreeNode base, String xmlURI) 
        throws IOException, SAXException {
            
        String featureURI = "";

        try {
            // Create instances needed for parsing
            XMLReader reader = 
                XMLReaderFactory.createXMLReader(vendorParserClass);
            ContentHandler jTreeContentHandler = 
                new JTreeContentHandler(treeModel, base);
            ErrorHandler jTreeErrorHandler = new JTreeErrorHandler( );

            // Register content handler
            reader.setContentHandler(jTreeContentHandler);

            // Register error handler
            reader.setErrorHandler(jTreeErrorHandler);
            
            /** Deal with features **/
            featureURI = "http://xml.org/sax/features/validation";

            // Request validation
            reader.setFeature(featureURI, true);
            
            // Namespace processing - on
            featureURI = "http://xml.org/sax/features/namespaces";
            setNamespaceProcessing(reader, true);
            
            // Turn on String interning
            featureURI = "http://xml.org/sax/features/string-interning";
            reader.setFeature(featureURI, true);
            
            // Turn off schema processing
            featureURI = 
                "http://apache.org/xml/features/validation/schema";
            reader.setFeature(featureURI, false);

            // Parse
            InputSource inputSource = 
                new InputSource(xmlURI);
            reader.parse(inputSource);
        } catch (SAXNotRecognizedException e) {
            System.out.println("The parser class " + vendorParserClass +
                " does not recognize the feature URI " + featureURI);
            // Exit with a non-zero status so callers can tell this was a failure
            System.exit(1);
        } catch (SAXNotSupportedException e) {
            System.out.println("The parser class " + vendorParserClass +
                " does not support the feature URI " + featureURI);
            // Exit with a non-zero status so callers can tell this was a failure
            System.exit(1);
        }
    }

By dealing with these exceptions as well as other special cases, you give the user better information and improve the quality of your code.
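If exiting the program is too drastic for your application, one alternative is to wrap the call in a small helper that reports what happened and lets the caller decide. The following is a sketch of that idea, not the approach used in SAXTreeViewer: the class, the enum, and the example.com feature URI are all hypothetical, and the reader is obtained through JAXP as an assumption.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper class, not part of the chapter's example code.
class FeatureProbe {

    enum Outcome { APPLIED, NOT_RECOGNIZED, NOT_SUPPORTED }

    // Tries to set a feature and reports the result instead of throwing.
    static Outcome trySetFeature(XMLReader reader, String uri, boolean state) {
        try {
            reader.setFeature(uri, state);
            return Outcome.APPLIED;
        } catch (SAXNotRecognizedException e) {
            // The parser has never heard of this URI.
            return Outcome.NOT_RECOGNIZED;
        } catch (SAXNotSupportedException e) {
            // The parser knows the URI but cannot honor the request.
            return Outcome.NOT_SUPPORTED;
        }
    }

    // Assumption: the reader is obtained through JAXP.
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        String[] uris = {
            "http://xml.org/sax/features/validation",
            // Made-up URI that no parser should recognize
            "http://example.com/features/no-such-feature"
        };
        for (String uri : uris) {
            System.out.println(uri + " -> " + trySetFeature(reader, uri, true));
        }
    }
}
```

A caller can then treat a missing vendor-specific feature as a warning while still treating a missing core feature, such as validation, as a hard failure.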

4.1.2.3. Interning and entities

The three remaining SAX-defined features are fairly obscure. The first, http://xml.org/sax/features/string-interning, turns string interning on or off. By default this is false (off) in most parsers. Setting it to true means that every element name, attribute name, namespace URI and prefix, and other strings have java.lang.String.intern() invoked on them. I'm not going to get into great detail about interning here; if you don't know what it is, check out Sun's Javadoc on the method at http://java.sun.com/j2se/1.3/docs/api/index.html. In a nutshell, every time a string is encountered, Java attempts to return an existing reference for the string in the current string pool, instead of (possibly) creating a new String object. Sounds like a good thing, right? Well, the reason it's off by default is most parsers have their own optimizations in place that can outperform string interning. My advice is to leave this setting alone; many people have spent weeks tuning things like this so you don't have to mess with them.
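If interning is new to you, this tiny sketch shows the behavior in plain Java, with no parser involved. The class and method names are made up for illustration; the only API it relies on is java.lang.String.intern().

```java
// Hypothetical demo class illustrating String.intern() on its own.
class InternDemo {

    // Compares references, not contents.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "chapter";             // literals live in the string pool
        String built = new String("chapter");   // a distinct object with equal contents

        // Equal contents, but two different objects
        System.out.println(sameReference(built, literal));          // prints false

        // intern() returns the pooled reference for these contents
        System.out.println(sameReference(built.intern(), literal)); // prints true
    }
}
```

A parser that interns its names lets handler code compare them by reference in this way, which is the optimization the feature is meant to enable.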

The other two features determine whether external general (text) entities are included (http://xml.org/sax/features/external-general-entities), and whether external parameter entities, including the external DTD subset, are included (http://xml.org/sax/features/external-parameter-entities) when parsing occurs. These are set to true for most parsers, as they deal with all the entities that XML has to offer. Again, I recommend you leave these settings as is, unless you have a specific reason for disabling entity handling.

4.1.2.4. DOM nodes and literal strings

The two standard SAX properties are a little less clear in their usage. In both cases, the properties are more useful for obtaining values, whereas with features the common use is to set values. Additionally, both properties are more helpful in error handling than in any general usage. And finally, both properties provide access to what is being parsed at a given time. The first, identified by the URI http://xml.org/sax/properties/dom-node, returns the current DOM node being processed, or the root DOM node if parsing isn't occurring. Of course, I haven't really talked about DOM yet, but this will make more sense in the next two chapters. The second property, identified by the URI http://xml.org/sax/properties/xml-string, returns the literal string of characters being processed. You'll find varying support for these properties in various parsers, showing that many parser implementers find these properties of arguable use as well. For example, Xerces does not support the xml-string property, to avoid having to buffer the input document (at least in that specific way). On the other hand, it does support the dom-node property so that you can turn a SAX parser into (essentially) a DOM tree iterator.
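Because support varies so much, it can be worth asking your own parser what it does with these two properties before relying on them. Here is a rough sketch of such a check; the class and method names are hypothetical, the reader is obtained through JAXP as an assumption, and the answers you get depend entirely on the parser you run it against.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical demo class, not part of the chapter's example code.
class StandardPropertyProbe {

    static final String DOM_NODE =
        "http://xml.org/sax/properties/dom-node";
    static final String XML_STRING =
        "http://xml.org/sax/properties/xml-string";

    // Asks the parser for a property and describes how it responded.
    static String describe(XMLReader reader, String propertyURI) {
        try {
            Object value = reader.getProperty(propertyURI);
            return "readable, current value: " + value;
        } catch (SAXNotRecognizedException e) {
            return "not recognized by this parser";
        } catch (SAXNotSupportedException e) {
            return "recognized, but not supported (at least not right now)";
        }
    }

    // Assumption: the reader is obtained through JAXP.
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        System.out.println(DOM_NODE + ": " + describe(reader, DOM_NODE));
        System.out.println(XML_STRING + ": " + describe(reader, XML_STRING));
    }
}
```

Keep in mind that this asks outside of a parse; a parser may answer differently when the same property is requested from within a callback.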

4.1.3. Proprietary Properties and Features

In addition to the standard, SAX-defined features and properties, most parsers define several features and properties of their own. For example, Apache Xerces has a page of features it supports at http://xml.apache.org/xerces-j/features.html, and properties it supports at http://xml.apache.org/xerces-j/properties.html. I'm not going to cover these in great detail, and you should steer clear of them whenever possible; they lock your code into a specific vendor. However, there are times when using a vendor's specific functionality will save you some work. In those cases, exercise caution, but don't be foolish; use what your parser gives you!

As an example, take the Xerces feature that enables and disables XML schema processing: http://apache.org/xml/features/validation/schema. Because there is no standard support for XML schemas across parsers or in SAX, you need this vendor-specific feature to control that processing. It's set to true by default, so setting it to false avoids spending parsing time on any XML schemas referenced in your documents, which saves time in production if you don't use schema processing. Check out your vendor documentation for options available in addition to SAX's.
