XML in a Nutshell

Chapter 25. SAX Reference

Contents:

The org.xml.sax Package
The org.xml.sax.helpers Package
SAX Features and Properties
The org.xml.sax.ext Package

SAX, the Simple API for XML, is a straightforward, event-based API used to parse XML documents. David Megginson, SAX's original author, placed SAX in the public domain. SAX is bundled with all parsers that implement the API, including Xerces, MSXML, Crimson, the Oracle XML Parser for Java, and Ælfred. However, you can also get it and the full source code from http://sax.sourceforge.net/.

SAX was originally defined as a Java API and is intended primarily for parsers written in Java, so this chapter will focus on its Java implementation. However, ports to other object-oriented languages, such as C++, Python, Perl, and Eiffel, are common and usually quite similar.

TIP: This chapter covers SAX2 exclusively. In 2002, all major parsers that support SAX support SAX2. The major change from SAX1 to SAX2 was the addition of namespace support. This addition necessitated changing the names and signatures of almost every method and class in SAX. The old SAX1 methods and classes are still available, but they're now deprecated and shouldn't be used.

25.1. The org.xml.sax Package

The org.xml.sax package contains the core interfaces and classes that comprise the Simple API for XML.

The Attributes Interface

An object that implements the Attributes interface represents a list of attributes on a start-tag. The order of attributes in the list is not guaranteed to match the order in the document itself. Attributes objects are passed as arguments to the startElement( ) method of ContentHandler. You can access particular attributes in three ways: by index, by namespace URI and local name, or by qualified (prefixed) name.

This list does not include namespace declaration attributes (xmlns and xmlns:prefix) unless the http://xml.org/sax/features/namespace-prefixes feature is true. It is false by default.

If the namespace-prefixes feature is false, qualified name access may not be available; if the http://xml.org/sax/features/namespaces feature is false, local names and namespace URIs may not be available:

package org.xml.sax;

public interface Attributes {

  public int    getLength( );
  public String getURI(int index);
  public String getLocalName(int index);
  public String getQName(int index);
  public int    getIndex(String uri, String localName);
  public int    getIndex(String qualifiedName);
  public String getType(int index);
  public String getType(String uri, String localName);
  public String getType(String qualifiedName);
  public String getValue(String uri, String localName);
  public String getValue(String qualifiedName);
  public String getValue(int index);

}
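
The three access styles can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the class name AttributeAccessDemo, its fields, and the sample document are hypothetical, and the parser is obtained through JAXP's SAXParserFactory, which is an assumption not made in the text above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records the attributes of the most recent start-tag.
class AttributeAccessDemo extends DefaultHandler {

  static final String XLINK = "http://www.w3.org/1999/xlink";

  final List<String> byIndex = new ArrayList<>();
  String byNamespace;
  String byQualifiedName;

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    byIndex.clear();
    // 1. By index: walk the whole list (document order is not guaranteed).
    for (int i = 0; i < atts.getLength(); i++) {
      byIndex.add(atts.getLocalName(i) + "=" + atts.getValue(i));
    }
    // 2. By namespace URI and local name; null if no such attribute exists.
    byNamespace = atts.getValue(XLINK, "href");
    // 3. By qualified name; may be unavailable under some feature settings.
    byQualifiedName = atts.getValue("xlink:href");
  }

  static AttributeAccessDemo parse(String xml)
      throws ParserConfigurationException, SAXException, IOException {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true); // JAXP factories ignore namespaces by default
    XMLReader reader = factory.newSAXParser().getXMLReader();
    AttributeAccessDemo handler = new AttributeAccessDemo();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(xml)));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    AttributeAccessDemo demo =
        parse("<a xmlns:xlink='" + XLINK + "' xlink:href='x.html'/>");
    System.out.println(demo.byIndex + " " + demo.byNamespace);
  }
}
```

Because the namespace-prefixes feature is left at its default of false here, the xmlns:xlink declaration is not expected to appear in the list, so only the href attribute is reported.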

The ContentHandler Interface

ContentHandler is the key piece of SAX. Almost every SAX program needs to use this interface. ContentHandler is a callback interface. An instance of this interface is passed to the parser via the setContentHandler( ) method of XMLReader. As the parser reads the document, it invokes the methods in its ContentHandler to tell the program what's in the document:

package org.xml.sax;

public interface ContentHandler {

  public void setDocumentLocator(Locator locator);
  public void startDocument( ) throws SAXException;
  public void endDocument( ) throws SAXException;
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException;
  public void endPrefixMapping(String prefix) throws SAXException;
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException;
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException;
  public void characters(char[] text, int start, int length)
   throws SAXException;
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException;
  public void processingInstruction(String target, String data)
   throws SAXException;
  public void skippedEntity(String name) throws SAXException;

}
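
To illustrate the callback flow, here is a small sketch that logs the events it receives. It extends org.xml.sax.helpers.DefaultHandler so that only the callbacks of interest need to be overridden. The class name EventLogHandler and the use of JAXP's SAXParserFactory to obtain the parser are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: logs structural events and accumulates character data.
class EventLogHandler extends DefaultHandler {

  final List<String> events = new ArrayList<>();
  final StringBuilder text = new StringBuilder();

  @Override
  public void startDocument() {
    events.add("startDocument");
  }

  @Override
  public void endDocument() {
    events.add("endDocument");
  }

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    events.add("start:" + localName);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
      String qualifiedName) {
    events.add("end:" + localName);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // A parser may deliver one run of text in several calls, so append.
    text.append(ch, start, length);
  }

  static EventLogHandler parse(String xml)
      throws ParserConfigurationException, SAXException, IOException {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true); // needed for localName to be populated
    XMLReader reader = factory.newSAXParser().getXMLReader();
    EventLogHandler handler = new EventLogHandler();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(xml)));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    EventLogHandler h = parse("<greeting><to>World</to></greeting>");
    System.out.println(h.events);
    System.out.println(h.text);
  }
}
```

Appending in characters( ) rather than assigning matters because the parser is free to split a single text node across multiple calls.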

The DTDHandler Interface

By passing an instance of the DTDHandler interface to the setDTDHandler( ) method of XMLReader, you can receive notification of notation and unparsed entity declarations in the DTD. You can store this information and use it later to retrieve information about the unparsed entities you encounter while reading the document:

package org.xml.sax;

public interface DTDHandler {

  public void notationDecl(String name, String publicID, String systemID)
   throws SAXException;
  public void unparsedEntityDecl(String name, String publicID,
   String systemID, String notationName) throws SAXException;

}
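
One way to store these declarations for later lookup is sketched below. The class name UnparsedEntityCatalog, its map fields, and the sample DTD are hypothetical, and the parser comes from JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical catalog: remembers notations and unparsed entities by name.
class UnparsedEntityCatalog implements DTDHandler {

  // notation name -> system ID (as reported by the parser)
  final Map<String, String> notationSystemIds = new HashMap<>();
  // unparsed entity name -> notation name
  final Map<String, String> entityNotations = new HashMap<>();

  @Override
  public void notationDecl(String name, String publicID, String systemID) {
    notationSystemIds.put(name, systemID);
  }

  @Override
  public void unparsedEntityDecl(String name, String publicID,
      String systemID, String notationName) {
    entityNotations.put(name, notationName);
  }

  static UnparsedEntityCatalog parse(String xml)
      throws ParserConfigurationException, SAXException, IOException {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    UnparsedEntityCatalog catalog = new UnparsedEntityCatalog();
    reader.setDTDHandler(catalog);
    reader.parse(new InputSource(new StringReader(xml)));
    return catalog;
  }

  public static void main(String[] args) throws Exception {
    String xml =
        "<!DOCTYPE doc [\n"
      + "  <!NOTATION gif SYSTEM 'image/gif'>\n"
      + "  <!ENTITY logo SYSTEM 'logo.gif' NDATA gif>\n"
      + "  <!ELEMENT doc EMPTY>\n"
      + "  <!ATTLIST doc img ENTITY #IMPLIED>\n"
      + "]>\n"
      + "<doc img='logo'/>";
    UnparsedEntityCatalog catalog = parse(xml);
    System.out.println(catalog.entityNotations);
  }
}
```

When an ENTITY-typed attribute value such as logo turns up later in the document, the application can look it up in entityNotations to find out which notation applies.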

The EntityResolver Interface

By passing an instance of the EntityResolver interface to the setEntityResolver( ) method of XMLReader, you can intercept parser requests for external entities, such as the external DTD subset or external parameter entities, and redirect those requests in order to substitute different entities. For example, you could replace a reference to a remote copy of a standard DTD with a local one or find the sources for particular public IDs in a catalog. The interface is also useful for applications that use URI types other than URLs:

package org.xml.sax;

import java.io.IOException;

public interface EntityResolver {

  public InputSource resolveEntity(String publicID, String systemID)
   throws SAXException, IOException;

}
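
The redirection described above might look like the following sketch. The public ID, the example.invalid URL, and the class name LocalDtdResolver are all hypothetical, and an in-memory string stands in for the local copy of the DTD so that the example needs no files. The parser is obtained from JAXP's SAXParserFactory, which is likewise an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical resolver: swaps a remote DTD for a local substitute.
class LocalDtdResolver implements EntityResolver {

  static final String PUBLIC_ID = "-//EXAMPLE//DTD Greeting//EN"; // made up
  static final String LOCAL_DTD = "<!ENTITY greeting 'hello'>";   // stand-in

  final List<String> requested = new ArrayList<>();

  @Override
  public InputSource resolveEntity(String publicID, String systemID)
      throws SAXException, IOException {
    requested.add(systemID);
    if (PUBLIC_ID.equals(publicID)) {
      InputSource local = new InputSource(new StringReader(LOCAL_DTD));
      local.setPublicId(publicID);
      local.setSystemId(systemID);
      return local;
    }
    return null; // null tells the parser to resolve the entity itself
  }

  static String parseText(String xml)
      throws ParserConfigurationException, SAXException, IOException {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    reader.setEntityResolver(new LocalDtdResolver());
    StringBuilder text = new StringBuilder();
    reader.setContentHandler(new DefaultHandler() {
      @Override
      public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
      }
    });
    reader.parse(new InputSource(new StringReader(xml)));
    return text.toString();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE doc PUBLIC '" + PUBLIC_ID
        + "' 'http://example.invalid/greeting.dtd'><doc>&greeting;</doc>";
    System.out.println(parseText(xml));
  }
}
```

Returning null for everything the resolver does not recognize keeps the parser's default behavior for all other entities.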

The ErrorHandler Interface

By passing an instance of the ErrorHandler interface to the setErrorHandler( ) method of XMLReader, you can provide custom handling for particular classes of errors detected by the parser. For example, you can choose whether to stop parsing when a validity error is detected. The SAXParseException passed to each of the three methods in this interface provides details about the specific cause and location of the error:

package org.xml.sax;

public interface ErrorHandler {

  public void warning(SAXParseException exception) throws SAXException;
  public void error(SAXParseException exception) throws SAXException;
  public void fatalError(SAXParseException exception)
   throws SAXException;

}

Warnings represent possible problems noticed by the parser that are not technically violations of XML's well-formedness or validity rules. For instance, a parser might issue a warning if an xml:lang attribute's value was not a legal ISO-639 language code. The most common kind of error is a validity problem. The parser should report it, but it should also continue processing. A fatal error violates well-formedness. The parser should not continue parsing after reporting such an error.
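
A handler that follows this policy could be sketched as shown here: it records warnings and validity errors so that parsing continues, and rethrows fatal errors. The class name CollectingErrorHandler and the sample documents are hypothetical, and the validating parser is obtained through JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler: collects recoverable problems, stops on fatal ones.
class CollectingErrorHandler implements ErrorHandler {

  final List<String> warnings = new ArrayList<>();
  final List<String> errors = new ArrayList<>();

  @Override
  public void warning(SAXParseException exception) {
    warnings.add(describe(exception));
  }

  @Override
  public void error(SAXParseException exception) {
    // Validity errors are recorded; returning normally lets parsing go on.
    errors.add(describe(exception));
  }

  @Override
  public void fatalError(SAXParseException exception) throws SAXException {
    // Well-formedness errors cannot be recovered from, so propagate them.
    throw exception;
  }

  private static String describe(SAXParseException e) {
    return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
  }

  static CollectingErrorHandler check(String xml)
      throws ParserConfigurationException, SAXException, IOException {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(true); // request DTD validation
    XMLReader reader = factory.newSAXParser().getXMLReader();
    CollectingErrorHandler handler = new CollectingErrorHandler();
    reader.setErrorHandler(handler);
    reader.parse(new InputSource(new StringReader(xml)));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    String invalid = "<!DOCTYPE doc [<!ELEMENT doc (a)><!ELEMENT a EMPTY>]>"
        + "<doc><b/></doc>";
    check(invalid).errors.forEach(System.out::println);
  }
}
```

The exact wording and number of the recorded messages depend on the parser, so the sketch stores them as plain strings rather than interpreting them.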

The Locator Interface

Unlike most other interfaces in the org.xml.sax package, the Locator interface does not have to be implemented. Instead, the parser has the option to provide an implementation. If it does so, it passes its implementation to the setDocumentLocator( ) method in your ContentHandler instance before it calls startDocument( ). You can save a reference to this object in a field in your ContentHandler class, like this:

private Locator locator;

public void setDocumentLocator(Locator locator) {
  this.locator = locator;
}

Once you've saved the locator, you can use it inside any other ContentHandler method, such as startElement( ) or characters( ), to determine in which document and at which line and column the event took place. For instance, the locator allows you to determine that a particular start-tag began on the third column of the document's seventeenth line at the URL http://www.slashdot.org/slashdot.xml:

package org.xml.sax;

public interface Locator {

  public String getPublicId( );
  public String getSystemId( );
  public int    getLineNumber( );
  public int    getColumnNumber( );

}
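
Putting the saved locator to work might look like this sketch, which records the line reported for each start-tag. The class name LineTrackingHandler and the sample document are hypothetical, and the parser comes from JAXP's SAXParserFactory as an assumption. Column numbers are left out because parsers generally report the position just after the markup that triggered the event, and the details vary between implementations.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: maps each element name to the line of its start-tag.
class LineTrackingHandler extends DefaultHandler {

  private Locator locator; // stays null if the parser supplies no locator
  final Map<String, Integer> lines = new LinkedHashMap<>();

  @Override
  public void setDocumentLocator(Locator locator) {
    this.locator = locator;
  }

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    // -1 mirrors what Locator itself returns when no position is available.
    int line = (locator == null) ? -1 : locator.getLineNumber();
    lines.put(localName, line);
  }

  static LineTrackingHandler parse(String xml)
      throws ParserConfigurationException, SAXException, IOException {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true); // needed for localName to be populated
    XMLReader reader = factory.newSAXParser().getXMLReader();
    LineTrackingHandler handler = new LineTrackingHandler();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(xml)));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(parse("<a>\n  <b/>\n</a>").lines);
  }
}
```

The locator is only meaningful while a callback is running, so the handler reads the line number immediately instead of keeping the Locator's values for later.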

The XMLFilter Interface

An XMLFilter is an XMLReader that obtains its events from another parent XMLReader, rather than reading them from a text source such as an InputStream. Filters can sit between the original source XML and the application and modify data in the original source before passing it to the application. Implementing this interface directly is unusual. It is almost always much easier to use the more complete org.xml.sax.helpers.XMLFilterImpl class instead.

package org.xml.sax;

public interface XMLFilter extends XMLReader {

  public void      setParent(XMLReader parent);
  public XMLReader getParent( );

}
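
As a sketch of the XMLFilterImpl approach, the following filter upper-cases element names before the application sees them. The class name UpperCaseNamesFilter and the sample document are hypothetical, and the parent reader is obtained through JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: rewrites element names on their way to the application.
class UpperCaseNamesFilter extends XMLFilterImpl {

  UpperCaseNamesFilter(XMLReader parent) {
    super(parent);
  }

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) throws SAXException {
    // super forwards the (modified) event to the downstream ContentHandler.
    super.startElement(namespaceURI, upper(localName), upper(qualifiedName), atts);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
      String qualifiedName) throws SAXException {
    super.endElement(namespaceURI, upper(localName), upper(qualifiedName));
  }

  private static String upper(String s) {
    return s.toUpperCase(Locale.ROOT);
  }

  static List<String> elementNames(String xml)
      throws ParserConfigurationException, SAXException, IOException {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader parent = factory.newSAXParser().getXMLReader();

    XMLFilter filter = new UpperCaseNamesFilter(parent);
    List<String> names = new ArrayList<>();
    filter.setContentHandler(new DefaultHandler() {
      @Override
      public void startElement(String namespaceURI, String localName,
          String qualifiedName, Attributes atts) {
        names.add(localName);
      }
    });
    // Parse through the filter, not the parent, so events pass through it.
    filter.parse(new InputSource(new StringReader(xml)));
    return names;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(elementNames("<doc><item/></doc>"));
  }
}
```

The application talks only to the filter. It sets its handlers on the filter and calls the filter's parse( ) method, and the filter in turn drives the parent reader.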

The XMLReader Interface

The XMLReader interface represents the XML parser that reads XML documents. You generally do not implement this interface yourself. Instead, use the org.xml.sax.helpers.XMLReaderFactory class to build a parser-specific implementation. Then use this parser's various setter methods to configure the parsing process. Finally, invoke the parse( ) method to read the document. As it reads, the parser calls back to methods in your own implementations of ContentHandler, ErrorHandler, EntityResolver, and DTDHandler:

package org.xml.sax;

import java.io.IOException;

public interface XMLReader {

  public boolean getFeature(String name)
   throws SAXNotRecognizedException, SAXNotSupportedException;
  public void    setFeature(String name, boolean value)
   throws SAXNotRecognizedException, SAXNotSupportedException;
  public Object  getProperty(String name)
   throws SAXNotRecognizedException, SAXNotSupportedException;
  public void    setProperty(String name, Object value)
   throws SAXNotRecognizedException, SAXNotSupportedException;

  public void           setEntityResolver(EntityResolver resolver);
  public EntityResolver getEntityResolver( );
  public void           setDTDHandler(DTDHandler handler);
  public DTDHandler     getDTDHandler( );
  public void           setContentHandler(ContentHandler handler);
  public ContentHandler getContentHandler( );
  public void           setErrorHandler(ErrorHandler handler);
  public ErrorHandler   getErrorHandler( );

  public void parse(InputSource input) throws IOException, SAXException;
  public void parse(String systemID) throws IOException, SAXException;

}
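
The configure-then-parse sequence can be sketched as follows. Note one substitution: this example obtains the reader from JAXP's SAXParserFactory rather than the XMLReaderFactory named above, because recent JDKs mark XMLReaderFactory as deprecated. The class name ReaderSetupDemo, its helper methods, and the example.invalid feature name are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class showing reader creation, configuration, parsing.
class ReaderSetupDemo {

  static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

  static XMLReader newReader()
      throws ParserConfigurationException, SAXException {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    return factory.newSAXParser().getXMLReader();
  }

  // Returns false instead of failing when the parser rejects the feature.
  static boolean trySetFeature(XMLReader reader, String name, boolean value) {
    try {
      reader.setFeature(name, value);
      return true;
    } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
      return false;
    }
  }

  static int countElements(String xml)
      throws ParserConfigurationException, SAXException, IOException {
    XMLReader reader = newReader();
    int[] count = {0}; // array so the anonymous handler can update it
    reader.setContentHandler(new DefaultHandler() {
      @Override
      public void startElement(String namespaceURI, String localName,
          String qualifiedName, Attributes atts) {
        count[0]++;
      }
    });
    reader.parse(new InputSource(new StringReader(xml)));
    return count[0];
  }

  public static void main(String[] args) throws Exception {
    XMLReader reader = newReader();
    System.out.println("namespaces: " + reader.getFeature(NAMESPACES));
    System.out.println("unknown feature accepted: "
        + trySetFeature(reader, "http://example.invalid/no-such-feature", true));
    System.out.println("elements: " + countElements("<a><b/><b/></a>"));
  }
}
```

Wrapping setFeature( ) this way is one design choice among several. Code that cannot work without a given feature would more reasonably let the exception propagate.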

The InputSource Class

The InputSource class is an abstraction of a data source from which the raw bytes of an XML document are read. It can wrap a system ID, a public ID, an InputStream, or a Reader. When given an InputSource, the parser tries to read from the Reader. If the InputSource does not have a Reader, the parser will try to read from the InputStream using the specified encoding. If no encoding is specified, then it will try to autodetect the encoding by reading the XML declaration. Finally, if neither a Reader nor an InputStream has been set, then the parser will open a connection to the URL given by the system ID.

package org.xml.sax;

import java.io.InputStream;
import java.io.Reader;

public class InputSource {

    public InputSource( );
    public InputSource(String systemID);
    public InputSource(InputStream byteStream);
    public InputSource(Reader reader);

    public void        setPublicId(String publicID);
    public String      getPublicId( );
    public void        setSystemId(String systemID);
    public String      getSystemId( );
    public void        setByteStream(InputStream byteStream);
    public InputStream getByteStream( );
    public void        setEncoding(String encoding);
    public String      getEncoding( );
    public void        setCharacterStream(Reader reader);
    public Reader      getCharacterStream( );

}
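
The lookup order described above can be illustrated with a short sketch. The class name InputSourceDemo, its helper methods, and the tiny sample documents are hypothetical, and the parser is obtained from JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo of how a parser chooses among an InputSource's inputs.
class InputSourceDemo {

  // Both a Reader and a byte stream are set; the Reader should be used.
  static InputSource readerAndStream() {
    InputSource source = new InputSource(new StringReader("<fromReader/>"));
    source.setByteStream(new ByteArrayInputStream(
        "<fromStream/>".getBytes(StandardCharsets.UTF_8)));
    return source;
  }

  // Only a byte stream is set, with its encoding stated explicitly because
  // the document has no XML declaration to autodetect it from.
  static InputSource latin1Bytes() {
    byte[] bytes = "<w>caf\u00e9</w>".getBytes(StandardCharsets.ISO_8859_1);
    InputSource source = new InputSource(new ByteArrayInputStream(bytes));
    source.setEncoding("ISO-8859-1");
    return source;
  }

  // Returns "rootName:text" for a small single-element document.
  static String rootAndText(InputSource source)
      throws ParserConfigurationException, SAXException, IOException {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader reader = factory.newSAXParser().getXMLReader();
    StringBuilder out = new StringBuilder();
    reader.setContentHandler(new DefaultHandler() {
      @Override
      public void startElement(String namespaceURI, String localName,
          String qualifiedName, Attributes atts) {
        out.append(localName).append(':');
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
      }
    });
    reader.parse(source);
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(rootAndText(readerAndStream()));
    System.out.println(rootAndText(latin1Bytes()));
  }
}
```

Setting a system ID as well, even when a stream is supplied, is usually worthwhile because the parser uses it to resolve relative URLs and to report error locations.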

The SAXException Class

Most exceptions thrown by SAX methods are instances of the SAXException class or one of its subclasses. The exceptions to this rule are the parse( ) method of XMLReader and the resolveEntity( ) method of EntityResolver, which may throw a raw IOException if a purely I/O-related error occurs, for example, if a socket is broken before the parser finishes reading the document from the network.

Besides the usual exception methods, such as getMessage( ) and printStackTrace( ), that SAXException inherits from or overrides in its superclasses, SAXException adds a getException( ) method to return the nested exception that caused the SAXException to be thrown in the first place:

package org.xml.sax;

public class SAXException extends Exception {

    public SAXException(String message);
    public SAXException(Exception ex);
    public SAXException(String message, Exception ex);

    public String    getMessage( );
    public Exception getException( );
    public String    toString( );

}
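
The nesting mechanism can be sketched with a handler that wraps a NumberFormatException in a SAXException, and a caller that unwraps it with getException( ). The class name IntegerContentHandler and its rootCause( ) helper are hypothetical, and the parser comes from JAXP's SAXParserFactory as an assumption. The sketch also assumes that each short text run arrives in a single characters( ) call, which parsers do not guarantee.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: insists that all non-blank text is an integer.
class IntegerContentHandler extends DefaultHandler {

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    String text = new String(ch, start, length).trim();
    if (text.isEmpty()) {
      return;
    }
    try {
      Integer.parseInt(text);
    } catch (NumberFormatException ex) {
      // Callbacks may only throw SAXException, so nest the real cause.
      throw new SAXException("Expected an integer but found: " + text, ex);
    }
  }

  // Returns the nested exception that stopped the parse, or null on success.
  static Exception rootCause(String xml)
      throws ParserConfigurationException, IOException {
    try {
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setContentHandler(new IntegerContentHandler());
      reader.parse(new InputSource(new StringReader(xml)));
      return null;
    } catch (SAXException e) {
      return e.getException(); // the exception passed to the constructor
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(rootCause("<n>42</n>"));
    System.out.println(rootCause("<n>abc</n>"));
  }
}
```

Because ContentHandler methods declare only SAXException, nesting is the usual way to carry an application-specific failure out of a callback and back to the code that called parse( ).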


Copyright © 2002 O'Reilly & Associates. All rights reserved.