
4.3. Exposing DTD Information

SAX2 exposes DTD information through three different interfaces. Part of it is exposed through the LexicalHandler extension interface: the DTD's root element type declaration and boundaries of the various entities. The rest is exposed through two DTD-specific interfaces, presented here.

When you're working with streams of SAX event data, remember that all DTD event data is seen before the document data it describes. This means that if you need it inside the document, you'll need to plan ahead to save the DTD data. It also means that if you need to merge streams of event data, such DTD data may create a problem. Unless you know the DTD data in advance, you'd need to dam up the event stream until all data that needs to go into downstream DTD events is in hand. Only then can you send the events downstream (with the DTD first). Luckily, merging event streams with unknown DTD data isn't common.

DTD information is automatically used inside XML parsers when they parse XML documents. That includes expanding conditional sections and parameter entities in DTDs, expanding general entities, and normalizing or defaulting attributes. Most DTD validation can be cleanly layered on top of SAX2 since these declaration callbacks provide all the most important information.[20] SAX2 enables application-level processing of DTD constraints; the only internal support it provides for DTDs is a feature flag to expose parser support for validation. When applications need to construct valid documents, they can use DTD information as they make changes, instead of needing to save the document and reparse the whole thing.

[20]The exceptions relate to lexical constraints that should arguably be well-formedness constraints. Entity nesting is supposed to match nesting of grammatical constructs within DTDs; that's a validity constraint. However, the analogous constraint in a document body affects well-formedness instead.

The support for working with DTDs provided by most XML tools is not as good as the support provided by SAX2. For example, DOM Level 2 provides weaker support, and the TRAX support for SAX (javax.xml.transform.sax) doesn't support DeclHandler at all.

Note that while a fully featured SAX2 parser will let you re-create the internal subset, it will not let you round-trip any external parameter entities. That's because parameter entities will be expanded. You will not see conditional sections in external PEs, or declarations being built up from parameter entities. Instead, you'll see the actual declarations that apply to your documents. This may help you to understand exactly what a complex DTD is doing.

4.3.1. The DeclHandler Interface

This extension interface is new in SAX2. It's in the org.xml.sax.ext package, which means among other things that it is optional and not all SAX APIs support it. (DefaultHandler is one example of an API that does not.) However, any SAX2 parser that can be bootstrapped with JAXP must support this interface. There is no setDeclHandler() method; bind these handlers to parsers like this:

XMLReader	producer = ...;
DeclHandler	handler = ...;

producer.setProperty ("http://xml.org/sax/properties/declaration-handler",
	handler);
// throws SAXNotSupportedException if parameter isn't a DeclHandler.
// throws SAXNotRecognizedException if parser doesn't support it.
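
The binding step can be wrapped so that an unsupported property is reported as a simple boolean instead of an exception. The following is a minimal sketch, assuming a parser bootstrapped through JAXP; the class name DeclBinding and its two methods are hypothetical helpers, not part of SAX2.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;

// Hypothetical helper class; only the property ID and the exceptions come from SAX2.
final class DeclBinding {
    static final String DECL_HANDLER =
        "http://xml.org/sax/properties/declaration-handler";

    /** Bootstraps a parser through JAXP; checked exceptions are wrapped to keep the sketch short. */
    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no JAXP SAX parser available", e);
        }
    }

    /** Returns true if the handler was bound, false if the parser rejects the property. */
    static boolean bind(XMLReader producer, DeclHandler handler) {
        try {
            producer.setProperty(DECL_HANDLER, handler);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }
}
```

An application that can work without declarations might call bind() and fall back to reduced functionality when it returns false.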

Parsers that support DeclHandler are essential for applications that need to work with declarations of elements and attributes or with parsed entities. DOM requires such support for parsed entities, although even Level 2 hides or ignores element and attribute type data. This interface is the most common way SAX2 exposes type constraints (the primary role of a Document Type Declaration) from DTDs, so if you need to see those constraints, you'll use this handler. It has four API callbacks:

void attributeDecl(eName,aName,type,mode,value)

This callback reports <!ATTLIST ... > declarations in a DTD. A given declaration produces one callback for each attribute in the declaration. Much of this information will also be provided through Attributes methods if an instance of that element appears in a document.

String eName

This is the name of the element whose attribute is being declared.

String aName

This is the name of the attribute associated with that element.

String type

This is one of the strings CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, or ENTITIES, or two types of enumerated values. Enumerated values are encoded with parenthesized strings such as (a|b|c) to indicate that strings a, b, or c are permissible. If the string is an enumeration of notation names, "NOTATION " (which includes one space) precedes that parenthesized string.

This type information is more complete than the information you get through the Attributes object provided with startElement(), because Attributes reports enumerations only as either NOTATION or NMTOKEN. However, at this time several widely available SAX2 parsers conform to a beta test version of this API and don't correctly report enumerations. You may need to get a bug-fixed version of your parser if you're depending on this support.

String mode

This describes the kind of default value applied to this attribute: #IMPLIED (the application determines the value), #REQUIRED (the value must be given; defaulting is not permitted), #FIXED (only one value is permitted), or null, indicating that the value parameter holds an ordinary default.

Unless the document provided a value, you won't see #IMPLIED attributes in the Attributes object provided with startElement(); if you need to know this information, save it when you get this callback.

String value

This parameter is either null or a string with the default value for this attribute. That might be the only permitted value if the attribute mode is #FIXED. The value will be reported exactly as applications will see it: normalized and with character and entity references replaced.

XML structure editors can use this information to constrain the choices presented to document authors so that only valid documents can be created. Other tools that construct documents will also benefit from having this information. When you're mostly reading documents rather than creating them, the most important data here tends to be declaration of ID, IDREF, and IDREFS attributes, which are used to build links within and between XML documents.
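
An editor that wants to offer a menu of permitted values has to take apart the type string described above. The following is a rough sketch of that step, assuming the parser reports enumerations in the (a|b|c) and "NOTATION (a|b)" forms; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper for interpreting the "type" argument of attributeDecl().
final class AttributeTypes {
    /** True if the type string is an enumeration of notation names. */
    static boolean isNotationEnumeration(String type) {
        return type.startsWith("NOTATION ");
    }

    /**
     * Returns the permitted values of an enumerated type, or an empty list
     * for CDATA, ID, and the other non-enumerated types.
     */
    static List<String> permittedValues(String type) {
        String t = type.trim();
        if (t.startsWith("NOTATION")) {
            t = t.substring("NOTATION".length()).trim();
        }
        if (!t.startsWith("(") || !t.endsWith(")")) {
            return Collections.emptyList();
        }
        List<String> values = new ArrayList<>();
        // Split the text between the parentheses on "|" and any stray whitespace.
        StringTokenizer tokens =
            new StringTokenizer(t.substring(1, t.length() - 1), "| \t\r\n");
        while (tokens.hasMoreTokens()) {
            values.add(tokens.nextToken());
        }
        return values;
    }
}
```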

If more than one declaration for an attribute is provided, only the first one will be used. (The second one will be ignored; unlike the analogous case for element declarations, attribute redeclaration is not a validity error.) Normally code to implement this callback would first retrieve any existing per-element data structure, or it would create one (with a null content model) if none is yet known. Then if there is no record of an attribute with this name for that element, a per-attribute data structure instance would be created and saved in the element data structure, keyed by attribute name.
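
The bookkeeping just described might look like the following sketch. It assumes the DefaultHandler2 convenience class from later SAX2 extension releases (bundled with current JDKs) so that only one callback needs overriding; AttributeCatalog and AttrDecl are hypothetical names, and the per-element structure here holds attributes only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical recorder; bind it with the declaration-handler property.
final class AttributeCatalog extends DefaultHandler2 {
    /** One declared attribute, exactly as reported to attributeDecl(). */
    static final class AttrDecl {
        final String type;
        final String mode;   // #IMPLIED, #REQUIRED, #FIXED, or null
        final String value;  // default value, or null

        AttrDecl(String type, String mode, String value) {
            this.type = type;
            this.mode = mode;
            this.value = value;
        }
    }

    // element name -> (attribute name -> first declaration seen)
    final Map<String, Map<String, AttrDecl>> byElement = new LinkedHashMap<>();

    @Override
    public void attributeDecl(String eName, String aName,
                              String type, String mode, String value) {
        byElement
            .computeIfAbsent(eName, k -> new LinkedHashMap<>()) // create per-element record
            .putIfAbsent(aName, new AttrDecl(type, mode, value)); // later redeclarations are ignored
    }
}
```

Because the mode is saved, code handling startElement() can later tell that a missing attribute was declared #IMPLIED.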

void elementDecl(name,model)

This method reports <!ELEMENT ... > declarations in a DTD.

String name

This is the element name.

String model

This is the element content model, with all whitespace removed. For example, element content models like (a,(b|c)+,d?), mixed content models like (#PCDATA|one|two|three)*, and simple models like ANY and EMPTY may all be found in the same document. Note that parsers may do more than just remove the whitespace, as long as an equivalent content model is reported.

Because the content model is provided as a string, applications using it must always parse it themselves. Similarly, if applications want to validate against that model, they must provide code to do that. Except for the case of element content, such work is straightforward. Validating element content models requires constructing and using some sort of finite state automaton, and it takes a bit of work to parse the model. Mixed content models are easier to handle since they can be parsed with a java.util.StringTokenizer and because the validation logic is simpler.
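
For the mixed content case, a sketch along these lines may be enough. It assumes the parser reports the model with no whitespace after the opening parenthesis, as in the example above; MixedContent and its methods are hypothetical names, and element content models are deliberately not handled.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper for mixed content models such as (#PCDATA|one|two|three)*.
final class MixedContent {
    /** True if the model string reported to elementDecl() describes mixed content. */
    static boolean isMixed(String model) {
        return model.startsWith("(#PCDATA");
    }

    /** Returns the element names the mixed content model permits, in declaration order. */
    static Set<String> permittedChildren(String model) {
        if (!isMixed(model)) {
            throw new IllegalArgumentException("not a mixed content model: " + model);
        }
        Set<String> names = new LinkedHashSet<>();
        // Parentheses, "|", "*", and whitespace are all just separators here.
        StringTokenizer tokens = new StringTokenizer(model, "()|* \t\r\n");
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (!token.equals("#PCDATA")) {
                names.add(token);
            }
        }
        return names;
    }

    /** Text is always allowed in mixed content, so only child element names need checking. */
    static boolean allowsChild(String model, String childName) {
        return permittedChildren(model).contains(childName);
    }
}
```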

If more than one declaration for an element is provided, only the first one will be used. (The second one will be considered a validity error; element type redeclaration is not allowed.) Normally the code implementing this callback would create a new per-element data structure to save the name and content model and store it in a data structure (hash table or other map) keyed by element name. Such a data structure might already exist if an attribute of that element was declared before the element itself. In this case, this callback just provides the content model, which was previously unknown.
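
One way to arrange this, shown as a sketch rather than a recommended design: both callbacks share a per-element record, so an <!ATTLIST ...> seen first simply leaves the content model null until the <!ELEMENT ...> arrives. ElementCatalog, ElementInfo, and read() are hypothetical names; DefaultHandler2 is assumed to be available, as it is in current JDKs.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical names throughout; only the callbacks and the property ID come from SAX2.
final class ElementCatalog extends DefaultHandler2 {
    static final class ElementInfo {
        String model;  // stays null until an <!ELEMENT ...> declaration is reported
        final Map<String, String> attributeTypes = new LinkedHashMap<>();
    }

    final Map<String, ElementInfo> elements = new LinkedHashMap<>();

    private ElementInfo info(String name) {
        return elements.computeIfAbsent(name, k -> new ElementInfo());
    }

    @Override
    public void elementDecl(String name, String model) {
        ElementInfo element = info(name);
        if (element.model == null) {
            element.model = model;  // keep only the first declaration
        }
    }

    @Override
    public void attributeDecl(String eName, String aName,
                              String type, String mode, String value) {
        info(eName).attributeTypes.putIfAbsent(aName, type);
    }

    /** Parses a document held in a string and returns the declarations it reported. */
    static ElementCatalog read(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ElementCatalog catalog = new ElementCatalog();
            reader.setProperty("http://xml.org/sax/properties/declaration-handler", catalog);
            reader.parse(new InputSource(new StringReader(xml)));
            return catalog;
        } catch (Exception e) {
            throw new IllegalStateException("could not read declarations", e);
        }
    }
}
```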

void externalEntityDecl(name,publicId,systemId)

This callback reports <!ENTITY ... > declarations in a DTD for parsed external entities. These may be either general or parameter entities.

String name

This is the entity name; it is always provided. Names that start with % are parameter entities; all others are general entities.

String publicId

This is the public ID for the entity and can be omitted (provided as null). If public IDs are provided, any embedded whitespace is normalized, so these strings may be directly compared. They may be used to determine a location for the entity, for example, by using an SGML Formal Public Identifier with some sort of catalog.

String systemId

This is the system ID for the entity and is always provided. It is an absolute URI, which parsers normally use to retrieve the entity before parsing it. However, some SAX2 parsers have a bug, and won't report the absolute URI here.

Applications usually ignore all parameter entity declarations and use the org.xml.sax.EntityResolver when they want to provide local copies of these entities to a parser. If applications don't ignore these declarations, redeclaration should be ignored (it is not an error). XML editors may want to offer menus of external (and internal) entities when editing element content. And in some cases you may want to track external entities by name so that you can tell when LexicalHandler.startEntity() is reporting the start of one; this is useful for applications that use xml:base attributes to change applications' views of the actual URI that contains an element, using the Locator.getSystemId() method. (Perhaps the actual location was not known, or should for some reason be ignored.)
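
Tracking external entities by name could be sketched as follows. The handler has to be bound both as the declaration handler and as the lexical handler; ExternalEntityTracker, its fields, and read() are hypothetical names, and DefaultHandler2 is assumed to be available.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical tracker: remembers declared external entities, then recognizes them in startEntity().
final class ExternalEntityTracker extends DefaultHandler2 {
    final Map<String, String> systemIds = new LinkedHashMap<>(); // entity name -> system ID
    final List<String> started = new ArrayList<>();              // system IDs of entities entered

    static boolean isParameterEntity(String name) {
        return name.startsWith("%");
    }

    @Override
    public void externalEntityDecl(String name, String publicId, String systemId) {
        systemIds.putIfAbsent(name, systemId); // redeclaration is ignored, not an error
    }

    @Override
    public void startEntity(String name) {
        String systemId = systemIds.get(name);
        if (systemId != null) {
            started.add(systemId); // an external parsed entity begins here
        }
    }

    /** Parses a document held in a string with the tracker bound to both extension properties. */
    static ExternalEntityTracker read(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ExternalEntityTracker tracker = new ExternalEntityTracker();
            reader.setProperty("http://xml.org/sax/properties/declaration-handler", tracker);
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", tracker);
            reader.parse(new InputSource(new StringReader(xml)));
            return tracker;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```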

void internalEntityDecl(name,value)

This callback reports <!ENTITY ... > declarations in a DTD for (parsed) internal entities. These may be either general or parameter entities.

String name

This is the entity name. Names that start with % are parameter entities; all others are general entities.

String value

This is the entity value, which contains arbitrary XML content (including elements and nested entity references) that will be reparsed when this entity is expanded.

Applications normally ignore all parameter entity declarations. If applications don't ignore these declarations, redeclaration for a name should be ignored (it is not an error). XML editors may want to offer menus of internal entities when they edit attribute values or element content. However, SAX2 does not report entity references inside the attribute values it parses. This means that you won't be able to re-create such text without heuristics.

4.3.2. The DTDHandler Interface

The DTDHandler interface was carried unchanged from SAX1 into SAX2 and is primarily useful for applications that work with two specific SGML notions: notations and unparsed entities. Some DTDs, such as XML DocBook, use notations in such traditional roles. DOM also requires such support. Use XMLReader.setDTDHandler() to bind this handler to a parser. You probably won't ever need to use it for new code. On the Web, those SGML notions correspond roughly to MIME types and URIs respectively, web concepts that are much more widely understood and supported. The interface has only two API callbacks, provided to meet specific requirements in the XML 1.0 specification:

void notationDecl(name,publicId,systemId)

This callback reports a <!NOTATION ...> declaration in a DTD.

String name

This is the notation name; it is always provided. These names are used explicitly in unparsed entity declarations and in some kinds of attribute declaration (elements can have one such attribute, used to associate a type with the element). Also, some applications follow a convention that they may be used to identify processing instruction targets.

String publicId

This is the public ID for the notation and may be omitted (provided as null). If public IDs are supplied, then any embedded whitespace is normalized, so these strings may be directly compared. These may be used to assign a meaning to the notation, for example, by using an SGML Formal Public Identifier in a role much like a MIME type.

String systemId

This is the system ID for the notation and may be omitted (provided as null). When provided, it is an absolute URI. However, some SAX2 parsers have a bug, and won't report the absolute URI here. These may be used to assign a meaning to the notation, for example, by using a URI to identify a type or command.

In addition to assigning types to unparsed entities, a NOTATION attribute may also associate a type with an element or processing instruction. Some DTDs provide extensive catalogs of notation declarations specifically for such uses.

Note that notation declarations are the one place in XML syntax where you can provide a public ID without a system ID, and that at least one identifier (public or system) must always be provided. If applications don't ignore these declarations, redeclaration should be ignored (it is not an error).

void unparsedEntityDecl(name,publicId,systemId,notation)

This callback reports <!ENTITY ... > declarations with NDATA annotations to associate them with a notation (such as jpeg or png). Unparsed entities are used only in attributes that are declared to be of type ENTITY or ENTITIES.

String name

This is the name of the unparsed entity; it is always provided.

String publicId

This is the public ID for the entity and may be omitted (provided as null). If public IDs are provided, any embedded whitespace is normalized, so these strings may be directly compared. These may be used to assign a location to the entity, for example, by using an SGML Formal Public Identifier in a role much like a URN.

String systemId

This is the system ID for the entity and is always provided. It is normally an absolute URI. However, some SAX2 parsers have a bug, and won't report the absolute URI here. These may be used to assign a location to the entity.

String notation

This is the name of the notation associated with the entity; it is always provided. The role of these names is much like that of an external MIME type annotation for the entity.

In XML, unparsed entities are declared to parsers but pass through them without being parsed. Classic examples of unparsed entities include JPEG or PNG image files. Such entities may also be used for XML text that just doesn't need to be parsed in a given processing stage. If applications don't ignore these declarations, redeclaration should be ignored (it is not an error).

Most XML applications that care about unparsed entities and notations do so because they interface with SGML systems that use them or are migrating such systems to use the XML generation of tools. XML editors supporting this functionality might use these event callbacks to create menus of notations or unparsed entities when they are editing attributes that hold such values.

Applications that use this interface will normally use the callbacks to create two tables, keyed by entity or notation name respectively, that are used to interpret element attributes. More rarely, notations will be used to determine the operation corresponding to a given processing instruction target name. Secure applications will never use notations to directly encode system commands, but will always redirect through application controlled tables. For example, it would be foolish to rely on system IDs found in a document. System IDs such as rm -rf /, when run through a Unix or Linux shell, would remove all files accessible through the local system.
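
The two tables, plus an application-controlled lookup that never consults system IDs from the document, might be sketched like this. UnparsedCatalog, UnparsedEntity, the actions map, and actionFor() are hypothetical names; only the DTDHandler callbacks and setDTDHandler() come from SAX.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical catalog of notations and unparsed entities.
final class UnparsedCatalog implements DTDHandler {
    static final class UnparsedEntity {
        final String publicId;
        final String systemId;
        final String notation;

        UnparsedEntity(String publicId, String systemId, String notation) {
            this.publicId = publicId;
            this.systemId = systemId;
            this.notation = notation;
        }
    }

    // notation name -> public ID if given, else system ID
    final Map<String, String> notationIds = new HashMap<>();
    // unparsed entity name -> declaration
    final Map<String, UnparsedEntity> entities = new HashMap<>();
    // Supplied by the application: notation name -> action. Nothing in it comes from the document.
    private final Map<String, String> actions;

    UnparsedCatalog(Map<String, String> actions) {
        this.actions = actions;
    }

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        notationIds.putIfAbsent(name, publicId != null ? publicId : systemId);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        entities.putIfAbsent(name, new UnparsedEntity(publicId, systemId, notationName));
    }

    /** Maps the value of an ENTITY attribute to an application-chosen action, or null if none applies. */
    String actionFor(String entityName) {
        UnparsedEntity entity = entities.get(entityName);
        return entity == null ? null : actions.get(entity.notation);
    }

    /** Parses a document held in a string with the catalog bound as the DTD handler. */
    static UnparsedCatalog read(String xml, Map<String, String> actions) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            UnparsedCatalog catalog = new UnparsedCatalog(actions);
            reader.setDTDHandler(catalog);
            reader.parse(new InputSource(new StringReader(xml)));
            return catalog;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

A notation with no entry in the application's table yields null, so an unrecognized or hostile declaration has no effect.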




Copyright © 2002 O'Reilly & Associates. All rights reserved.