Perl & XML

5.2. DTD Handlers

XML::Parser::PerlSAX supports another group of handlers used to process DTD events. These handlers take care of anything that appears before the root element, such as the XML declaration, the doctype declaration, and the internal subset of entity and element declarations, which are collectively called the document prolog. If you want to output the document literally as you read it (e.g., in a filter program), you need to define some of these handlers to reproduce the document prolog. That is exactly what the previous example was missing.

You can use these handlers for other purposes. For example, you may need to pre-load entity definitions for special processing rather than rely on the parser to do its default substitution for you. These handlers are listed in Table 5-2.
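As a rough sketch of the pre-loading idea, the handler below stores internal entity values in a hash instead of writing them out, and a helper later expands references from that hash. The names %entity_table and expand_entities( ) are hypothetical, chosen only for illustration, and the event is simulated by calling the handler directly with a hand-built property hash rather than by running the parser.

```perl
use strict;
use warnings;

my %entity_table;    # hypothetical store for pre-loaded entity values

# PerlSAX-style handler: remember internal entities instead of printing them
sub entity_decl {
    my( $self, $properties ) = @_;
    $entity_table{ $properties->{'Name'} } = $properties->{'Value'}
        if defined $properties->{'Value'};
}

# hypothetical helper: expand known entity references, leave others alone
sub expand_entities {
    my( $text ) = @_;
    $text =~ s/&([\w.-]+);/exists $entity_table{$1} ? $entity_table{$1} : "&$1;"/ge;
    return $text;
}

# simulate the event the parser would send for <!ENTITY thingy "...">
entity_decl( undef, { Name => 'thingy', Value => 'hoo hah blah blah' } );
print expand_entities( 'I\'d have to say, "&thingy;" &amp; more.' ), "\n";
```

References to entities that were never declared, such as &amp; here, pass through untouched, so the helper is safe to run on text that mixes both kinds.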

Table 5-2. PerlSAX DTD handlers

Method name           Event                                                                                Properties
entity_decl           The parser sees an entity declaration (internal or external, parsed or unparsed).    Name, Value, PublicId, SystemId, Notation
notation_decl         The parser found a notation declaration.                                             Name, PublicId, SystemId, Base
unparsed_entity_decl  The parser found a declaration for an unparsed entity (e.g., a binary data entity).  Name, PublicId, SystemId, Base
element_decl          An element declaration was found.                                                    Name, Model
attlist_decl          An element's attribute list declaration was encountered.                             ElementName, AttributeName, Type, Fixed
doctype_decl          The parser found the document type declaration.                                      Name, SystemId, PublicId, Internal
xml_decl              The XML declaration was encountered.                                                 Version, Encoding, Standalone

The entity_decl( ) handler is called for all kinds of entity declarations unless a more specific handler is defined. Thus, unparsed entity declarations trigger the entity_decl( ) handler unless you've defined an unparsed_entity_decl( ), which will take precedence.
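To make the precedence rule concrete, here is a small sketch. The two handler classes and the dispatch_unparsed( ) function are hypothetical; the function is only a stand-in for the rule described above, not the parser's actual dispatch code.

```perl
use strict;
use warnings;

{
    package GeneralOnly;    # hypothetical: defines only the general handler
    sub new { return bless {}, shift }
    sub entity_decl { return "entity_decl saw $_[1]{Name}" }
}
{
    package WithUnparsed;   # hypothetical: adds the more specific handler
    our @ISA = ('GeneralOnly');
    sub unparsed_entity_decl { return "unparsed_entity_decl saw $_[1]{Name}" }
}

# stand-in for the rule in the text, not the parser's real dispatch code:
# prefer the specific handler when the object has one
sub dispatch_unparsed {
    my( $handler, $properties ) = @_;
    my $method = $handler->can('unparsed_entity_decl') ? 'unparsed_entity_decl'
                                                       : 'entity_decl';
    return $handler->$method( $properties );
}

my $logo = { Name => 'logo', SystemId => 'logo.gif' };
print dispatch_unparsed( GeneralOnly->new,  $logo ), "\n";
print dispatch_unparsed( WithUnparsed->new, $logo ), "\n";
```

The first call falls back to entity_decl( ) because nothing more specific exists; the second reaches unparsed_entity_decl( ) instead.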

The parameters passed to entity_decl( ) vary depending on the entity type. The Value parameter is set for internal entities, but not external ones. Likewise, PublicId and SystemId, parameters that tell an XML processor where to find the file containing the entity's value, are not set for internal entities, only external ones. Base, which Table 5-2 lists for notation and unparsed entity declarations, tells the processor what to use for a base URL if the SystemId contains a relative location.
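Because the properties differ by entity type, a handler can tell the types apart just by checking which ones are defined. The following sketch shows one way to do that; entity_kind( ) is a hypothetical helper, and the sample declarations are made up for illustration.

```perl
use strict;
use warnings;

# hypothetical helper: classify a declaration from the properties it carries
sub entity_kind {
    my( $properties ) = @_;
    return 'internal'         if defined $properties->{'Value'};
    return 'external, public' if defined $properties->{'PublicId'};
    return 'external, system' if defined $properties->{'SystemId'};
    return 'unknown';
}

# hand-built property hashes standing in for parser events
my @declarations = (
    { Name => 'thingy', Value => 'hoo hah blah blah' },
    { Name => 'chap1',  SystemId => 'chap1.xml' },
    { Name => 'legal',  PublicId => '-//Example//TEXT Legal//EN',
                        SystemId => 'legal.xml' },
);
printf "%-8s %s\n", $_->{'Name'}, entity_kind( $_ ) for @declarations;
```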

Notation declarations are a special feature of DTDs that allow you to assign a special type identifier to an entity. For example, you could declare an entity to be of type "date" to tell the XML processor that the entity should be treated as that kind of data. Notations are not used very often in XML, so we won't go into them further.

The Model property passed to the element_decl( ) handler contains the content model, or grammar, for an element. This property describes what is allowed to go inside an element according to the DTD.

An attribute list declaration in a DTD can contain more than one attribute description. Fortunately, the parser breaks these descriptions up into individual calls to the attlist_decl( ) handler for each attribute.
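The sketch below shows how these per-declaration calls might be gathered into a summary of the DTD. The DTDSummary class is hypothetical, the events are simulated with hand-built hashes, and the code assumes that Model is (or stringifies to) the content model text.

```perl
use strict;
use warnings;

{
    package DTDSummary;    # hypothetical handler class for illustration

    sub new { return bless { model => {}, attrs => {} }, shift }

    # one call per element declaration
    sub element_decl {
        my( $self, $properties ) = @_;
        # assumes Model is, or stringifies to, the content model text
        $self->{model}{ $properties->{'Name'} } = "$properties->{'Model'}";
    }

    # one call per attribute, even when a single ATTLIST declares several
    sub attlist_decl {
        my( $self, $properties ) = @_;
        push @{ $self->{attrs}{ $properties->{'ElementName'} } },
            "$properties->{'AttributeName'} ($properties->{'Type'})";
    }
}

# simulate the events for:
#   <!ELEMENT chapter (title,para+)>
#   <!ATTLIST chapter id ID #REQUIRED  status CDATA #IMPLIED>
my $summary = DTDSummary->new;
$summary->element_decl( { Name => 'chapter', Model => '(title,para+)' } );
$summary->attlist_decl( { ElementName => 'chapter', AttributeName => 'id',
                          Type => 'ID' } );
$summary->attlist_decl( { ElementName => 'chapter', AttributeName => 'status',
                          Type => 'CDATA' } );

for my $element ( sort keys %{ $summary->{model} } ) {
    print "$element $summary->{model}{$element}\n";
    print "  \@$_\n" for @{ $summary->{attrs}{$element} || [] };
}
```

Note that the single ATTLIST declaration in the comment produces two separate attlist_decl( ) calls, one for each attribute.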

The document type declaration is an optional part of the document at the top, just under the XML declaration. The parameter Name is the name of the root element in your document. PublicId and SystemId tell the processor where to find the external DTD. Finally, the Internal parameter contains the whole internal subset as a string, in case you want to skip the individual entity and element declaration handling.
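If you would rather copy the internal subset wholesale, a handler along the following lines might do. This is only a sketch: it assumes, as described above, that Internal carries the text of the subset without the surrounding brackets. Parser versions may differ on this point (some may pass only a true or false flag), so check what yours delivers. The output( ) routine here is a stand-in that collects text in a buffer.

```perl
use strict;
use warnings;

my $buffer = '';
sub output { $buffer .= join '', @_ }    # stand-in for the filter's output routine

# variant handler: emit the internal subset wholesale from 'Internal'
sub doctype_decl {
    my( $self, $properties ) = @_;
    my $decl = '<!DOCTYPE ' . $properties->{'Name'};
    if( my $pubid = $properties->{'PublicId'} ) {
        $decl .= qq{ PUBLIC "$pubid" "$properties->{'SystemId'}"};
    } elsif( my $sysid = $properties->{'SystemId'} ) {
        $decl .= qq{ SYSTEM "$sysid"};
    }
    # assumption: Internal holds only the text between "[" and "]";
    # a parser that passes a plain flag here would need different handling
    if( my $internal = $properties->{'Internal'} ) {
        $decl .= " [\n$internal\n]";
    }
    output( "$decl>\n" );
}

# simulate the event for a doctype with an external DTD and one entity
doctype_decl( undef, {
    Name     => 'book',
    SystemId => '/usr/local/prod/sgml/db.dtd',
    Internal => '<!ENTITY thingy "hoo hah blah blah">',
} );
print $buffer;
```

With this approach, the individual entity and element declaration handlers must not also write their declarations, or the subset would appear twice.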

As an example, let's say you wanted to extend the filter example code to output the document prolog exactly as it was encountered by the parser. You'd need to define handlers like those in Example 5-4.

Example 5-4. A better filter

#
# handle xml declaration
#
sub xml_decl {
    my( $self, $properties ) = @_;
    output( "<?xml version=\"" . $properties->{'Version'} . "\"" );
    my $encoding = $properties->{'Encoding'};
    output( " encoding=\"$encoding\"" ) if( $encoding );
    my $standalone = $properties->{'Standalone'};
    output( " standalone=\"$standalone\"" ) if( $standalone );
    output( "?>\n" );
}

#
# handle doctype declaration:
# try to duplicate the original
#
sub doctype_decl {
    my( $self, $properties ) = @_;
    output( "\n<!DOCTYPE " . $properties->{'Name'} . "\n" );
    my $pubid = $properties->{'PublicId'};
    my $sysid = $properties->{'SystemId'};
    if( $pubid ) {
        output( "  PUBLIC \"$pubid\"\n" );
        output( "  \"$sysid\"\n" );
    } elsif( $sysid ) {
        # only emit SYSTEM when there really is an external subset
        output( "  SYSTEM \"$sysid\"\n" );
    }
    my $intset = $properties->{'Internal'};
    if( $intset ) {
        # note that the internal subset is open; the code that later
        # closes it with "]>" is not part of this example
        $in_intset = 1;
        output( "[\n" );
    } else {
        output( ">\n" );
    }
}

#
# handle entity declaration in internal subset:
# recreate the original declaration as it was
#
sub entity_decl {
    my( $self, $properties ) = @_;
    my $name = $properties->{'Name'};
    output( "<!ENTITY $name " );
    my $pubid = $properties->{'PublicId'};
    my $sysid = $properties->{'SystemId'};
    if( $pubid ) {
        output( "PUBLIC \"$pubid\" \"$sysid\"" );
    } elsif( $sysid ) {
        output( "SYSTEM \"$sysid\"" );
    } else {
        # internal entity: the replacement text is in Value
        output( "\"" . $properties->{'Value'} . "\"" );
    }
    # unparsed entities also carry a notation name
    my $notation = $properties->{'Notation'};
    output( " NDATA $notation" ) if( $notation );
    output( ">\n" );
}

Now let's see how the output from our filter looks. The result is in Example 5-5.

Example 5-5. Output from the filter

<?xml version="1.0"?>

<!DOCTYPE book
  SYSTEM "/usr/local/prod/sgml/db.dtd"
[
<!ENTITY thingy "hoo hah blah blah">
]>
<book id="mybook">

  <title>GRXL in a Nutshell</title>
  <chapter id="intro">
    <title>What is GRXL?</title>
<comment> need a better title </comment>
    <para>
Yet another acronym.  That was our attitude at first, but then we saw 
the amazing uses of this new technology called
<literal>GRXL</literal>.  Consider the following program:
    </para>

    <programlisting>AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
  print!  <lineannotation>wow</lineannotation>
or not!</programlisting>
<comment> what font should we use? </comment>
    <para>
What does it do?  Who cares?  It's just lovely to look at.  In fact,
I'd have to say, "&thingy;".
    </para>

  </chapter>
</book>

That's much better. Now we have a complete filter program. The basic handlers take care of elements and everything inside them. The DTD handlers deal with whatever happens outside of the root element.




Copyright © 2002 O'Reilly & Associates. All rights reserved.