
23.2 Parsing XML with SAX

In most cases, the best way to extract information from an XML document is to parse the document with a parser compliant with SAX, the Simple API for XML. SAX defines a standard API that can be implemented on top of many different underlying parsers. The SAX approach to parsing is similar to that of the HTML parsers covered in Chapter 22. As the parser encounters XML elements, text contents, and other significant events in the input stream, the parser calls back to methods of your classes. Such event-driven parsing, based on callbacks to your methods as relevant events occur, also resembles the event-driven approach that is almost universal in GUIs and in some networking frameworks. Event-driven approaches may not appear natural to beginners, but they enable high performance and, in particular, high scalability, which makes them well suited to high-workload cases.

To use SAX, you define a content handler class, subclassing a library class and overriding some methods. Then, you build a parser object p, install an instance of your class as p's handler, and feed p the input stream to parse. p calls methods on your handler to reflect the document's structure and contents. Your handler's methods perform application-specific processing. The xml.sax package supplies a factory function to build p, as well as convenience functions for simpler operation in typical cases. xml.sax also supplies exception classes, used to diagnose invalid input and other errors.
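
The steps just described can be sketched as follows. This is a minimal illustration rather than code from the library: the class name ElementCounter, the function count_elements, and the sample document are invented here, and io.BytesIO merely stands in for a real input stream. The sketch is written for Python 3.

```python
import io
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, tag, attrs):
        # called by the parser each time an element begins
        self.counts[tag] = self.counts.get(tag, 0) + 1

def count_elements(stream):
    handler = ElementCounter()
    p = xml.sax.make_parser()      # build a parser object p
    p.setContentHandler(handler)   # install the handler on p
    p.parse(stream)                # feed p the input stream
    return handler.counts

# an in-memory stream used only for illustration
sample = io.BytesIO(b"<doc><item/><item/><note>hi</note></doc>")
counts = count_elements(sample)
```

All application-specific state lives in the handler instance, so the same pattern works unchanged for a file opened in binary mode.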

Optionally, you can also register with parser p other kinds of handlers besides the content handler. You can supply a custom error handler to use an error diagnosis strategy different from normal exception raising, and try to diagnose several errors during a parse. You can supply a custom DTD handler to receive information about notation and unparsed entities from the XML document's Document Type Definition (DTD). You can supply a custom entity resolver to handle external entity references in advanced, customized ways. These additional possibilities are advanced and rarely used, so I do not cover them in this book.

23.2.1 The xml.sax Package

The xml.sax package supplies exception class SAXException, and subclasses of it to support fine-grained exception handling. xml.sax also supplies three functions.

make_parser

make_parser(parsers_list=[])

parsers_list is a list of strings, names of modules from which you would like to build your parser. make_parser tries each module in sequence until it finds one that defines a suitable function create_parser. After the modules in parsers_list, if any, make_parser continues by trying a list of default modules. make_parser terminates as soon as it can generate a parser p, and returns p.

parse

parse(file,handler,error_handler=None)

file is a filename or a file-like object open for reading, containing an XML document. handler is generally an instance of your own subclass of class ContentHandler, covered later in this chapter. error_handler, if given, is generally an instance of your own subclass of class ErrorHandler. You don't necessarily have to subclass ContentHandler and/or ErrorHandler: you just need to provide the same interfaces as the classes do. Subclassing is often a convenient means to this end.

Function parse is equivalent to the code:

p = make_parser(  )
p.setContentHandler(handler)
if error_handler is not None: 
    p.setErrorHandler(error_handler)
p.parse(file)

This idiom is quite frequent in SAX parsing, so having it in a single function is convenient. When error_handler is None, the parser diagnoses errors by propagating an exception that is an instance of some subclass of SAXException.

parseString

parseString(string,handler,error_handler=None)

Like parse, except that string is the XML document in string form.
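
As a hedged illustration of parseString and of the default exception-based error diagnosis, the following sketch checks whether a document is well formed. The function name check_document and the sample inputs are assumptions made for this example; SAXParseException is one of the subclasses of SAXException that xml.sax supplies. The sketch is written for Python 3, where the document can be passed as a bytes object.

```python
import xml.sax

def check_document(xml_bytes):
    """Return None if xml_bytes parses cleanly, else a short description."""
    try:
        # a plain ContentHandler ignores every event; only errors matter here
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as e:
        return "line %d, column %d: %s" % (
            e.getLineNumber(), e.getColumnNumber(), e.getMessage())
    return None

good = check_document(b"<doc><item>ok</item></doc>")
bad = check_document(b"<doc><item>oops</doc>")   # mismatched closing tag
```

Here good is None, while bad is a string describing where parsing failed. The exact wording of the message depends on the underlying parser.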

xml.sax also supplies a class, which you subclass to define your content handler.

ContentHandler

class ContentHandler(  )

An instance h of a subclass of ContentHandler may override several methods, of which the most frequently useful are the following:

h.characters( data)

Called when textual content data is parsed. The parser may split each range of text in the document into any number of separate callbacks to h.characters. Therefore, your implementation of method characters usually buffers data, generally by appending it to a list attribute. When your class knows from some other event that all relevant data has arrived, your class calls ''.join on the list and processes the resulting string.

h.endDocument( )

Called once when the document finishes.

h.endElement( tag)

Called when the element named tag finishes.

h.endElementNS( name,qname)

Called when an element finishes and the parser is handling namespaces. name and qname are the same as for startElementNS, covered later in this chapter.

h.startDocument( )

Called once when the document begins.

h.startElement( tag,attrs)

Called when the element named tag begins. attrs is a mapping of attribute names to values, as covered in the next section.

h.startElementNS( name,qname,attrs)

Called when an element begins and the parser is handling namespaces. name is a pair (uri,localname), where uri is the namespace's URI or None, and localname is the name of the tag. qname (which stands for qualified name) is either None, if the parser does not supply the namespace prefixes feature, or the string prefix:name used in the document's text for this tag. attrs is a mapping of attribute names to values, as covered in the next section.
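
The buffering technique described for method characters can be sketched as follows. The handler class TitleHandler, the element name title, and the sample document are assumptions made for illustration only. The entity reference in the first title makes it likely that the parser delivers that text in several separate calls to characters, which is exactly the case that buffering handles. The sketch is written for Python 3.

```python
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.titles = []     # finished strings, one per <title>
        self.buffer = None   # list of text chunks while inside a <title>

    def startElement(self, tag, attrs):
        if tag == 'title':
            self.buffer = []            # start collecting text

    def characters(self, data):
        if self.buffer is not None:
            self.buffer.append(data)    # may be called many times per text run

    def endElement(self, tag):
        if tag == 'title':
            # the element is finished, so all of its text has arrived
            self.titles.append(''.join(self.buffer))
            self.buffer = None

doc = (b"<books><book><title>SAX &amp; DOM</title></book>"
       b"<book><title>Python</title></book></books>")
handler = TitleHandler()
xml.sax.parseString(doc, handler)
titles = handler.titles
```

The endElement event is the "other event" that tells the class all relevant data has arrived. This sketch does not try to handle a title element nested inside another title.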

23.2.1.1 Attributes

The last argument of methods startElement and startElementNS is an attributes object attr, a read-only mapping of attribute names to attribute values. For method startElement, names are identifier strings. For method startElementNS, names are pairs (uri,localname), where uri is the namespace's URI or None, and localname is the local name of the attribute. The object attr also supports methods that let you work with the qname (qualified name) of each attribute.

getValueByQName

attr.getValueByQName(name)

Returns the attribute value for a qualified name name.

getNameByQName

attr.getNameByQName(name)

Returns the (namespace, localname) pair for a qualified name name.

getQNameByName

attr.getQNameByName(name)

Returns the qualified name for name, which is a (namespace, localname) pair.

getQNames

attr.getQNames(  )

Returns the list of qualified names of all attributes.

For startElement, each qname is the same string as the corresponding name. For startElementNS, a qname is the corresponding local name for attributes not associated with a namespace (i.e., attributes whose uri is None); otherwise, the qname is the string prefix:name used in the document's text for this attribute.

The parser may reuse the attr object that it passes to methods startElement and startElementNS in later processing. If you need to keep the attributes of an element after the method returns, call attr.copy( ) and keep the copy.
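
The following sketch shows the attribute methods at work when the parser handles namespaces. The class name AttrRecorder, the namespace URI http://example.com/ns, and the sample document are invented for this example. Namespace handling is switched on through the feature feature_namespaces of module xml.sax.handler. The sketch assumes Python 3 and the default parser that make_parser returns.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

class AttrRecorder(xml.sax.ContentHandler):
    """Hypothetical handler: keeps a copy of each non-empty attributes object."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.saved = []

    def startElementNS(self, name, qname, attrs):
        if len(attrs) > 0:
            # copy, since the parser may reuse the attrs object later
            self.saved.append(attrs.copy())

doc = b'<doc xmlns:x="http://example.com/ns"><a x:href="target" id="a1"/></doc>'
handler = AttrRecorder()
p = xml.sax.make_parser()
p.setFeature(feature_namespaces, True)   # ask the parser to handle namespaces
p.setContentHandler(handler)
p.parse(io.BytesIO(doc))

attrs = handler.saved[0]
pair = ("http://example.com/ns", "href")  # (uri, localname) of the attribute
value = attrs[pair]                       # mapping access by name
qname = attrs.getQNameByName(pair)
name = attrs.getNameByQName("x:href")
id_value = attrs.getValueByQName("id")    # attribute without a namespace
qnames = sorted(attrs.getQNames())
```

The namespace declaration xmlns:x itself is not reported as an attribute here, so only element a contributes an entry to handler.saved.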

23.2.1.2 Incremental parsing

All parsers support a method parse, which you call with the source of the XML document: either a string that names the file holding the document, or a file-like object open for reading. parse does not return until the end of the XML document. Most SAX parsers, though not all, also support incremental parsing, letting you feed the XML document to the parser a little at a time, as the document arrives from a network connection or other source. A parser p that is capable of incremental parsing supplies three more methods.

close

p.close(  )

Call when the XML document is finished.

feed

p.feed(data)

Passes to the parser a part of the document. The parser processes some prefix of the text and holds the rest in a buffer until the next call to p.feed or p.close.

reset

p.reset(  )

Call after an XML document is finished or abandoned, before you start feeding another XML document to the parser.
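
A minimal sketch of incremental parsing follows, assuming that the default parser returned by make_parser supports it. The class TagLogger, the function parse_documents, and the way the sample documents are cut into chunks are all invented for illustration. Note that chunk boundaries may fall in the middle of a tag. The sketch is written for Python 3.

```python
import xml.sax

class TagLogger(xml.sax.ContentHandler):
    """Hypothetical handler: records element names in document order."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.tags = []

    def startElement(self, tag, attrs):
        self.tags.append(tag)

def parse_documents(documents):
    """documents is a list of documents, each given as a list of byte chunks."""
    p = xml.sax.make_parser()
    results = []
    for chunks in documents:
        handler = TagLogger()
        p.setContentHandler(handler)
        for chunk in chunks:
            p.feed(chunk)    # pass one part of the document
        p.close()            # the document is finished
        p.reset()            # get ready for the next document
        results.append(handler.tags)
    return results

results = parse_documents([
    [b"<doc><fir", b"st/><sec", b"ond/></doc>"],   # chunks split mid-tag
    [b"<other/>"],
])
```

The same parser object serves both documents. Each document gets a fresh handler so that the recorded tags do not mix.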

23.2.1.3 The xml.sax.saxutils module

The saxutils module of package xml.sax supplies two functions and a class that are quite handy to generate XML output based on an input XML document.

escape

escape(data,entities={})

Returns a copy of string data with characters <, >, and & changed into entity references &lt;, &gt;, and &amp;. entities is a dictionary with strings as keys and values; each substring s of data that is a key in entities is changed in escape's result string into string entities[s]. For example, to escape single and double quote characters, in addition to angle brackets and ampersands, you can call:

xml.sax.saxutils.escape(data,{'"':'&quot;', "'":"&apos;"})

quoteattr

quoteattr(data,entities={})

Same as escape, but also quotes the result string to make it immediately usable as an attribute value, and escapes any quote characters that have to be escaped.
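
The following short sketch contrasts the two functions. The sample strings are invented for illustration. The choice of quote character in the results of quoteattr reflects the behavior of the standard library implementation, which picks whichever quote needs no escaping when it can.

```python
from xml.sax.saxutils import escape, quoteattr

# escape handles <, >, and & by default
escaped = escape('AT&T <b>')            # 'AT&amp;T &lt;b&gt;'

# extra replacements come from the entities dictionary
escaped_quotes = escape('say "hi"', {'"': '&quot;'})

# quoteattr escapes and also wraps the result in quotes
plain = quoteattr('plain')              # '"plain"'
mixed = quoteattr('say "hi" & bye')     # wrapped in single quotes

# the quoted value can be pasted directly into a start tag
element = '<note title=%s>%s</note>' % (mixed, escaped)
```

Because quoteattr supplies the surrounding quotes itself, the format string that builds element must not add quotes of its own.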

XMLGenerator

class XMLGenerator(out=sys.stdout, encoding='iso-8859-1')

Subclasses xml.sax.ContentHandler and implements all that is needed to reproduce the input XML document on the given file-like object out with the specified encoding. When you must generate an XML document that is a small modification of the input one, you can subclass XMLGenerator, overriding methods and delegating most of the work to XMLGenerator's implementations of the methods. For example, if all you need to do is rename some tags according to a dictionary, XMLGenerator makes it quite simple, as shown in the following example:

import xml.sax, xml.sax.saxutils

def tagrenamer(infile, outfile, renaming_dict):
    base = xml.sax.saxutils.XMLGenerator

    class Renamer(base):
        def rename(self, name):
            return renaming_dict.get(name, name)
        def startElement(self, name, attrs):
            base.startElement(self, self.rename(name),
                              attrs)
        def endElement(self, name):
            base.endElement(self, self.rename(name))

    xml.sax.parse(infile, Renamer(outfile))
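
The same delegation pattern applies to other events. As a further hedged sketch, the following subclass changes text content instead of tag names and writes the result to an in-memory stream so that the output can be inspected. The class name UpperCaser and the sample document are assumptions made for this example, which is written for Python 3.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

class UpperCaser(XMLGenerator):
    """Hypothetical generator: copies the document, uppercasing all text."""
    def characters(self, content):
        # delegate the actual writing (and escaping) to XMLGenerator
        XMLGenerator.characters(self, content.upper())

out = io.StringIO()   # in-memory text stream standing in for an output file
xml.sax.parseString(b"<doc><p>hello &amp; bye</p></doc>",
                    UpperCaser(out, encoding="utf-8"))
result = out.getvalue()
```

XMLGenerator writes an XML declaration first and escapes text on output, so the ampersand in the text appears again as &amp; in result.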

23.2.2 Parsing XHTML with xml.sax

The following example uses xml.sax to perform a typical XHTML-related task, very similar to the tasks performed in the examples of Chapter 22. The example fetches an XHTML page from the Web with urllib, parses it, and outputs all unique links from the page to other sites. The example uses urlparse to examine the links for the given site, and outputs only the links whose URLs have an explicit scheme of 'http':

import xml.sax, urllib, urlparse

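class LinksHandler(xml.sax.ContentHandler):
    def startDocument(self):
        # dictionary used as a set of the href values already examined
        self.seen = {}
    def startElement(self, tag, attributes):
        # only anchor elements carry the links of interest
        if tag != 'a': return
        value = attributes.get('href')
        if value is not None and value not in self.seen:
            self.seen[value] = True
            pieces = urlparse.urlparse(value)
            # skip relative links and links with a scheme other than http
            if pieces[0] != 'http': return
            print urlparse.urlunparse(pieces)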

p = xml.sax.make_parser(  )
p.setContentHandler(LinksHandler(  ))
f = urllib.urlopen('http://www.w3.org/MarkUp/')
BUFSIZE = 8192

while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)

p.close(  )

This example is quite similar to the HTMLParser example in Chapter 22. With the xml.sax module, the parser and the handler are separate objects (while in the examples of Chapter 22 they coincided). Method names differ (startElement in this example versus handle_starttag in the HTMLParser example). The attributes argument is a mapping here, so its method get immediately gives us the attribute value we're interested in, while in the examples of Chapter 22 it was a sequence of (name,value) pairs, so we had to loop on the sequence until we found the right name. Despite these differences in detail, the overall structure is very close, and typical of simple event-driven parsing tasks.
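
The example above targets the Python version that this book covers. In Python 3, the functions of module urlparse live in urllib.parse. The following is a rough equivalent for Python 3 that parses an in-memory page, so that it runs without network access. The class name LinkCollector and the sample page are invented for this sketch, and the handler collects the links in a list instead of printing them.

```python
import xml.sax
from urllib.parse import urlparse

class LinkCollector(xml.sax.ContentHandler):
    """Hypothetical handler: gathers unique http links from <a> elements."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.seen = set()    # href values already examined
        self.links = []      # unique links with an explicit http scheme

    def startElement(self, tag, attributes):
        if tag != 'a':
            return
        value = attributes.get('href')
        if value is None or value in self.seen:
            return
        self.seen.add(value)
        if urlparse(value).scheme == 'http':
            self.links.append(value)

page = b"""<html><body>
<a href="http://www.python.org/">Python</a>
<a href="/local/page.html">a relative link</a>
<a href="http://www.python.org/">Python again</a>
<a href="http://www.w3.org/">W3C</a>
</body></html>"""

handler = LinkCollector()
xml.sax.parseString(page, handler)
links = handler.links
```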
