23.2 Parsing XML with SAX
In most cases, the best way to extract
information from an XML document is to parse the document with a
parser compliant with SAX, the Simple API for XML. SAX defines a
standard API that can be implemented on top of many different
underlying parsers. The SAX approach to parsing has similarities to
the HTML parsers covered in Chapter 22. As the
parser encounters XML elements, text contents, and other significant
events in the input stream, the parser calls back to methods of your
classes. Such event-driven parsing, based on callbacks to your
methods as relevant events occur, also has similarities to the
event-driven approach that is almost universal in GUIs and in some
networking frameworks. Event-driven approaches in various programming
fields may not appear natural to beginners, but enable high
performance and particularly high scalability, making them very
suitable for high-workload cases.
To use SAX, you define a content handler class, subclassing a library
class and overriding some methods. Then, you build a parser object
p, install an instance of your class as
p's handler, and feed
p the input stream to parse.
p calls methods on your handler to reflect
the document's structure and contents. Your
handler's methods perform application-specific
processing. The xml.sax package supplies a factory
function to build p, as well as
convenience functions for simpler operation in typical cases.
xml.sax also supplies exception classes, used to
diagnose invalid input and other errors.
Optionally, you can
also register with parser p other kinds of
handlers besides the content handler. You can supply a custom error
handler to use an error diagnosis strategy different from normal
exception raising, and try to diagnose several errors during a parse.
You can supply a custom DTD handler to receive information about
notation and unparsed entities from the XML
document's Document Type Definition (DTD). You can
supply a custom entity resolver to handle external entity references
in advanced, customized ways. These additional possibilities are
advanced and rarely used, so I do not cover them in this
book.
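The workflow just described can be sketched in a few lines. This is only a minimal illustration, not a complete application: the class name TagCounter and the sample document are made up for the example, and the sketch uses the parseString convenience function of xml.sax, covered later in this section.

```python
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""
    def startDocument(self):
        self.counts = {}
    def startElement(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

# A made-up document, passed as a byte string.
document = b"<menu><item>tea</item><item>cake</item></menu>"

handler = TagCounter()
xml.sax.parseString(document, handler)   # builds a parser and runs it
for tag in sorted(handler.counts):
    print("%s: %d" % (tag, handler.counts[tag]))
```

The handler keeps its results as instance attributes, so the calling code can read them once parsing is done.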
23.2.1 The xml.sax Package
The xml.sax package
supplies exception class SAXException, and
subclasses of it to support fine-grained exception handling.
xml.sax also supplies three
functions.
make_parser(parsers_list=[])
parsers_list is a list of strings, names
of modules from which you would like to build your parser.
make_parser tries each module in sequence until it
finds one that defines a suitable function
create_parser. After the modules in
parsers_list, if any,
make_parser continues by trying a list of default
modules. make_parser terminates as soon as it can
generate a parser p, and returns
p.
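As a rough sketch of how make_parser is typically used, the following builds a parser, installs a handler, and parses an in-memory file-like object. The handler class NameRecorder and the sample document are assumptions made for illustration; 'xml.sax.expatreader' is the parser module shipped with standard Python, named here only to show how a preference is expressed.

```python
import io
import xml.sax

class NameRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records the names of the elements seen."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.names = []
    def startElement(self, tag, attrs):
        self.names.append(tag)

# Prefer the expat-based module; make_parser falls back to its
# default modules if a preferred one cannot be used.
p = xml.sax.make_parser(['xml.sax.expatreader'])
p.setContentHandler(NameRecorder())
p.parse(io.BytesIO(b"<a><b/><c/></a>"))   # any file-like object open for reading
names = p.getContentHandler().names
print(names)
```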
parse(file,handler,error_handler=None)
file is a filename or a file-like object
open for reading, containing an XML document.
handler is generally an instance of your
own subclass of class ContentHandler, covered
later in this chapter. error_handler, if
given, is generally an instance of your own subclass of class
ErrorHandler. You don't
necessarily have to subclass ContentHandler and/or
ErrorHandler: you just need to provide the same
interfaces as the classes do. Subclassing is often a convenient means
to this end.
Function parse is equivalent to the code:
p = make_parser( )
p.setContentHandler(handler)
if error_handler is not None:
    p.setErrorHandler(error_handler)
p.parse(file)
This idiom is quite frequent in SAX parsing, so having it in a single
function is convenient. When error_handler
is None, the parser diagnoses errors by
propagating an exception that is an instance of some subclass of
SAXException.
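To illustrate the default error behavior, here is a minimal sketch that parses a deliberately ill-formed document without a custom error handler and catches the resulting exception. The names broken and problem, and the sample input, are assumptions for the example; SAXParseException is one of the SAXException subclasses that xml.sax supplies.

```python
import io
import xml.sax

# Made-up, ill-formed input: <open> is never closed.
broken = io.BytesIO(b"<doc><open></doc>")

try:
    # The base ContentHandler ignores every event, which is enough here.
    xml.sax.parse(broken, xml.sax.ContentHandler())
except xml.sax.SAXParseException as e:
    # The exception records where the parser gave up, and why.
    problem = (e.getLineNumber(), e.getMessage())
    print("parse failed at line %d: %s" % problem)
```

Catching the SAXException base class instead would also handle other SAX errors, at the cost of losing the position information that is specific to parse errors.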
parseString(string,handler,error_handler=None)
Like parse, except that
string is the XML document in string form.
xml.sax also supplies a class, ContentHandler, which you subclass
to define your content handler.
An instance h of a subclass of
ContentHandler may override several methods, of
which the most frequently useful are the following:
- h.characters(data)
Called when textual content
data is parsed. The parser may split each
range of text in the document into any number of separate callbacks
to h.characters.
Therefore, your implementation of method
characters usually buffers
data, generally by appending it to a list
attribute. When your class knows from some other event that all
relevant data has arrived, your class calls
''.join on the list and processes the resulting
string.
- h.endDocument( )
Called once when the document finishes.
- h.endElement(tag)
Called when the element named
tag finishes.
- h.endElementNS(name,qname)
Called when an element finishes and the
parser is handling namespaces. name and
qname have the same meaning as in
startElementNS, covered later in this chapter.
- h.startDocument( )
Called once when the document begins.
- h.startElement(tag,attrs)
Called when the element named
tag begins.
attrs is a mapping of attribute names to
values, as covered in the next section.
- h.startElementNS(name,qname,attrs)
Called when an element begins and the
parser is handling namespaces. name is a
pair
(uri,localname),
where uri is the
namespace's URI or None, and
localname is the name of the tag.
qname (which stands for qualified name) is
either None, if the parser does not supply the
namespace prefixes feature, or the string
prefix:name
used in the document's text for this tag.
attrs is a mapping of attribute names to
values, as covered in the next section.
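The buffering technique described for method characters can be sketched as follows. The handler name TitleCollector, the element name title, and the sample document are all assumptions chosen for illustration; the point is only that text is accumulated in a list and joined when endElement signals that the element is complete.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler: gathers the full text of each <title> element."""
    def startDocument(self):
        self.titles = []
        self.buffer = None            # None means "not inside a <title>"
    def startElement(self, tag, attrs):
        if tag == 'title':
            self.buffer = []          # start collecting pieces of text
    def characters(self, data):
        if self.buffer is not None:
            self.buffer.append(data)  # may be called several times per text run
    def endElement(self, tag):
        if tag == 'title' and self.buffer is not None:
            self.titles.append(''.join(self.buffer))
            self.buffer = None

doc = (b"<books><book><title>Tom &amp; Jerry</title></book>"
       b"<book><title>SAX</title></book></books>")
h = TitleCollector()
xml.sax.parseString(doc, h)
print(h.titles)
```

Joining the pieces only in endElement makes the result independent of how the parser happens to split the text, for example around the entity reference in the first title.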
23.2.1.1 Attributes
The
last argument of methods startElement and
startElementNS is an attributes object
attr, a read-only mapping of attribute
names to attribute values. For method
startElement, names are identifier strings. For
method startElementNS, names are pairs
(uri,localname),
where uri is the
namespace's URI or None, and
localname is the local name of the attribute. The
object attr also supports methods that let
you work with the qname (qualified name)
of each attribute.
attr.getValueByQName(name)
Returns the attribute value for a
qualified name name.
attr.getNameByQName(name)
Returns the
(namespace,
localname) pair for a
qualified name name.
attr.getQNameByName(name)
Returns the qualified name for name, which
is a
(namespace,
localname) pair.
attr.getQNames( )
Returns the list of qualified names of all attributes.
For startElement, each
qname is the same string as the
corresponding name. For startElementNS, a
qname is the corresponding local name for
attributes not associated with a namespace (i.e., attributes whose
uri is None);
otherwise, the qname is the string
prefix:name
used in the document's text for this attribute.
The parser may reuse in later processing the
attr object that it passes to methods
startElement and
startElementNS. If you need to keep a copy of the
attributes of an element, call
attr.copy( ) to get the
copy.
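The following sketch shows the attributes object in a namespace-aware parse. It assumes the default expat-based parser, turns on namespace handling with the feature_namespaces constant of module xml.sax.handler, and uses a made-up document with the hypothetical namespace URI urn:example; the class name NSHandler and its attribute names are likewise illustrative only.

```python
import io
import xml.sax
import xml.sax.handler

class NSHandler(xml.sax.ContentHandler):
    """Hypothetical handler that inspects namespace-qualified attributes."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.elements = []
        self.saved = None
    def startElementNS(self, name, qname, attrs):
        uri, localname = name
        self.elements.append((uri, localname))
        if localname == 'item':
            self.saved = attrs.copy()     # keep a private copy for later use
            self.qnames = sorted(attrs.getQNames())
            self.id_by_qname = attrs.getValueByQName('x:id')
            self.plain_name = attrs.getNameByQName('plain')
            self.id_qname = attrs.getQNameByName(('urn:example', 'id'))

doc = b'<x:list xmlns:x="urn:example"><x:item x:id="7" plain="p"/></x:list>'
p = xml.sax.make_parser()
p.setFeature(xml.sax.handler.feature_namespaces, True)  # enable the *NS callbacks
h = NSHandler()
p.setContentHandler(h)
p.parse(io.BytesIO(doc))
print(h.elements)
print(h.qnames)
```

After the parse, h.saved is still usable as a mapping keyed by (uri,localname) pairs, regardless of what the parser does with the object it originally passed in.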
23.2.1.2 Incremental parsing
All parsers support a method
parse, which you call with the source of the XML document: either
a string that names it (typically a filename or URL) or a file-like
object open for reading.
parse does not return until the end of the XML
document. Most SAX parsers, though not all, also support incremental
parsing, letting you feed the XML document to the parser a little at
a time, as the document arrives from a network connection or other
source. A parser p that is capable of
incremental parsing supplies three more methods.
p.close( )
Call when the XML document is finished.
p.feed(data)
Passes to the parser a part of the document. The parser processes
some prefix of the text and holds the rest in a buffer until the next
call to p.feed or
p.close.
p.reset( )
Call after an XML document is finished or abandoned, before you start
feeding another XML document to the parser.
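A minimal sketch of incremental parsing follows, assuming the default parser returned by make_parser supports it (the expat-based parser in standard Python does). The chunk boundaries, the class name Names, and the sample documents are made up; note that chunks may split the document anywhere, even in the middle of a tag.

```python
import xml.sax

class Names(xml.sax.ContentHandler):
    """Hypothetical handler that records element names as they start."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.names = []
    def startElement(self, tag, attrs):
        self.names.append(tag)

h = Names()
p = xml.sax.make_parser()
p.setContentHandler(h)

# Pretend these pieces arrive one at a time, e.g., from a socket.
chunks = [b"<log><ent", b"ry/><entry", b"/></log>"]
for chunk in chunks:
    p.feed(chunk)
p.close()                  # the first document is complete
first = list(h.names)

p.reset()                  # get ready for an unrelated second document
del h.names[:]
p.feed(b"<other/>")
p.close()
second = list(h.names)
print(first, second)
```

The sketch inspects the handler's results only after p.close, since the parser is free to hold back part of the fed text until then.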
23.2.1.3 The xml.sax.saxutils module
The saxutils module of
package xml.sax supplies two functions and a class
that are quite handy to generate XML output based on an input XML
document.
escape(data,entities={})
Returns a copy of string data with
characters <, >, and
& changed into entity references
&lt;, &gt;, and
&amp;. entities is
a dictionary with strings as keys and values; each substring
s of data that
is a key in entities is changed in
escape's result string into
string
entities[s].
For example, to escape single and double quote characters, in
addition to angle brackets and ampersands, you can call:
xml.sax.saxutils.escape(data,{'"':'&quot;', "'":'&apos;'})
quoteattr(data,entities={})
Same as escape, but also quotes the result string
to make it immediately usable as an attribute value, and escapes any
quote characters that have to be escaped.
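A short sketch of the two functions in use; the sample strings and the variable names are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

text = 'a < b & "c"'

# Default behavior: only <, >, and & are replaced.
plain = escape(text)
print(plain)

# With an entities dictionary, quote characters are replaced too.
full = escape(text, {'"': '&quot;', "'": '&apos;'})
print(full)

# quoteattr adds the surrounding quotes itself, picking a quote
# character that does not clash with the ones inside the value.
title = 'The "SAX" way'
tag = '<book title=%s/>' % quoteattr(title)
print(tag)
```

Because quoteattr supplies the quotes, the format string must not add its own around %s.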
class XMLGenerator(out=sys.stdout, encoding='iso-8859-1')
Subclasses
xml.sax.ContentHandler and implements all that is
needed to reproduce the input XML document on the given file-like
object out with the specified
encoding. When you must generate an XML
document that is a small modification of the input one, you can
subclass XMLGenerator, overriding methods and
delegating most of the work to
XMLGenerator's implementations of
the methods. For example, if all you need to do is rename some tags
according to a dictionary, XMLGenerator makes it
quite simple, as shown in the following example:
import xml.sax, xml.sax.saxutils
def tagrenamer(infile, outfile, renaming_dict):
    base = xml.sax.saxutils.XMLGenerator
    class Renamer(base):
        def rename(self, name):
            return renaming_dict.get(name, name)
        def startElement(self, name, attrs):
            base.startElement(self, self.rename(name),
                              attrs)
        def endElement(self, name):
            base.endElement(self, self.rename(name))
    xml.sax.parse(infile, Renamer(outfile))
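To try the function without touching the filesystem, one possibility is to pass in-memory streams. The sketch below repeats the definition of tagrenamer so that it is self-contained; the use of module io (as found in current Python versions), the sample document, and the renaming dictionary are assumptions made for the example.

```python
import io
import xml.sax
import xml.sax.saxutils

def tagrenamer(infile, outfile, renaming_dict):
    base = xml.sax.saxutils.XMLGenerator
    class Renamer(base):
        def rename(self, name):
            return renaming_dict.get(name, name)
        def startElement(self, name, attrs):
            base.startElement(self, self.rename(name), attrs)
        def endElement(self, name):
            base.endElement(self, self.rename(name))
    xml.sax.parse(infile, Renamer(outfile))

source = io.BytesIO(b"<doc><old>hi</old></doc>")   # made-up input document
target = io.StringIO()                             # collects the generated XML
tagrenamer(source, target, {'old': 'new'})
result = target.getvalue()
print(result)
```

The output starts with an XML declaration written by XMLGenerator, followed by the document with every old element renamed to new and everything else reproduced as is.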
23.2.2 Parsing XHTML with xml.sax
The following
example uses xml.sax to perform a typical
XHTML-related task, very similar to the tasks performed in the
examples of Chapter 22. The example fetches an
XHTML page from the Web with urllib, parses it,
and outputs all unique links from the page to other sites. The
example uses urlparse to examine the links for the
given site, and outputs only the links whose URLs have an explicit
scheme of 'http':
import xml.sax, urllib, urlparse
class LinksHandler(xml.sax.ContentHandler):
    def startDocument(self):
        self.seen = {}
    def startElement(self, tag, attributes):
        if tag != 'a': return
        value = attributes.get('href')
        if value is not None and value not in self.seen:
            self.seen[value] = True
            pieces = urlparse.urlparse(value)
            if pieces[0] != 'http': return
            print urlparse.urlunparse(pieces)
p = xml.sax.make_parser( )
p.setContentHandler(LinksHandler( ))
f = urllib.urlopen('http://www.w3.org/MarkUp/')
BUFSIZE = 8192
while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)
p.close( )
This example is quite similar to the HTMLParser
example in Chapter 22. With the
xml.sax module, the parser and the handler are
separate objects (while in the examples of Chapter 22 they coincided). Method names differ
(startElement in this example versus
handle_starttag in the
HTMLParser example). The
attributes argument is a mapping here, so
its method get immediately gives us the attribute
value we're interested in, while in the examples of
Chapter 22 it was a sequence of
(name,value)
pairs, so we had to loop on the sequence until we found the right
name. Despite these differences in detail, the overall structure is
very close, and typical of simple event-driven parsing
tasks.
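For experimenting without network access, the same handler logic can be exercised on an in-memory page. The following is only a sketch under stated assumptions: the class name LinkCollector, the sample page, and the example.com URLs are made up, the links are collected in a list rather than printed, and the import is written so that it works both with the urlparse module used in this chapter and with its current counterpart urllib.parse.

```python
import xml.sax
try:
    from urllib.parse import urlparse    # current Python versions
except ImportError:
    from urlparse import urlparse        # the module used in this chapter

class LinkCollector(xml.sax.ContentHandler):
    """Hypothetical variant of LinksHandler that stores links in a list."""
    def startDocument(self):
        self.seen = set()
        self.links = []
    def startElement(self, tag, attributes):
        if tag != 'a':
            return
        value = attributes.get('href')   # None if the tag has no href
        if value is None or value in self.seen:
            return
        self.seen.add(value)
        if urlparse(value)[0] == 'http': # keep explicit http links only
            self.links.append(value)

page = (b'<html><body>'
        b'<a href="http://example.com/">one</a>'
        b'<a href="/local/page">two</a>'
        b'<a href="http://example.com/">again</a>'
        b'<a name="anchor">no href</a>'
        b'</body></html>')
h = LinkCollector()
xml.sax.parseString(page, h)
print(h.links)
```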