23.3 Parsing XML with DOM
SAX parsing
does not build any structure in memory to represent the XML document.
This makes SAX fast and highly scalable, as your application builds
exactly as little or as much in-memory structure as needed for its
specific tasks. However, for particularly complicated processing
tasks involving reasonably small XML documents, you may prefer to let
the library build in-memory structures that represent the whole XML
document, and then traverse those structures. The XML standards
describe the DOM (Document Object Model) for XML. A DOM object
represents an XML document as a tree whose root is the
document object, while other nodes correspond to
elements, text contents, element attributes, and so
on.
The Python standard library supplies a minimal implementation of the
XML DOM standard, xml.dom.minidom.
minidom builds everything up in memory, with the
typical pros and cons of the DOM approach to parsing. The Python
standard library also supplies a different DOM-like approach in
module xml.dom.pulldom. pulldom
occupies an interesting middle ground between SAX and DOM, presenting
the stream of parsing events as a Python iterator object so that you
do not code callbacks, but rather loop over the events and examine
each event to see if it's of interest. When you do
find an event of interest to your application, you can ask
pulldom to build the DOM subtree rooted in that
event's node by calling method
expandNode, and then work with that subtree as you
would in minidom. Paul Prescod,
pulldom's author and an XML and
Python expert, describes the net result as "80% of
the performance of SAX, 80% of the convenience of
DOM." Other DOM parsers are part of the PyXML and
4Suite extension packages, mentioned at the start of this chapter.
23.3.1 The xml.dom Package
The xml.dom package
supplies exception class DOMException and
subclasses of it to support fine-grained exception handling.
xml.dom also supplies a class
Node, typically used as a base class for all nodes
by DOM implementations. Class Node only supplies
constant attributes giving the codes for node types, such as
ELEMENT_NODE for elements,
ATTRIBUTE_NODE for attributes, and so on.
xml.dom also supplies constant module attributes
with the URIs of important namespaces:
XML_NAMESPACE, XMLNS_NAMESPACE,
XHTML_NAMESPACE, and
EMPTY_NAMESPACE.
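As a minimal sketch of what the package exposes (written in Python 3 syntax), the following prints a few node-type codes and namespace URIs, and catches a DOMException subclass by its base class. Raising NotFoundErr by hand is only an illustration here; in practice such exceptions come from DOM operations.

```python
import xml.dom
from xml.dom import Node

# Node-type codes are small integers exposed as class attributes of Node.
print(Node.ELEMENT_NODE, Node.ATTRIBUTE_NODE, Node.TEXT_NODE)

# Namespace URIs are plain module-level constants.
print(xml.dom.XML_NAMESPACE)
print(xml.dom.XHTML_NAMESPACE)

# DOMException subclasses carry a numeric code for fine-grained handling.
try:
    raise xml.dom.NotFoundErr("no such node")   # raised by hand, for illustration
except xml.dom.DOMException as exc:
    caught_name = type(exc).__name__
    code_matches = (exc.code == xml.dom.NOT_FOUND_ERR)
    print(caught_name, code_matches)
```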
23.3.2 The xml.dom.minidom Module
The xml.dom.minidom
module supplies two functions.
parse(file,parser=None)
file is a filename or a file-like object
open for reading, containing an XML document.
parser, if given, is an instance of a SAX
parser class; otherwise, parse generates a default
SAX parser by calling xml.sax.make_parser( ).
parse returns a minidom
document object instance representing the given XML document.
parseString(string,parser=None)
Like parse, except that
string is the XML document in string form.
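The two functions can be compared side by side. This is a small sketch in Python 3 syntax; the sample document and the use of an in-memory buffer in place of a real file are assumptions made for the illustration.

```python
import io
import xml.dom.minidom

XML_TEXT = "<greeting lang='en'>hello</greeting>"

# parseString takes the whole document as a string.
doc1 = xml.dom.minidom.parseString(XML_TEXT)

# parse takes a filename or a file-like object open for reading;
# an in-memory bytes buffer stands in for a real file here.
doc2 = xml.dom.minidom.parse(io.BytesIO(XML_TEXT.encode("utf-8")))

root1 = doc1.documentElement
root2 = doc2.documentElement
print(root1.tagName, root1.getAttribute("lang"))
print(root1.toxml() == root2.toxml())   # both calls build equivalent trees
```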
xml.dom.minidom also supplies many classes as
specified by the XML DOM standard. Almost all of these classes
subclass Node. Class Node
supplies the methods and attributes that all kinds of nodes have in
common. A notable class of module xml.dom.minidom
that is not a subclass of Node is
AttributeList, identified in the DOM standard as
NamedNodeMap, which is a mapping that collects the
attributes of a node of class
Element.
For methods and attributes related to changing and creating XML
documents, see Section 23.4 later in this chapter. Here, I present the
classes, methods, and attributes that you use most often when
traversing a DOM tree without changes, normally after the tree has
been built by parsing an XML document. For concreteness and
simplicity, I mention Python classes. However, the DOM specifications
deal strictly with abstract interfaces, never with concrete classes.
Your code must never deal with the class objects directly, only with
instances of those classes. Do not type-test nodes (for example,
don't use isinstance on them) and
do not instantiate node classes directly (rather, use the factory
methods covered later in Section 23.4). This is good Python
practice in general, but it's particularly important
here.
23.3.2.1 Node objects
Each node n
in the DOM tree is an instance of some subclass of
Node; therefore n
supplies all attributes and methods that Node
supplies, with appropriate overriding implementations if needed. The
most frequently used methods and attributes are as follows.
The
n.attributes attribute
is either None or an
AttributeList instance with all attributes of
n.
The
n.childNodes attribute
is a list of all nodes that are children of
n, possibly an empty list.
The
n.firstChild attribute
is None when
n.childNodes is empty,
otherwise like
n.childNodes[0].
n.hasChildNodes( )
Like
len(n.childNodes)!=0,
but possibly faster.
n.isSameNode(other)
True
when n and
other refer to the same DOM node,
otherwise False. Do not use the normal Python
idiom n is
other: a Python DOM implementation is free
to generate multiple Node instances that refer to
the same DOM node. Therefore, to check the identity of DOM node
references, always and exclusively use method
isSameNode.
The
n.lastChild attribute
is None when
n.childNodes is empty,
otherwise like
n.childNodes[-1].
The
n.localName attribute
is the local part of n's
qualified name (relevant when namespaces are involved).
The
n.namespaceURI
attribute is None when
n's qualified name has no
namespace part, otherwise the namespace's URI.
The
n.nextSibling attribute
is None when n is the
last child of n's parent,
otherwise the next child of
n's parent.
The
n.nodeName attribute is
n's name string. The
string is a node-specific name when that makes sense for
n's node type (e.g., the
tag name when n is an
Element), otherwise a string starting with
'#'.
The
n.nodeType attribute is
n's type code, an integer
that is one of the constant attributes of class
Node.
The
n.nodeValue attribute
is None when n has no
value (e.g., when n is an
Element), otherwise
n's value (e.g., the text
content when n is an instance of class
Text).
n.normalize( )
Normalizes the entire subtree rooted at
n, merging adjacent
Text nodes. Parsing may separate ranges of text in
the XML document into arbitrary chunks; normalize
ensures that text ranges remain separate only when there is markup
between them.
The
n.ownerDocument
attribute is the Document instance that contains
n.
The
n.parentNode attribute
is n's parent node in the
DOM tree, or None for attribute nodes and nodes
not in the tree.
The
n.prefix attribute is
None when
n's qualified name has no
namespace prefix, otherwise the namespace prefix. Note that a name
may have a namespace even if it has no namespace prefix.
The
n.previousSibling
attribute is None when
n is the first child of
n's parent, otherwise the
previous child of n's
parent.
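The attributes and methods above are typically combined in a recursive walk. The sketch below (Python 3 syntax) is one way to do it under simple assumptions: the sample document and the helper name describe are invented for the illustration, and splitText, a standard DOM Text method, is called only to create adjacent Text nodes so that normalize has something to merge.

```python
import xml.dom.minidom
from xml.dom import Node

doc = xml.dom.minidom.parseString(
    "<order id='7'><item>tea</item><!-- gift --><item>milk</item></order>")

def describe(n, depth=0, out=None):
    """Collect one indented line per node in the subtree rooted at n."""
    if out is None:
        out = []
    # Dispatch on the nodeType code rather than on the Python class.
    if n.nodeType == Node.ELEMENT_NODE:
        label = "element " + n.nodeName
    elif n.nodeType == Node.TEXT_NODE:
        label = "text " + repr(n.nodeValue)
    else:
        label = n.nodeName              # a name starting with '#'
    out.append("  " * depth + label)
    for child in n.childNodes:
        describe(child, depth + 1, out)
    return out

lines = describe(doc)
print("\n".join(lines))

root = doc.documentElement
item = root.firstChild                  # the first <item> element
print(item.nextSibling.nodeName)        # the comment between the two items
print(root.lastChild.parentNode.isSameNode(root))
print(item.ownerDocument.isSameNode(doc))

item.firstChild.splitText(1)            # force two adjacent Text nodes
pieces_before = len(item.childNodes)
item.normalize()                        # merge them back into one
pieces_after = len(item.childNodes)
print(pieces_before, pieces_after)
```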
23.3.2.2 Attr objects
The Attr class is a subclass of
Node that represents an attribute of an
Element. Besides attributes and methods of class
Node, an instance a of
Attr supplies the following
attributes.
The
a.ownerElement
attribute is the Element instance of which
a is an attribute.
The
a.specified attribute
is true if a was explicitly specified in
the document, false if obtained by default.
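A short sketch of reaching Attr nodes through an element's attributes mapping (Python 3 syntax; the img element is just sample data):

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<img src='logo.png' alt='Logo'/>")
img = doc.documentElement

# n.attributes is a mapping-like object; index it by attribute name.
attrs = img.attributes
names = sorted(attrs.keys())
print(names)

src = attrs["src"]                       # an Attr node
print(src.nodeName, src.nodeValue)
print(src.ownerElement.isSameNode(img))  # the element that carries the attribute
print(src.parentNode is None)            # attributes are not children in the tree
```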
23.3.2.3 Document objects
The
Document class is a subclass of
Node whose instances are returned by the
parse and parseString functions
of module xml.dom.minidom. All nodes in the
document refer to the same Document node as their
ownerDocument attribute. To check this, you must
use the isSameNode method, not Python identity
checking (operator is). Besides the attributes and
methods of class Node, an instance
d of Document supplies the following attributes and
methods.
The
d.doctype attribute is
the DocumentType instance corresponding to
d's DTD. This attribute
comes directly from the !DOCTYPE declaration in
d's XML source.
The
d.documentElement
attribute is the Element instance corresponding to
d's root element.
d.getElementById(elementId)
Returns the Element
instance within the document that has the given ID (which element
attributes are IDs is specified by the DTD), or
None if there is no such instance (or the
underlying parser does not supply ID information).
d.getElementsByTagName(tagName)
Returns the list of Element instances within the
document whose tag equals string tagName,
in the same order as in the parsed XML document. May be the empty
list. When tagName is '*',
returns the list of all Element instances within
the document, with any tag.
d.getElementsByTagNameNS(namespaceURI,localName)
Returns the list of Element instances within the
document with the given namespaceURI and
localName, in the order found in the XML
document. May be the empty list. A value of '*'
for namespaceURI,
localName, or both matches all values of
the corresponding field.
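The Document-level lookups can be tried on a small namespaced document. This is a sketch in Python 3 syntax; the sample data and the namespace URI urn:example:meta are made up for the illustration.

```python
import xml.dom.minidom

# Sample data; the namespace URI is hypothetical.
XML_TEXT = """<lib xmlns:m="urn:example:meta">
  <book><title>Dune</title><m:tag>sf</m:tag></book>
  <book><title>Emma</title></book>
</lib>"""

doc = xml.dom.minidom.parseString(XML_TEXT)

print(doc.documentElement.tagName)      # the root element
print(doc.doctype)                      # None: this document has no !DOCTYPE

titles = doc.getElementsByTagName("title")
title_texts = [t.firstChild.nodeValue for t in titles]   # document order
print(title_texts)

tags = doc.getElementsByTagNameNS("urn:example:meta", "tag")
print(len(tags), tags[0].prefix, tags[0].localName)

all_tags = [e.tagName for e in doc.getElementsByTagName("*")]
print(all_tags)
```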
23.3.2.4 Element objects
The
Element class is a subclass of
Node that represents tagged elements. Besides
attributes and methods of Node, an instance
e of Element supplies
the following methods.
e.getAttribute(name)
Returns the value of
e's attribute with the
given name. Returns the empty string
'' if e has no
attribute with the given name.
e.getAttributeNS(namespaceURI,localName)
Returns the value of
e's attribute with the
given namespaceURI and
localName.
e.getAttributeNode(name)
Returns the Attr
instance that is e's
attribute with the given name, or
None if no attribute with that name is among
e's attributes.
e.getAttributeNodeNS(namespaceURI,localName)
Returns the Attr
instance that is e's
attribute with the given namespaceURI and
localName, or None if
no such attribute is among
e's attributes.
e.getElementsByTagName(tagName)
Returns the list of
Element instances within the subtree rooted at
e whose tag equals string
tagName, in the same order as in the XML
document. The list holds only descendants of e:
e itself is not included, even if
e's tag equals
tagName.
getElementsByTagName returns the empty list
when no descendant of e
has a tag equal to tagName. When
tagName is '*',
getElementsByTagName returns the list of all
Element instances that are descendants of e, with any
tag.
e.getElementsByTagNameNS(namespaceURI,localName)
Returns the list of
Element instances that are descendants of
e, with the given
namespaceURI and
localName, in the same order as in the XML
document. A value of '*' for
namespaceURI,
localName, or both matches all values of
the corresponding field. The list holds only descendants of
e, never e itself, and may be empty.
e.hasAttribute(name)
True if and only if
e has an attribute with the given
name. If the underlying parser extracts
the relevant information from the DTD,
hasAttribute is also true for attributes of
e that have a default value, even when
they are not explicitly specified.
e.hasAttributeNS(namespaceURI,localName)
True if and only if e has an attribute
with the given namespaceURI and
localName. Same as method
hasAttribute regarding attributes with default
values from the DTD.
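A brief sketch of the Element methods on a nested sample document (Python 3 syntax; the markup is invented for the illustration). It reflects the behavior of minidom as shipped with current CPython, where the element-level search looks only at descendants.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<div class='outer'><div><p id='x'>hi</p></div></div>")
outer = doc.documentElement

print(outer.getAttribute("class"))
print(repr(outer.getAttribute("title")))   # a missing attribute yields ''
print(outer.hasAttribute("title"))
print(outer.getAttributeNode("title"))     # None when the attribute is absent

node = outer.getAttributeNode("class")     # the Attr node itself
print(node.name, node.value)

# Only the nested <div> is found; the search covers outer's descendants.
inner_divs = outer.getElementsByTagName("div")
print(len(inner_divs))
```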
23.3.3 Parsing XHTML with xml.dom.minidom
The following example uses
xml.dom.minidom to perform the same task as in the
previous example for xml.sax, fetching a page from
the Web with urllib, parsing it, and outputting
the hyperlinks:
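    import xml.dom.minidom, urllib, urlparse

    # Fetch the page and parse it into a DOM tree
    f = urllib.urlopen('http://www.w3.org/MarkUp/')
    doc = xml.dom.minidom.parse(f)

    # Collect all elements with tag 'a', in document order
    anchors = doc.getElementsByTagName('a')
    seen = {}
    for a in anchors:
        value = a.getAttribute('href')
        # Skip elements without an href and links already handled
        if value and value not in seen:
            seen[value] = True
            pieces = urlparse.urlparse(value)
            # Output only http links that point outside www.w3.org
            if pieces[0] == 'http' and pieces[1]!='www.w3.org':
                print urlparse.urlunparse(pieces)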
In this example, we get the list of all elements with tag
'a', and the relevant attribute, if any, for each
of them. We then work in the usual way with the
attribute's value.
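The same filtering logic can be tried without network access. The following is a sketch in Python 3 syntax, where urllib.parse plays the role of urlparse; the inline page, the function name external_links, and its own_host parameter are assumptions made for the illustration.

```python
import xml.dom.minidom
from urllib.parse import urlparse, urlunparse

# A tiny well-formed XHTML page standing in for a fetched one.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<a href="http://www.w3.org/TR/">local</a>
<a href="http://example.com/a">one</a>
<a href="http://example.com/a">duplicate</a>
<a name="top">no href</a>
<a href="ftp://example.org/f">not http</a>
</body></html>"""

def external_links(xhtml_text, own_host):
    """Return each distinct http link that points away from own_host."""
    doc = xml.dom.minidom.parseString(xhtml_text)
    seen = set()
    links = []
    for anchor in doc.getElementsByTagName("a"):
        value = anchor.getAttribute("href")
        if not value or value in seen:
            continue                     # no href, or already handled
        seen.add(value)
        pieces = urlparse(value)
        if pieces.scheme == "http" and pieces.netloc != own_host:
            links.append(urlunparse(pieces))
    return links

found = external_links(PAGE, "www.w3.org")
print(found)
```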
23.3.4 The xml.dom.pulldom Module
The
xml.dom.pulldom module supplies two functions.
parse(file,parser=None)
file is a filename or a file-like object
open for reading, containing an XML document.
parser, if given, is an instance of a SAX
parser class; otherwise parse generates a default
SAX parser by calling xml.sax.make_parser( ).
parse returns a pulldom event
stream instance representing the given XML document.
parseString(string,parser=None)
Like parse, except that
string is the XML document in string form.
xml.dom.pulldom also supplies class
DOMEventStream, an iterator whose items are pairs
(event,node),
where event is a string giving the event
type, and node is an instance of an
appropriate subclass of class Node. The possible
values for event are constant uppercase
strings that are also available as constant attributes of module
xml.dom.pulldom with the same names:
CHARACTERS, COMMENT,
END_DOCUMENT, END_ELEMENT,
IGNORABLE_WHITESPACE,
PROCESSING_INSTRUCTION,
START_DOCUMENT, and
START_ELEMENT.
An instance d of class
DOMEventStream supplies one other important
method.
d.expandNode(node)
node must be the latest instance of
Node so far returned by iterating on
d, i.e., the instance of
Node returned by the latest call to
d.next( ).
expandNode processes that part of the XML document
stream that corresponds to the subtree rooted at
node, so that you can then access
the subtree with the usual minidom approach.
To do this, d consumes the subtree's events itself, so
that after calling expandNode, the next call to
next continues right after the subtree thus
expanded.
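To make the effect of expandNode on the event stream concrete, here is a small sketch in Python 3 syntax (the sample log document and the trace list are invented for the illustration). Events inside an expanded subtree never reach the loop, because expandNode has already consumed them.

```python
import xml.dom.pulldom

XML_TEXT = "<log><entry n='1'><msg>start</msg></entry><entry n='2'/></log>"

events = xml.dom.pulldom.parseString(XML_TEXT)
trace = []
for event, node in events:
    if event == xml.dom.pulldom.START_ELEMENT and node.nodeName == "entry":
        # Build the full subtree for this entry; the stream skips past it.
        events.expandNode(node)
        trace.append(("expanded", node.toxml()))
    elif event in (xml.dom.pulldom.START_ELEMENT,
                   xml.dom.pulldom.END_ELEMENT):
        trace.append((event, node.nodeName))

for item in trace:
    print(item)
```

Note that no event for msg appears in the trace: it was absorbed while the first entry was being expanded.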
23.3.5 Parsing XHTML with xml.dom.pulldom
The following
example uses xml.dom.pulldom to perform the same
task as our previous examples, fetching a page from the Web with
urllib, parsing it, and outputting the hyperlinks:
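    import xml.dom.pulldom, urllib, urlparse

    # Fetch the page and wrap it in a stream of parsing events
    f = urllib.urlopen('http://www.w3.org/MarkUp/')
    doc = xml.dom.pulldom.parse(f)

    seen = {}
    for event, node in doc:
        # React only to the start of each element with tag 'a'
        if event=='START_ELEMENT' and node.nodeName=='a':
            # Build the DOM subtree rooted at this element
            doc.expandNode(node)
            value = node.getAttribute('href')
            if value and value not in seen:
                seen[value] = True
                pieces = urlparse.urlparse(value)
                # Output only http links that point outside www.w3.org
                if pieces[0] == 'http' and pieces[1]!='www.w3.org':
                    print urlparse.urlunparse(pieces)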
In this example, we select only elements with tag
'a'. For each of them we request full expansion,
and then proceed just like in the minidom example
(i.e., we get the relevant attribute, if any, then work in the usual
way with the attribute's value). The expansion is in
fact not necessary in this specific case, since we do not need to
work with the subtree rooted in each element with tag
'a', just with the attributes, and attributes can
be accessed without calling expandNode. Therefore,
this example works just as well if you change the call to
doc.expandNode into a comment. However, I put the
expandNode call in the example to show how this
crucial method of pulldom is normally used in
context.
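For contrast, here is a sketch of a case where the expansion is needed: collecting the text of each link, which lives in the element's children rather than in its attributes. It is written in Python 3 syntax; the inline page and the helper text_of are assumptions made for the illustration.

```python
import xml.dom.pulldom

# A small well-formed page standing in for a fetched one.
PAGE = """<html><body>
<p>See <a href="http://example.com/a">the <b>first</b> site</a> and
<a href="http://example.com/b">another</a>.</p>
</body></html>"""

def text_of(n):
    """Concatenate the text found anywhere in the subtree rooted at n."""
    parts = []
    for child in n.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.nodeValue)
        else:
            parts.append(text_of(child))
    return "".join(parts)

stream = xml.dom.pulldom.parseString(PAGE)
links = []
for event, node in stream:
    if event == xml.dom.pulldom.START_ELEMENT and node.nodeName == "a":
        href = node.getAttribute("href")   # available without expansion
        stream.expandNode(node)            # needed to reach the children
        links.append((href, text_of(node)))

print(links)
```

Without the expandNode call, each node would have no children yet, so text_of would return an empty string for every link.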