23.3 Parsing XML with DOM

SAX parsing does not build any structure in memory to represent the XML document. This makes SAX fast and highly scalable, as your application builds exactly as little or as much in-memory structure as needed for its specific tasks. However, for particularly complicated processing tasks involving reasonably small XML documents, you may prefer to let the library build in-memory structures that represent the whole XML document, and then traverse those structures. The XML standards describe the DOM (Document Object Model) for XML. A DOM object represents an XML document as a tree whose root is the document object, while other nodes correspond to elements, text contents, element attributes, and so on.

The Python standard library supplies a minimal implementation of the XML DOM standard, xml.dom.minidom. minidom builds the whole document tree in memory, with the typical pros and cons of the DOM approach to parsing. The Python standard library also supplies a different DOM-like approach in module xml.dom.pulldom. pulldom occupies an interesting middle ground between SAX and DOM: it presents the stream of parsing events as a Python iterator object, so that you do not code callbacks, but rather loop over the events and examine each event to see if it's of interest. When you do find an event of interest to your application, you can ask pulldom to build the DOM subtree rooted in that event's node by calling method expandNode, and then work with that subtree as you would in minidom. Paul Prescod, pulldom's author and an XML and Python expert, describes the net result as "80% of the performance of SAX, 80% of the convenience of DOM." Other DOM parsers are part of the PyXML and 4Suite extension packages, mentioned at the start of this chapter.

23.3.1 The xml.dom Package

The xml.dom package supplies exception class DOMException and subclasses of it to support fine-grained exception handling. xml.dom also supplies a class Node, typically used as a base class for all nodes by DOM implementations. Class Node only supplies constant attributes giving the codes for node types, such as ELEMENT_NODE for elements, ATTRIBUTE_NODE for attributes, and so on. xml.dom also supplies constant module attributes with the URIs of important namespaces: XML_NAMESPACE, XMLNS_NAMESPACE, XHTML_NAMESPACE, and EMPTY_NAMESPACE.
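As a minimal sketch of what the package offers (written for Python 3; the error message string is just an illustrative value), the following snippet reads a few node-type codes and namespace URIs, and catches a DOMException subclass through the base class:

```python
import xml.dom
from xml.dom import Node

# Node type codes are small integers exposed as attributes of class Node.
print(Node.ELEMENT_NODE, Node.ATTRIBUTE_NODE, Node.TEXT_NODE)

# Namespace URIs are plain module-level constants.
print(xml.dom.XML_NAMESPACE)
print(xml.dom.XHTML_NAMESPACE)

# Each DOMException subclass carries a numeric code, so one except clause
# on the base class can still tell the specific errors apart.
try:
    raise xml.dom.NotFoundErr("no such node")  # illustrative message
except xml.dom.DOMException as exc:
    error_code = exc.code
```

Catching DOMException and then inspecting code is one option; catching the specific subclass (here NotFoundErr) is equally valid when you care about only one kind of failure.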

23.3.2 The xml.dom.minidom Module

The xml.dom.minidom module supplies two functions.

parse

parse(file,parser=None)

file is a filename or a file-like object open for reading, containing an XML document. parser, if given, is an instance of a SAX parser class; otherwise, parse generates a default SAX parser by calling xml.sax.make_parser(). parse returns a minidom document object instance representing the given XML document.

parseString

parseString(string,parser=None)

Like parse, except that string is the XML document in string form.
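The two functions can be sketched side by side as follows. This is a Python 3 sketch; the tiny catalog document is made up for illustration, and an in-memory io.BytesIO stands in for a real file open for reading:

```python
import io
import xml.dom.minidom

# A made-up document used only for this illustration.
XML_TEXT = "<catalog><book id='b1'>Dune</book></catalog>"

# parseString takes the document text itself.
doc_from_string = xml.dom.minidom.parseString(XML_TEXT)

# parse takes a filename or a file-like object open for reading.
file_like = io.BytesIO(XML_TEXT.encode("utf-8"))
doc_from_file = xml.dom.minidom.parse(file_like)

# Both calls return a document object; its root element is the same here.
root_from_string = doc_from_string.documentElement.nodeName
root_from_file = doc_from_file.documentElement.nodeName
```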

xml.dom.minidom also supplies many classes as specified by the XML DOM standard. Almost all of these classes subclass Node. Class Node supplies the methods and attributes that all kinds of nodes have in common. A notable class of module xml.dom.minidom that is not a subclass of Node is AttributeList, identified in the DOM standard as NamedNodeMap, which is a mapping that collects the attributes of a node of class Element.

For methods and attributes related to changing and creating XML documents, see Section 23.4 later in this chapter. Here, I present the classes, methods, and attributes that you use most often when traversing a DOM tree without changes, normally after the tree has been built by parsing an XML document. For concreteness and simplicity, I mention Python classes. However, the DOM specifications deal strictly with abstract interfaces, never with concrete classes. Your code must never deal with the class objects directly, only with instances of those classes. Do not type-test nodes (for example, don't use isinstance on them) and do not instantiate node classes directly (rather, use the factory methods covered later in Section 23.4). This is good Python practice in general, but it's particularly important here.

23.3.2.1 Node objects

Each node n in the DOM tree is an instance of some subclass of Node; therefore n supplies all attributes and methods that Node supplies, with appropriate overriding implementations if needed. The most frequently used methods and attributes are as follows.

attributes

The n.attributes attribute is either None or an AttributeList instance with all attributes of n.

childNodes

The n.childNodes attribute is a list of all nodes that are children of n, possibly an empty list.

firstChild

The n.firstChild attribute is None when n.childNodes is empty, otherwise like n.childNodes[0].

hasChildNodes

n.hasChildNodes()

Like len(n.childNodes)!=0, but possibly faster.

isSameNode

n.isSameNode(other)

True when n and other refer to the same DOM node, otherwise False. Do not use the normal Python idiom n is other: a Python DOM implementation is free to generate multiple Node instances that refer to the same DOM node. Therefore, to check the identity of DOM node references, always and exclusively use method isSameNode.

lastChild

The n.lastChild attribute is None when n.childNodes is empty, otherwise like n.childNodes[-1].

localName

The n.localName attribute is the local part of n's qualified name (relevant when namespaces are involved).

namespaceURI

The n.namespaceURI attribute is None when n's qualified name has no namespace part, otherwise the namespace's URI.

nextSibling

The n.nextSibling attribute is None when n is the last child of n's parent, otherwise the next child of n's parent.

nodeName

The n.nodeName attribute is n's name string. The string is a node-specific name when that makes sense for n's node type (e.g., the tag name when n is an Element), otherwise a string starting with '#'.

nodeType

The n.nodeType attribute is n's type code, an integer that is one of the constant attributes of class Node.

nodeValue

The n.nodeValue attribute is None when n has no value (e.g., when n is an Element), otherwise n's value (e.g., the text content when n is an instance of class Text).

normalize

n.normalize()

Normalizes the entire subtree rooted at n, merging adjacent Text nodes. Parsing may separate ranges of text in the XML document into arbitrary chunks; normalize ensures that text ranges remain separate only when there is markup between them.

ownerDocument

The n.ownerDocument attribute is the Document instance that contains n.

parentNode

The n.parentNode attribute is n's parent node in the DOM tree, or None for attribute nodes and nodes not in the tree.

prefix

The n.prefix attribute is None when n's qualified name has no namespace prefix, otherwise the namespace prefix. Note that a name may have a namespace even if it has no namespace prefix.

previousSibling

The n.previousSibling attribute is None when n is the first child of n's parent, otherwise the previous child of n's parent.
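The following Python 3 sketch exercises several of these attributes and methods on a small made-up document. It selects behavior by nodeType rather than by type-testing, walks the children both through childNodes and through the sibling links, and shows the effect of normalize. The call to createTextNode (a factory method of the kind covered in Section 23.4) is used here only to produce two adjacent Text nodes:

```python
import xml.dom.minidom
from xml.dom import Node

# A made-up document used only for this illustration.
doc = xml.dom.minidom.parseString(
    "<list><item>one</item><!-- note --><item>two</item></list>")
root = doc.documentElement

def describe(node):
    """Return a short label chosen by nodeType, not by the node's class."""
    if node.nodeType == Node.ELEMENT_NODE:
        return "element " + node.nodeName
    if node.nodeType == Node.TEXT_NODE:
        return "text " + repr(node.nodeValue)
    return node.nodeName  # e.g. '#comment'

labels = [describe(child) for child in root.childNodes]

# Walk the same children through the sibling links instead.
via_siblings = []
node = root.firstChild
while node is not None:
    via_siblings.append(node.nodeName)
    node = node.nextSibling

# Append a second Text node next to the existing one, then normalize.
first_item = root.firstChild
first_item.appendChild(doc.createTextNode(" more"))
count_before = len(first_item.childNodes)
first_item.normalize()
count_after = len(first_item.childNodes)
merged_text = first_item.firstChild.nodeValue
```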

23.3.2.2 Attr objects

The Attr class is a subclass of Node that represents an attribute of an Element. Besides attributes and methods of class Node, an instance a of Attr supplies the following attributes.

ownerElement

The a.ownerElement attribute is the Element instance of which a is an attribute.

specified

The a.specified attribute is true if a was explicitly specified in the document, false if obtained by default.
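A short Python 3 sketch of working with Attr instances follows. The link element and its URL are made up for illustration. The sketch reaches the attributes both through the attributes mapping of the element and through an Attr node:

```python
import xml.dom.minidom

# A made-up element used only for this illustration.
doc = xml.dom.minidom.parseString(
    '<a href="http://example.com/" title="Example">text</a>')
link = doc.documentElement

# n.attributes is a mapping from attribute names to Attr instances.
attr_names = sorted(link.attributes.keys())

# An Attr is itself a node, with a name and a value.
href = link.attributes["href"]
href_name = href.nodeName
href_value = href.nodeValue

# ownerElement leads back to the element; compare nodes with isSameNode.
owner_is_link = href.ownerElement.isSameNode(link)

# Nodes that cannot have attributes, such as Text nodes, have None here.
text_attributes = link.firstChild.attributes
```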

23.3.2.3 Document objects

The Document class is a subclass of Node whose instances are returned by the parse and parseString functions of module xml.dom.minidom. All nodes in the document refer to the same Document node as their ownerDocument attribute. To check this, you must use the isSameNode method, not Python identity checking (operator is). Besides the attributes and methods of class Node, an instance d of Document supplies the following attributes and methods.

doctype

The d.doctype attribute is the DocumentType instance corresponding to d's DTD. This attribute comes directly from the !DOCTYPE declaration in d's XML source.

documentElement

The d.documentElement attribute is the Element instance corresponding to d's root element.

getElementById

d.getElementById(elementId)

Returns the Element instance within the document that has the given ID (what element attributes are IDs is specified by the DTD), or None if there is no such instance (or the underlying parser does not supply ID information).

getElementsByTagName

d.getElementsByTagName(tagName)

Returns the list of Element instances within the document whose tag equals string tagName, in the same order as in the parsed XML document. May be the empty list. When tagName is '*', returns the list of all Element instances within the document, with any tag.

getElementsByTagNameNS

d.getElementsByTagNameNS(namespaceURI,localName)

Returns the list of Element instances within the document with the given namespaceURI and localName, in the order found in the XML document. May be the empty list. A value of '*' for namespaceURI, localName, or both matches all values of the corresponding field.
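The Document attributes and methods above can be sketched as follows in Python 3. The catalog document and the namespace URI urn:example:meta are hypothetical, chosen only so that one element lives in a namespace:

```python
import xml.dom.minidom

# A made-up document; urn:example:meta is a hypothetical namespace URI.
XML_TEXT = """<!DOCTYPE catalog>
<catalog xmlns:m="urn:example:meta">
  <book><title>Dune</title></book>
  <book><title>Emma</title><m:title>metadata</m:title></book>
</catalog>"""

doc = xml.dom.minidom.parseString(XML_TEXT)

doctype_name = doc.doctype.name         # from the !DOCTYPE declaration
root_tag = doc.documentElement.nodeName

# Matching is on the tag as written, so 'm:title' is not a 'title'.
titles = [t.firstChild.nodeValue for t in doc.getElementsByTagName("title")]

# Matching by namespace URI and local name finds the prefixed element.
meta = doc.getElementsByTagNameNS("urn:example:meta", "title")
meta_tags = [m.nodeName for m in meta]

# '*' matches every element in the document.
element_count = len(doc.getElementsByTagName("*"))

# Every node refers back to the same Document node.
same_owner = meta[0].ownerDocument.isSameNode(doc)
```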

23.3.2.4 Element objects

The Element class is a subclass of Node that represents tagged elements. Besides attributes and methods of Node, an instance e of Element supplies the following methods.

getAttribute

e.getAttribute(name)

Returns the value of e's attribute with the given name. Returns the empty string '' if e has no attribute with the given name.

getAttributeNS

e.getAttributeNS(namespaceURI,localName)

Returns the value of e's attribute with the given namespaceURI and localName.

getAttributeNode

e.getAttributeNode(name)

Returns the Attr instance that is e's attribute with the given name, or None if no attribute with that name is among e's attributes.

getAttributeNodeNS

e.getAttributeNodeNS(namespaceURI,localName)

Returns the Attr instance that is e's attribute with the given namespaceURI and localName, or None if no such attribute is among e's attributes.

getElementsByTagName

e.getElementsByTagName(tagName)

Returns the list of Element instances within the subtree rooted at e whose tag equals string tagName, in the same order as in the XML document. In minidom, the search covers only the descendants of e, so e itself is not in the list even when e's tag equals tagName. getElementsByTagName returns the empty list when no descendant of e has a tag equal to tagName. When tagName is '*', getElementsByTagName returns the list of all Element instances among e's descendants, with any tag.

getElementsByTagNameNS

e.getElementsByTagNameNS(namespaceURI,localName)

Returns the list of Element instances within the subtree rooted at e, with the given namespaceURI and localName, in the same order as in the XML document. A value of '*' for namespaceURI, localName, or both matches all values of the corresponding field. As for method getElementsByTagName, the list may be empty, and in minidom it holds only descendants of e, not e itself.

hasAttribute

e.hasAttribute(name)

True if and only if e has an attribute with the given name. If the underlying parser extracts the relevant information from the DTD, hasAttribute is also true for attributes of e that have a default value, even when they are not explicitly specified.

hasAttributeNS

e.hasAttributeNS(namespaceURI,localName)

True if and only if e has an attribute with the given namespaceURI and localName. Same as method hasAttribute regarding attributes with default values from the DTD.
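Here is a brief Python 3 sketch of the attribute-related Element methods. The list markup and the namespace URI urn:example:x are hypothetical values chosen for illustration:

```python
import xml.dom.minidom

# A made-up document; urn:example:x is a hypothetical namespace URI.
doc = xml.dom.minidom.parseString(
    '<ul xmlns:x="urn:example:x">'
    '<li class="a" x:rank="1">one</li>'
    '<li>two</li>'
    '</ul>')
ul = doc.documentElement
first, second = ul.getElementsByTagName("li")

# getAttribute returns '' rather than raising when the attribute is absent,
# so use hasAttribute to tell "absent" apart from "present but empty".
first_class = first.getAttribute("class")
second_class = second.getAttribute("class")
second_has_class = second.hasAttribute("class")

# The NS variants identify an attribute by namespace URI and local name.
rank = first.getAttributeNS("urn:example:x", "rank")
has_rank = first.hasAttributeNS("urn:example:x", "rank")

# getAttributeNode returns the Attr instance itself, or None.
class_node = first.getAttributeNode("class")
missing_node = second.getAttributeNode("class")
```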

23.3.3 Parsing XHTML with xml.dom.minidom

The following example uses xml.dom.minidom to perform the same task as in the previous example for xml.sax, fetching a page from the Web with urllib, parsing it, and outputting the hyperlinks:

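import xml.dom.minidom
import urllib.request
import urllib.parse

# Python 3 spelling of the modules: urllib.request and urllib.parse
f = urllib.request.urlopen('http://www.w3.org/MarkUp/')
doc = xml.dom.minidom.parse(f)
# 'as' is a reserved word in Python, so use another name for the list
anchors = doc.getElementsByTagName('a')
seen = set()
for a in anchors:
    value = a.getAttribute('href')
    if value and value not in seen:
        seen.add(value)
        pieces = urllib.parse.urlparse(value)
        # output only absolute http links that lead away from www.w3.org
        if pieces[0] == 'http' and pieces[1] != 'www.w3.org':
            print(urllib.parse.urlunparse(pieces))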

In this example, we get the list of all elements with tag 'a', and the relevant attribute, if any, for each of them. We then work in the usual way with the attribute's value.

23.3.4 The xml.dom.pulldom Module

The xml.dom.pulldom module supplies two functions.

parse

parse(file,parser=None)

file is a filename or a file-like object open for reading, containing an XML document. parser, if given, is an instance of a SAX parser class; otherwise, parse generates a default SAX parser by calling xml.sax.make_parser(). parse returns a pulldom event stream instance representing the given XML document.

parseString

parseString(string,parser=None)

Like parse, except that string is the XML document in string form.

xml.dom.pulldom also supplies class DOMEventStream, an iterator whose items are pairs (event,node), where event is a string giving the event type, and node is an instance of an appropriate subclass of class Node. The possible values for event are constant uppercase strings that are also available as constant attributes of module xml.dom.pulldom with the same names: CHARACTERS, COMMENT, END_DOCUMENT, END_ELEMENT, IGNORABLE_WHITESPACE, PROCESSING_INSTRUCTION, START_DOCUMENT, and START_ELEMENT.
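As a minimal Python 3 sketch of looping over the event stream (the two-element document is made up for illustration), the following code collects the (event,node) pairs and then picks out the start tags and the character data:

```python
import xml.dom.pulldom as pulldom

# A made-up document used only for this illustration.
events = pulldom.parseString("<doc><p>hi</p></doc>")

# Each item of the stream is a pair (event, node).
pairs = [(event, node) for event, node in events]

# Compare events against the module constants rather than string literals.
start_tags = [node.nodeName for event, node in pairs
              if event == pulldom.START_ELEMENT]
text = "".join(node.nodeValue for event, node in pairs
               if event == pulldom.CHARACTERS)
```

Note that, without a call to expandNode, the nodes in the pairs are not linked into a tree: each one stands alone, although an element node's attributes are already available.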

An instance d of class DOMEventStream supplies one other important method.

expandNode

d.expandNode(node)

node must be the latest instance of Node so far returned by iterating on d, i.e., the node from the most recent (event,node) pair that d produced. expandNode processes the part of the XML document stream that corresponds to the subtree rooted at node, so that you can then access the subtree with the usual minidom approach. To do so, expandNode consumes the subtree's events from d itself; after the call, iteration on d resumes right after the expanded subtree.
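A case where the expansion is actually needed can be sketched as follows in Python 3. The orders document is made up for illustration. Each order element is expanded as soon as its start event arrives, and the resulting subtree is then queried with minidom methods:

```python
import xml.dom.pulldom as pulldom

# A made-up document used only for this illustration.
XML_TEXT = ("<orders>"
            "<order id='1'><item>tea</item><item>milk</item></order>"
            "<order id='2'><item>rice</item></order>"
            "</orders>")

events = pulldom.parseString(XML_TEXT)
orders = {}
for event, node in events:
    if event == pulldom.START_ELEMENT and node.nodeName == "order":
        # Build the subtree under this <order>; the loop then resumes
        # with the first event after the matching end tag.
        events.expandNode(node)
        items = [item.firstChild.nodeValue
                 for item in node.getElementsByTagName("item")]
        orders[node.getAttribute("id")] = items
```

Only one order subtree at a time needs to be held in memory with this approach, which is the main attraction of pulldom for large documents made of many similar records.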

23.3.5 Parsing XHTML with xml.dom.pulldom

The following example uses xml.dom.pulldom to perform the same task as our previous examples, fetching a page from the Web with urllib, parsing it, and outputting the hyperlinks:

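import xml.dom.pulldom
import urllib.request
import urllib.parse

# Python 3 spelling of the modules: urllib.request and urllib.parse
f = urllib.request.urlopen('http://www.w3.org/MarkUp/')
doc = xml.dom.pulldom.parse(f)
seen = set()
for event, node in doc:
    # use the module constant rather than a string literal for the event
    if event == xml.dom.pulldom.START_ELEMENT and node.nodeName == 'a':
        doc.expandNode(node)
        value = node.getAttribute('href')
        if value and value not in seen:
            seen.add(value)
            pieces = urllib.parse.urlparse(value)
            # output only absolute http links that lead away from www.w3.org
            if pieces[0] == 'http' and pieces[1] != 'www.w3.org':
                print(urllib.parse.urlunparse(pieces))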

In this example, we select only elements with tag 'a'. For each of them we request full expansion, and then proceed just like in the minidom example (i.e., we get the relevant attribute, if any, then work in the usual way with the attribute's value). The expansion is in fact not necessary in this specific case, since we do not need to work with the subtree rooted in each element with tag 'a', just with the attributes, and attributes can be accessed without calling expandNode. Therefore, this example works just as well if you change the call to doc.expandNode into a comment. However, I put the expandNode call in the example to show how this crucial method of pulldom is normally used in context.