22.3 The HTMLParser Module

Module HTMLParser supplies one class, HTMLParser, that you subclass to override and add methods. HTMLParser.HTMLParser is similar to sgmllib.SGMLParser, but is simpler and able to parse XHTML as well. The main differences between HTMLParser and SGMLParser are the following:

HTMLParser does not call back to methods named do_tag, start_tag, and end_tag. To process start tags and end tags, your subclass X of HTMLParser must override methods handle_starttag and/or handle_endtag and check explicitly for the tags it wants to process.
HTMLParser does not keep track of, nor check, tag nesting in any way.
HTMLParser does nothing, by default, to resolve character and entity references. Your subclass X of HTMLParser must override methods handle_charref and/or handle_entityref if it needs to process such references.
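
The first difference can be bridged in a few lines. The following is a minimal sketch, not part of the module itself: the class DispatchingParser and its start_tag/end_tag naming convention are hypothetical, and the fallback import of html.parser is an assumption for readers on Python 3, where the module was renamed.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this section
except ImportError:
    from html.parser import HTMLParser     # assumed fallback: Python 3 name

class DispatchingParser(HTMLParser):
    """Hypothetical helper that re-creates sgmllib-style per-tag dispatch."""
    def handle_starttag(self, tag, attributes):
        # look for a method named start_<tag>; ignore the tag if there is none
        method = getattr(self, 'start_' + tag, None)
        if method is not None:
            method(attributes)
    def handle_endtag(self, tag):
        # look for a method named end_<tag>; ignore the tag if there is none
        method = getattr(self, 'end_' + tag, None)
        if method is not None:
            method()

class TitleGrabber(DispatchingParser):
    def __init__(self):
        DispatchingParser.__init__(self)
        self.in_title = False
        self.title = ''
    def start_title(self, attributes):
        self.in_title = True
    def end_title(self):
        self.in_title = False
    def handle_data(self, data):
        # keep only the text that appears inside <title>...</title>
        if self.in_title:
            self.title += data

p = TitleGrabber()
p.feed('<html><head><title>Hello</title></head><body>ignored</body></html>')
p.close()
```

After the call to p.close, p.title holds the text of the title element. Tags for which the subclass defines no start_ or end_ method are silently skipped.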

The most frequently used methods of an instance h of a subclass X of HTMLParser are as follows.

close

h.close()

Tells the parser that there is no more input data. When X overrides close, X.close must also call HTMLParser.close to ensure that any buffered data gets processed.
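
An overriding close might look like the following sketch. The class name and the finished attribute are illustrative only, and the fallback import of html.parser is an assumption for readers on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this section
except ImportError:
    from html.parser import HTMLParser     # assumed fallback: Python 3 name

class CollectingParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.finished = False
    def handle_data(self, data):
        self.chunks.append(data)
    def close(self):
        # delegate first, so that whatever the parser still holds is processed
        HTMLParser.close(self)
        self.finished = True

p = CollectingParser()
p.feed('trailing text with no tag after it')
p.close()
text = ''.join(p.chunks)
```

Calling the superclass method before doing any work of your own means that your cleanup code sees every callback the parser is going to make.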

feed

h.feed(data)

Passes to the parser a part of the text being parsed. The parser processes some prefix of the text and holds the rest in a buffer until the next call to h.feed or h.close.
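
The buffering is easiest to see when a tag is split across two calls to feed. This is a small sketch with made-up input; the fallback import of html.parser is an assumption for readers on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this section
except ImportError:
    from html.parser import HTMLParser     # assumed fallback: Python 3 name

class TagLogger(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []
    def handle_starttag(self, tag, attributes):
        self.tags.append(tag)

p = TagLogger()
p.feed('<p>one<im')                  # the img tag is cut in the middle
tags_after_first = list(p.tags)      # only the complete <p> has been seen
p.feed('g src="x.png"><b>')          # the rest of the img tag arrives
p.close()
```

After the first call only 'p' has been reported. The incomplete '<im' stays in the buffer and is reported as 'img' once the second call completes it.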

handle_charref

h.handle_charref(ref)

Called to process a character reference '&#ref;'. HTMLParser's implementation of handle_charref does nothing.
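
As a sketch, a subclass can turn each reference into a numeric code point. ref arrives without the leading '&#' and the trailing ';', and a hexadecimal reference keeps its 'x' prefix. The convert_charrefs argument is an assumption that applies to Python 3 only: newer Python 3 releases resolve references themselves unless it is set to False, while the module described here accepts no such argument, hence the fallback.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this section
except ImportError:
    from html.parser import HTMLParser     # assumed fallback: Python 3 name

class CharRefDecoder(HTMLParser):
    def __init__(self):
        try:
            # Python 3 only: ask the parser not to resolve references itself
            HTMLParser.__init__(self, convert_charrefs=False)
        except TypeError:
            # the module described in this section takes no such argument
            HTMLParser.__init__(self)
        self.codes = []
    def handle_charref(self, ref):
        if ref[:1] in ('x', 'X'):
            self.codes.append(int(ref[1:], 16))   # hexadecimal form
        else:
            self.codes.append(int(ref))           # decimal form

p = CharRefDecoder()
p.feed('<p>&#65; and &#x42;</p>')
p.close()
```

Here p.codes ends up as [65, 66], the code points of 'A' and 'B'.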

handle_comment

h.handle_comment(comment)

Called to handle comments. comment is the string within '<!--...-->', without the delimiters. HTMLParser's implementation of handle_comment does nothing.
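
A minimal sketch that collects comments follows. The sample markup is made up, and the fallback import of html.parser is an assumption for readers on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this section
except ImportError:
    from html.parser import HTMLParser     # assumed fallback: Python 3 name

class CommentCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []
    def handle_comment(self, comment):
        # comment excludes the delimiters but keeps any inner whitespace
        self.comments.append(comment.strip())

p = CommentCollector()
p.feed('<p>text</p><!-- generated by hand --><!--second-->')
p.close()
```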

handle_data

h.handle_data(data)

Called to process each run of plain text that appears outside markup. A single stretch of text may be delivered in more than one call. Your subclass X almost always overrides handle_data. HTMLParser's implementation of handle_data does nothing.
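
A typical override accumulates the strings and joins them later, since text can arrive in several pieces. The following sketch assumes that whitespace normalization is wanted; the text method is illustrative and not part of HTMLParser. The fallback import of html.parser is an assumption for readers on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this section
except ImportError:
    from html.parser import HTMLParser     # assumed fallback: Python 3 name

class TextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []
    def handle_data(self, data):
        # just accumulate; the text may arrive in several calls
        self.pieces.append(data)
    def text(self):
        # join all pieces, then collapse runs of whitespace to one space
        return ' '.join(''.join(self.pieces).split())

p = TextExtractor()
p.feed('<h1>Title</h1>\n<p>Some <b>bold</b> text.</p>')
p.close()
result = p.text()
```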

handle_endtag

h.handle_endtag(tag)

Called to handle end tags, such as '</tag>'. tag is the tag string, lowercased. HTMLParser's implementation of handle_endtag does nothing.
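
Since HTMLParser does not track nesting, a subclass that cares about it can pair handle_endtag with handle_starttag and keep its own stack. This is a deliberately naive sketch: the class and its attributes are hypothetical, and elements that never take an end tag, such as br, would need special handling that is omitted here. The fallback import of html.parser is an assumption for readers on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this section
except ImportError:
    from html.parser import HTMLParser     # assumed fallback: Python 3 name

class NestingChecker(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.stack = []       # currently open tags, innermost last
        self.problems = []    # end tags that did not match the innermost open tag
    def handle_starttag(self, tag, attributes):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(tag)

p = NestingChecker()
p.feed('<div><b>bold</i></b></div>')
p.close()
```

The stray '</i>' is recorded in p.problems, while the properly matched b and div tags leave p.stack empty.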

handle_entityref

h.handle_entityref(ref)

Called to process an entity reference '&ref;'. HTMLParser's implementation of handle_entityref does nothing.
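
One way to resolve entity references is a lookup in the standard table of entity names. In this sketch, ref arrives without the leading '&' and the trailing ';'. The convert_charrefs argument is an assumption that applies to Python 3 only, where newer releases resolve references themselves unless it is set to False; the Python 3 import names are likewise assumed fallbacks.

```python
try:
    from HTMLParser import HTMLParser            # module name used in this section
    from htmlentitydefs import name2codepoint
except ImportError:
    from html.parser import HTMLParser           # assumed fallback: Python 3 names
    from html.entities import name2codepoint

class EntityCollector(HTMLParser):
    def __init__(self):
        try:
            # Python 3 only: ask the parser not to resolve references itself
            HTMLParser.__init__(self, convert_charrefs=False)
        except TypeError:
            # the module described in this section takes no such argument
            HTMLParser.__init__(self)
        self.resolved = []
    def handle_entityref(self, ref):
        # unknown entity names map to None rather than raising
        self.resolved.append((ref, name2codepoint.get(ref)))

p = EntityCollector()
p.feed('<p>fish &amp; chips &copy; today</p>')
p.close()
```

Here p.resolved becomes [('amp', 38), ('copy', 169)].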

handle_starttag

h.handle_starttag(tag, attributes)

Called to handle start tags. tag is the tag string, lowercased. attributes is a list of pairs (name,value), where name is each attribute's name, lowercased, and value is the attribute's value, processed to resolve entity references and character references and to remove surrounding quotes. HTMLParser's implementation of handle_starttag does nothing.
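
Because attributes is a list of pairs, passing it to dict gives convenient lookup by name. The following sketch collects image sources; the class name and sample markup are made up, and the fallback import of html.parser is an assumption for readers on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this section
except ImportError:
    from html.parser import HTMLParser     # assumed fallback: Python 3 name

class ImageCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.images = []
    def handle_starttag(self, tag, attributes):
        if tag != 'img': return            # tag is already lowercased
        attrs = dict(attributes)           # names are already lowercased
        self.images.append((attrs.get('src'), attrs.get('alt')))

p = ImageCollector()
p.feed('<IMG SRC="a.png" ALT="Tom &amp; Jerry"><img src="b.png">')
p.close()
```

The first pair is ('a.png', 'Tom & Jerry'): the tag and attribute names were lowercased, the quotes removed, and the entity reference in the value resolved. The second image has no alt attribute, so its pair is ('b.png', None).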

The following example uses HTMLParser to perform the same task as our previous examples: fetching a page from the Web with urllib, parsing it, and outputting the hyperlinks.

import HTMLParser, urllib, urlparse

class LinksParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.seen = {}    # hrefs already examined
    def handle_starttag(self, tag, attributes):
        # no per-tag dispatching: check the tag explicitly
        if tag != 'a': return
        for name, value in attributes:
            if name == 'href' and value not in self.seen:
                self.seen[value] = True
                pieces = urlparse.urlparse(value)
                # output only absolute http links
                if pieces[0] != 'http': return
                print urlparse.urlunparse(pieces)
                return

p = LinksParser()
f = urllib.urlopen('http://www.python.org/index.html')
BUFSIZE = 8192
# feed the page to the parser one buffer at a time
while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)

p.close()

This example is similar to the one for sgmllib. However, since the HTMLParser.HTMLParser superclass performs no per-tag dispatching to methods, class LinksParser needs to override method handle_starttag and check that the tag is indeed 'a'.