22.3 The HTMLParser Module
Module HTMLParser
supplies one class, HTMLParser, that you subclass
to override and add methods. HTMLParser.HTMLParser
is similar to sgmllib.SGMLParser, but is simpler
and able to parse XHTML as well. The main differences between
HTMLParser and SGMLParser are
the following:
HTMLParser does not call back to methods named do_tag, start_tag, and
end_tag. To process start tags and end tags, your subclass X of
HTMLParser must override methods handle_starttag and/or handle_endtag
and check explicitly for the tags it wants to process.
HTMLParser does not keep track of, nor check, tag nesting in any way.
HTMLParser does nothing, by default, to resolve character and entity
references. Your subclass X of HTMLParser must override methods
handle_charref and/or handle_entityref if it needs to perform
processing of such references.
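The first two differences can be illustrated with a minimal sketch. It assumes the class can be imported either from the HTMLParser module described here or, on Python 3, from html.parser, where the same class lives under a new module name. TagLogger, wanted, and events are hypothetical names used only for illustration.

```python
try:
    from HTMLParser import HTMLParser     # module name used in this section
except ImportError:
    from html.parser import HTMLParser    # the same class on Python 3

class TagLogger(HTMLParser):
    """Hypothetical subclass that records the tags it cares about."""
    wanted = ('b', 'i')

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attributes):
        # No start_b or start_i method is ever looked up, so the
        # handler checks the tag name explicitly.
        if tag in self.wanted:
            self.events.append(('start', tag))

    def handle_endtag(self, tag):
        if tag in self.wanted:
            self.events.append(('end', tag))

h = TagLogger()
# <b> and <i> overlap improperly here; the parser does not check nesting
# and simply reports each tag in document order.
h.feed('<p><b>bold <i>both</b> italic</i></p>')
h.close()
```

After the call to close, h.events holds the start and end events for b and i in the order in which the tags appear in the text, with no complaint about the overlap.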
The most frequently used methods of an instance
h of a subclass
X of HTMLParser are as
follows.
h.close()
Tells the parser that there is no more input data. When X overrides
close, h.close must also call HTMLParser.close to ensure that buffered
data gets processed.
h.feed(data)
Passes to the parser a part of the text being parsed. The parser
processes some prefix of the text and holds the rest in a buffer
until the next call to h.feed or h.close.
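The buffering performed by feed and close can be sketched as follows. The example assumes the class is importable from HTMLParser or, on Python 3, from html.parser. StartTagCollector, tags, closed, and after_first are hypothetical names.

```python
try:
    from HTMLParser import HTMLParser     # module name used in this section
except ImportError:
    from html.parser import HTMLParser    # the same class on Python 3

class StartTagCollector(HTMLParser):
    """Hypothetical subclass that remembers every start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []
        self.closed = False

    def handle_starttag(self, tag, attributes):
        self.tags.append(tag)

    def close(self):
        # Delegate to the base class first so buffered text gets processed.
        HTMLParser.close(self)
        self.closed = True

h = StartTagCollector()
h.feed('<p>one <em cla')       # the <em ...> tag is incomplete: it stays buffered
after_first = list(h.tags)     # only 'p' has been reported so far
h.feed('ss="x">two</em></p>')  # the rest of the tag arrives in the next chunk
h.close()
```

Because the parser holds back whatever it cannot yet process, the chunks passed to feed may end anywhere, including in the middle of a tag.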
h.handle_charref(ref)
Called to process a character reference '&#ref;'. HTMLParser's
implementation of handle_charref does nothing.
h.handle_comment(comment)
Called to handle comments. comment is the
string within '<!--...-->', without the
delimiters. HTMLParser's
implementation of handle_comment does nothing.
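As a small illustration of handle_comment, the sketch below collects comment strings. It assumes the class is importable from HTMLParser or, on Python 3, from html.parser. CommentCollector and comments are hypothetical names.

```python
try:
    from HTMLParser import HTMLParser     # module name used in this section
except ImportError:
    from html.parser import HTMLParser    # the same class on Python 3

class CommentCollector(HTMLParser):
    """Hypothetical subclass that keeps the text of each comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, comment):
        # comment arrives without the '<!--' and '-->' delimiters
        self.comments.append(comment)

h = CommentCollector()
h.feed('<p>text</p><!-- a note --><p>more</p>')
h.close()
```

Whitespace inside the delimiters is preserved, so the string collected here still carries its leading and trailing space.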
h.handle_data(data)
Called to process each string data of arbitrary text found between
tags. Your subclass X almost always overrides handle_data.
HTMLParser's implementation of handle_data does nothing.
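A typical override of handle_data accumulates the text of a document. The sketch below assumes the class is importable from HTMLParser or, on Python 3, from html.parser. TextExtractor, chunks, and text are hypothetical names.

```python
try:
    from HTMLParser import HTMLParser     # module name used in this section
except ImportError:
    from html.parser import HTMLParser    # the same class on Python 3

class TextExtractor(HTMLParser):
    """Hypothetical subclass that gathers all text outside of tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # A single run of text may arrive in more than one call,
        # so the pieces are collected and joined only on demand.
        self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks)

h = TextExtractor()
h.feed('<h1>Title</h1><p>Some <b>bold</b> text.</p>')
h.close()
```

Joining the pieces instead of relying on one call per text run keeps the result independent of how the input happened to be split across calls to feed.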
h.handle_endtag(tag)
Called to handle end tags. tag is the tag string, lowercased.
HTMLParser's implementation of handle_endtag does nothing.
h.handle_entityref(ref)
Called to process an entity reference '&ref;'. HTMLParser's
implementation of handle_entityref does nothing.
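The two reference handlers can be observed with a sketch that records what the parser reports. It assumes the class is importable from HTMLParser or, on Python 3, from html.parser. It also assumes that on recent Python 3 versions the parser converts references in text by itself unless its convert_charrefs setting is false; the module described here has no such setting, and assigning the attribute there is harmless. RefLogger and events are hypothetical names.

```python
try:
    from HTMLParser import HTMLParser     # module name used in this section
except ImportError:
    from html.parser import HTMLParser    # the same class on Python 3

class RefLogger(HTMLParser):
    """Hypothetical subclass that logs text and references as they arrive."""

    def __init__(self):
        HTMLParser.__init__(self)
        # Assumption: needed only on Python 3, where references in text are
        # otherwise converted before handle_data and never reach the handlers.
        self.convert_charrefs = False
        self.events = []

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_charref(self, ref):
        # ref is the part between '&#' and ';'
        self.events.append(('charref', ref))

    def handle_entityref(self, ref):
        # ref is the part between '&' and ';'
        self.events.append(('entityref', ref))

h = RefLogger()
h.feed('<p>caf&#233; &amp; bar</p>')
h.close()
```

The handlers receive only the reference names, here '233' and 'amp'. Turning them into characters is left to the subclass.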
h.handle_starttag(tag, attributes)
Called to handle start tags. tag is the tag
string, lowercased. attributes is a list
of pairs
(name,value),
where name is each
attribute's name, lowercased, and
value is the value, processed to resolve
entity references and character references and to remove surrounding
quotes. HTMLParser's
implementation of handle_starttag does nothing.
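The shape of the arguments passed to handle_starttag can be seen in a short sketch. It assumes the class is importable from HTMLParser or, on Python 3, from html.parser. AttrCollector and seen are hypothetical names.

```python
try:
    from HTMLParser import HTMLParser     # module name used in this section
except ImportError:
    from html.parser import HTMLParser    # the same class on Python 3

class AttrCollector(HTMLParser):
    """Hypothetical subclass that stores each start tag with its attributes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attributes):
        # attributes is a list of (name, value) pairs
        self.seen.append((tag, list(attributes)))

h = AttrCollector()
# Uppercase tag and attribute names, mixed quote styles, and an entity
# reference inside an attribute value
h.feed('<A HREF="/a/b.html" TITLE=\'Tom &amp; Jerry\'>link</A>')
h.close()
```

The tag and the attribute names arrive lowercased, the quotes are removed, and the entity reference in the TITLE value is already resolved to '&'.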
The following example uses HTMLParser to perform
the same task as our previous examples: fetching a page from the Web
with urllib, parsing it, and outputting the
hyperlinks.
import HTMLParser, urllib, urlparse

class LinksParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.seen = {}          # hrefs already encountered
    def handle_starttag(self, tag, attributes):
        # no per-tag dispatching: check for the tag explicitly
        if tag != 'a': return
        for name, value in attributes:
            if name == 'href' and value not in self.seen:
                self.seen[value] = True
                pieces = urlparse.urlparse(value)
                if pieces[0] != 'http': return
                print urlparse.urlunparse(pieces)
                return

p = LinksParser()
f = urllib.urlopen('http://www.python.org/index.html')
BUFSIZE = 8192
# feed the page to the parser one block at a time
while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)
p.close()
This example is similar to the one for sgmllib.
However, since the HTMLParser.HTMLParser
superclass performs no per-tag dispatching to methods, class
LinksParser needs to override method
handle_starttag and check that the
tag is indeed
'a'.
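If per-tag methods in the style of sgmllib are preferred, the dispatching can be layered on top of handle_starttag. The following is a minimal sketch and not something the module provides: DispatchingParser, LinkCollector, links, and the start_ method-name prefix are all hypothetical. It also assumes the class is importable from HTMLParser or, on Python 3, from html.parser. To stay self-contained, it parses a literal string instead of fetching a page.

```python
try:
    from HTMLParser import HTMLParser     # module name used in this section
except ImportError:
    from html.parser import HTMLParser    # the same class on Python 3

class DispatchingParser(HTMLParser):
    """Hypothetical base class: forwards each start tag to a method
    named start_ plus the tag name, if the subclass defines one."""

    def handle_starttag(self, tag, attributes):
        method = getattr(self, 'start_' + tag, None)
        if method is not None:
            method(attributes)

class LinkCollector(DispatchingParser):
    def __init__(self):
        DispatchingParser.__init__(self)
        self.links = []

    def start_a(self, attributes):
        # Called only for <a ...> tags, thanks to the dispatching above
        for name, value in attributes:
            if name == 'href' and value not in self.links:
                self.links.append(value)

h = LinkCollector()
h.feed('<p><a href="http://www.python.org/">Python</a> '
       '<a name="top">anchor</a> '
       '<a href="http://www.python.org/">again</a></p>')
h.close()
```

Tags without a matching method are ignored, so a subclass defines handlers only for the tags it wants to process.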