22.1 The sgmllib Module
The name of the sgmllib module is misleading: sgmllib parses only a
tiny subset of SGML, but it is still a good way to get information
from HTML files. sgmllib supplies one class, SGMLParser, which you
subclass to override and add methods. The most frequently used methods
of an instance s of your subclass X of SGMLParser are as follows.

s.close( )
Tells the parser that there is no more input data. When X overrides
close, X.close must call SGMLParser.close to ensure that buffered data
gets processed.

s.do_tag(attributes)
X supplies a method with such a name for each tag, with no
corresponding end tag, that X wants to process. tag must be in
lowercase in the method name, but can be in any mix of cases in the
parsed text. SGMLParser's handle_starttag method calls do_tag as
appropriate. attributes is a list of pairs (name,value), where name is
each attribute's name, lowercased, and value is the value, processed
to resolve entity references and character references and to remove
surrounding quotes.

s.end_tag( )
X supplies a method with such a name for each tag whose end tag X
wants to process. tag must be in lowercase in the method name, but can
be in any mix of cases in the parsed text. X must also supply a method
named start_tag, otherwise end_tag is ignored. SGMLParser's
handle_endtag method calls end_tag as appropriate.

s.feed(data)
Passes to the parser some of the text being parsed. The parser may
process some prefix of the text, holding the rest in a buffer until
the next call to s.feed or s.close.

s.handle_charref(ref)
Called to process a character reference '&#ref;'. SGMLParser's
implementation of handle_charref handles decimal numbers in
range(0,256), like:

    def handle_charref(self, ref):
        try:
            c = chr(int(ref))
        except (TypeError, ValueError):
            self.unknown_charref(ref)
        else: self.handle_data(c)

Your subclass X may override handle_charref or unknown_charref in
order to support other forms of character references '&#...;'.
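
As one illustration of "other forms" of character references, the sketch below also accepts hexadecimal references such as '&#x41;'. The helper name resolve_charref is hypothetical and not part of sgmllib, and treating a leading 'x' as a hexadecimal marker is an assumption made here. The helper is kept free of sgmllib so that it can be tried on its own.

```python
def resolve_charref(ref):
    """Return the character for the body of '&#ref;', or None.

    Hypothetical helper, not part of sgmllib. It accepts decimal
    references and, as an assumption of this sketch, hexadecimal
    ones that start with 'x' or 'X'.
    """
    try:
        if ref[:1] in ('x', 'X'):
            code = int(ref[1:], 16)
        else:
            code = int(ref)
    except (TypeError, ValueError):
        return None
    # Stay within the same range(0,256) as the default implementation.
    if not 0 <= code < 256:
        return None
    return chr(code)
```

An overriding handle_charref in X could call such a helper, pass a non-None result to self.handle_data, and call self.unknown_charref(ref) otherwise.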

s.handle_comment(comment)
Called to handle comments. comment is the string within '<!--...-->',
without the delimiters. SGMLParser's implementation of handle_comment
does nothing.

s.handle_data(data)
Called to process each arbitrary string data. Your subclass X normally
overrides handle_data. SGMLParser's implementation of handle_data does
nothing.

s.handle_endtag(tag,method)
Called to handle termination tags for which X supplies methods named
start_tag and end_tag. tag is the tag string, lowercased. method is
the bound method for end_tag. SGMLParser's implementation of
handle_endtag calls method( ).

s.handle_entityref(ref)
Called to process an entity reference '&ref;'. SGMLParser's
implementation of handle_entityref looks ref up in s.entitydefs, like:

    def handle_entityref(self, ref):
        try: t = self.entitydefs[ref]
        except KeyError: self.unknown_entityref(ref)
        else: self.handle_data(t)

Your subclass X may override handle_entityref or unknown_entityref in
order to support entity references '&...;' in different ways.
SGMLParser's attribute entitydefs includes keys 'amp', 'apos', 'gt',
'lt', and 'quot'.
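
The lookup-then-fallback flow described above can be tried in isolation with a small stand-in class. EntityCollector below is hypothetical: it does not derive from SGMLParser and only mirrors the three methods involved. Re-emitting an unknown reference verbatim in unknown_entityref is one possible policy, assumed here for illustration.

```python
class EntityCollector(object):
    """Stand-in that mimics the entity-reference callbacks.

    Hypothetical class for illustration; it is not sgmllib.SGMLParser.
    """
    # The same five keys that the text lists for entitydefs.
    entitydefs = {'amp': '&', 'apos': "'", 'gt': '>', 'lt': '<', 'quot': '"'}

    def __init__(self):
        self.text = []      # data chunks seen so far
        self.unknown = []   # names of unresolved entity references

    def handle_data(self, data):
        self.text.append(data)

    def unknown_entityref(self, ref):
        # Assumed policy: record the name and keep the reference as-is.
        self.unknown.append(ref)
        self.handle_data('&%s;' % ref)

    def handle_entityref(self, ref):
        try:
            t = self.entitydefs[ref]
        except KeyError:
            self.unknown_entityref(ref)
        else:
            self.handle_data(t)


collector = EntityCollector()
for ref in ('lt', 'nbsp', 'gt'):
    collector.handle_entityref(ref)
result = ''.join(collector.text)
```

A real subclass X would define only the overriding unknown_entityref (or extend entitydefs) and inherit the rest from SGMLParser.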

s.handle_starttag(tag, method, attributes)
Called to handle tags for which X supplies a method start_tag or
do_tag. tag is the tag string, lowercased. method is the bound method
for start_tag or do_tag. attributes is a list of pairs (name,value),
where name is each attribute's name, lowercased, and value is the
value, processed to resolve entity references and character references
and to remove surrounding quotes. When X supplies both start_tag and
do_tag methods, start_tag has precedence and do_tag is ignored.
SGMLParser's implementation of handle_starttag calls
method(attributes).
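
The precedence rule can be pictured with a minimal dispatcher. This is a sketch of the lookup order only, assuming a plain getattr search by method name; MiniDispatcher, dispatch_start, and Demo are hypothetical names, and the code is not sgmllib's actual source.

```python
class MiniDispatcher(object):
    """Hypothetical stand-in showing the start_tag/do_tag lookup order."""

    def __init__(self):
        self.calls = []

    def dispatch_start(self, tag, attributes):
        tag = tag.lower()  # method names always use the lowercased tag
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            # do_tag is consulted only when no start_tag method exists
            method = getattr(self, 'do_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.calls.append(('unknown', tag))


class Demo(MiniDispatcher):
    def start_a(self, attributes):
        self.calls.append(('start_a', attributes))

    def do_a(self, attributes):  # ignored, because start_a exists
        self.calls.append(('do_a', attributes))

    def do_br(self, attributes):
        self.calls.append(('do_br', attributes))


d = Demo()
d.dispatch_start('A', [('href', 'x.html')])
d.dispatch_start('br', [])
d.dispatch_start('blink', [])
```

After these three calls, d.calls records start_a for the <A> tag, do_br for <br>, and an unknown entry for <blink>; do_a is never called.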

s.report_unbalanced(tag)
Called when an end tag appears with no corresponding open start tag.
tag is the tag string, lowercased. SGMLParser's implementation of
report_unbalanced does nothing.

s.start_tag(attributes)
X supplies a method thus named for each tag, with an end tag, that X
wants to process. tag must be in lowercase in the method name, but can
be in any mix of cases in the parsed text. SGMLParser's
handle_starttag method calls start_tag as appropriate. attributes is a
list of pairs (name,value), where name is each attribute's name,
lowercased, and value is the value, processed to resolve entity
references and character references and to remove surrounding quotes.

s.unknown_charref(ref)
Called to process invalid or unrecognized character references.
SGMLParser's implementation of unknown_charref does nothing.

s.unknown_endtag(tag)
Called to process termination tags for which X supplies no specific
method. SGMLParser's implementation of unknown_endtag does nothing.

s.unknown_entityref(ref)
Called to process unknown entity references. SGMLParser's
implementation of unknown_entityref does nothing.

s.unknown_starttag(tag, attributes)
Called to process tags for which X supplies no specific method. tag is
the tag string, lowercased. attributes is a list of pairs
(name,value), where name is each attribute's name, lowercased, and
value is the value, processed to resolve entity references and
character references and to remove surrounding quotes. SGMLParser's
implementation of unknown_starttag does nothing.

The following example uses sgmllib for a typical HTML-related task:
fetching a page from the Web with urllib, parsing it, and outputting
the hyperlinks. The example uses urlparse to check the page's links,
and outputs only links whose URLs have an explicit scheme of 'http'.

    import sgmllib, urllib, urlparse

    class LinksParser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.seen = {}                 # URLs already examined
        def do_a(self, attributes):
            for name, value in attributes:
                if name == 'href' and value not in self.seen:
                    self.seen[value] = True
                    pieces = urlparse.urlparse(value)
                    # pieces[0] is the URL's scheme
                    if pieces[0] != 'http': return
                    print urlparse.urlunparse(pieces)
                    return

    p = LinksParser()
    f = urllib.urlopen('http://www.python.org/index.html')
    BUFSIZE = 8192
    # feed the page to the parser one chunk at a time
    while True:
        data = f.read(BUFSIZE)
        if not data: break
        p.feed(data)
    p.close()                              # process any buffered data

Class LinksParser only needs to define method do_a. The superclass
calls back to this method for all <a> tags, and the method loops on
the attributes, looking for one named 'href', then works with the
corresponding value (i.e., the relevant URL).
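
The scheme check can also be exercised without any network access. The sketch below applies the same test (the first item of the parsed URL must be 'http') to a list of href values. The function name http_links and the sample URLs are hypothetical, and the import fallback is an assumption added so that the snippet also runs where urlparse has moved to urllib.parse.

```python
try:
    from urlparse import urlparse, urlunparse      # module used in the example
except ImportError:
    from urllib.parse import urlparse, urlunparse  # newer location of the same functions

def http_links(hrefs):
    """Return each distinct href whose explicit scheme is 'http', in order."""
    seen = set()
    result = []
    for href in hrefs:
        if href in seen:
            continue
        seen.add(href)
        pieces = urlparse(href)
        if pieces[0] == 'http':   # pieces[0] is the scheme
            result.append(urlunparse(pieces))
    return result

links = http_links([
    'http://www.python.org/doc/',
    '/about/',                      # relative link, no scheme
    'mailto:someone@example.com',
    'http://www.python.org/doc/',   # duplicate, skipped
    'ftp://ftp.python.org/pub/',
])
```

Only the first sample URL survives the filter: the relative link has an empty scheme, and the other schemes are not 'http'.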