May 2001
The sgmllib module, shown in Example 5-5, provides a basic SGML parser. It works much like the xmllib parser, but is less restrictive (and less complete).
As in xmllib, this parser calls methods on itself to deal with things like start tags, data sections, end tags, and entities. If you're only interested in a few tags, you can define special start and end methods named after the tag (start_title and end_title in Example 5-5).
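The dispatch described above can be illustrated without sgmllib itself, which is not available in Python 3. The following is only a sketch under that assumption: it builds on the standard html.parser module and emulates the per-tag lookup by hand. The names DispatchingParser, TitleCollector, and get_title are hypothetical and are not part of any library.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical helper that emulates sgmllib-style per-tag dispatch."""

    def handle_starttag(self, tag, attrs):
        # look for a tag-specific method such as start_title
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attrs)

    def handle_endtag(self, tag):
        # look for a tag-specific method such as end_title
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()


class TitleCollector(DispatchingParser):
    def __init__(self):
        super().__init__()
        self.title = None
        self._chunks = None  # a list only while inside <title>

    def handle_data(self, data):
        # collect text only between <title> and </title>
        if self._chunks is not None:
            self._chunks.append(data)

    def start_title(self, attrs):
        self._chunks = []

    def end_title(self):
        if self._chunks is not None:
            self.title = "".join(self._chunks)
            self._chunks = None


def get_title(text):
    # return the title text, or None if no title element was seen
    parser = TitleCollector()
    parser.feed(text)
    parser.close()
    return parser.title
```

Tags without a matching start or end method are silently ignored in this sketch, so a subclass only needs to define methods for the tags it cares about.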
Example 5-5. Using the sgmllib Module to Extract the Title Element
File: sgmllib-example-1.py
import sgmllib
import string

class FoundTitle(Exception):
    pass

class ExtractTitle(sgmllib.SGMLParser):

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.title = self.data = None

    def handle_data(self, data):
        if self.data is not None:
            self.data.append(data)

    def start_title(self, attrs):
        self.data = []

    def end_title(self):
        self.title = string.join(self.data, "")
        raise FoundTitle # abort parsing!

def extract(file):
    # extract title from an HTML/SGML stream
    p = ExtractTitle()
    try:
        while 1:
            # read small chunks
            s = file.read(512)
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundTitle:
        return p.title
    return None

#
# try it out

print "html", "=>", extract(open("samples/sample.htm"))
print "sgml", "=>", extract(open("samples/sample.sgm"))
html => A Title.
sgml => Quotations
To handle all tags, override the unknown_starttag and unknown_endtag methods instead, as Example 5-6 demonstrates.
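As a rough illustration of the catch-all idea (this is not the book's Example 5-6), the sketch below routes every tag that has no dedicated method to unknown_starttag and unknown_endtag. Because sgmllib is not available in Python 3, the sketch assumes html.parser as a stand-in and emulates the fallback by hand. TagOutline and outline are hypothetical names.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Hypothetical sketch: record an indented outline of all tags."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        # prefer a tag-specific method, else fall back to the catch-all
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def unknown_starttag(self, tag, attrs):
        # called for every start tag without a start_<tag> method
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def unknown_endtag(self, tag):
        # called for every end tag without an end_<tag> method
        self.depth = max(self.depth - 1, 0)


def outline(text):
    # return one indented line per start tag, in document order
    parser = TagOutline()
    parser.feed(text)
    parser.close()
    return parser.lines
```

With this layout, a subclass can still add a start_p or end_p method to treat one tag specially while every other tag continues to go through the two catch-all handlers.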
Example 5-6. Using the sgmllib Module to Format an SGML Document
File: sgmllib-example-2.py ...