The sgmllib Module
The sgmllib module, shown in Example 5-5, provides a basic SGML parser. It works much like the xmllib parser, but is less restrictive (and less complete).
As in xmllib, this parser calls methods on itself to handle start tags, data sections, end tags, and entities. If you’re only interested in a few tags, you can define special start_tag and end_tag methods, where tag is the name of the element (for example, start_title and end_title).
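The naming convention described above can be sketched without sgmllib itself. The following is a minimal illustration written for Python 3, not the module's actual implementation: the classes TinyDispatcher and TitleOnly and the regular-expression tokenizer are hypothetical, and only the method names (start_tag, end_tag, handle_data, unknown_starttag, unknown_endtag) follow sgmllib's convention.

```python
import re

# Hypothetical, simplified tokenizer: recognizes <tag ...> and </tag> only.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

class TinyDispatcher:
    """Illustrative only: mimics sgmllib's method-naming convention."""

    def feed(self, text):
        pos = 0
        for m in TAG.finditer(text):
            if m.start() > pos:
                self.handle_data(text[pos:m.start()])
            closing, tag = m.group(1), m.group(2).lower()
            if closing:
                # look for end_<tag>, else fall back to the catch-all
                method = getattr(self, "end_" + tag, None)
                if method:
                    method()
                else:
                    self.unknown_endtag(tag)
            else:
                # look for start_<tag>, else fall back to the catch-all
                method = getattr(self, "start_" + tag, None)
                if method:
                    method([])  # attributes are not parsed in this sketch
                else:
                    self.unknown_starttag(tag, [])
            pos = m.end()
        if pos < len(text):
            self.handle_data(text[pos:])

    # default handlers do nothing; subclasses override what they need
    def handle_data(self, data):
        pass

    def unknown_starttag(self, tag, attrs):
        pass

    def unknown_endtag(self, tag):
        pass

class TitleOnly(TinyDispatcher):
    def __init__(self):
        self.title = None
        self._chunks = None

    def start_title(self, attrs):
        self._chunks = []  # start collecting character data

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def end_title(self):
        self.title = "".join(self._chunks or [])
        self._chunks = None  # stop collecting

p = TitleOnly()
p.feed("<html><head><title>A Title.</title></head><body><p>text</p></body></html>")
print(p.title)
```

Only the title element triggers the specialized methods here; every other tag falls through to the do-nothing catch-all handlers.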
Example 5-5. Using the sgmllib Module to Extract the Title Element
File: sgmllib-example-1.py

import sgmllib
import string

class FoundTitle(Exception):
    pass

class ExtractTitle(sgmllib.SGMLParser):

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.title = self.data = None

    def handle_data(self, data):
        if self.data is not None:
            self.data.append(data)

    def start_title(self, attrs):
        self.data = []

    def end_title(self):
        self.title = string.join(self.data, "")
        raise FoundTitle # abort parsing!

def extract(file):
    # extract title from an HTML/SGML stream
    p = ExtractTitle()
    try:
        while 1:
            # read small chunks
            s = file.read(512)
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundTitle:
        return p.title
    return None

#
# try it out

print "html", "=>", extract(open("samples/sample.htm"))
print "sgml", "=>", extract(open("samples/sample.sgm"))

html => A Title.
sgml => Quotations
To handle all tags, override the unknown_starttag and unknown_endtag methods instead, as Example 5-6 demonstrates.
Example 5-6. Using the sgmllib Module to Format an SGML Document
File: sgmllib-example-2.py ...
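The catch-all idea can also be sketched in Python 3. This is not Example 5-6. Because sgmllib was removed in Python 3, the sketch swaps in html.parser.HTMLParser from the standard library, whose handle_starttag and handle_endtag methods are called for every tag and so play roughly the role of unknown_starttag and unknown_endtag. The class name OutlinePrinter, the sample input, and the indentation scheme are illustrative assumptions.

```python
from html.parser import HTMLParser

class OutlinePrinter(HTMLParser):
    """Records an indented outline of every tag and text run it sees."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # called for every start tag, known or not
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # called for every end tag; never let the depth go negative
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))

p = OutlinePrinter()
p.feed("<doc><title>Quotations</title><para>Some text.</para></doc>")
p.close()
print("\n".join(p.lines))
```

The parser needs no per-tag methods at all, so this approach suits formatters and converters that must treat every element uniformly.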