Credit: Jürgen Hermann
You need to convert Python source code into HTML markup, rendering comments, keywords, operators, and numeric and string literals in different colors.
tokenize.tokenize does most of the work and calls us back for each token found, so we can output it with appropriate colorization:
""" MoinMoin - Python Source Parser """
import cgi, string, sys, cStringIO
import keyword, token, tokenize

# Python Source Parser (does highlighting into HTML)
_KEYWORD = token.NT_OFFSET + 1
_TEXT = token.NT_OFFSET + 2
_colors = {
    token.NUMBER: '#0080C0',
    token.OP: '#0000C0',
    token.STRING: '#004080',
    tokenize.COMMENT: '#008000',
    token.NAME: '#000000',
    token.ERRORTOKEN: '#FF8080',
    _KEYWORD: '#C00000',
    _TEXT: '#000000',
}

class Parser:
    """ Send colorized Python source as HTML to an output file (normally stdout).
    """

    def __init__(self, raw, out = sys.stdout):
        """ Store the source text. """
        self.raw = string.strip(string.expandtabs(raw))
        self.out = out

    def format(self):
        """ Parse and send the colorized source to output. """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        while 1:
            pos = string.find(self.raw, '\n', pos) + 1
            if not pos: break
            self.lines.append(pos)
        self.lines.append(len(self.raw))

        # Parse the source and write it
        self.pos = 0
        text = cStringIO.StringIO(self.raw)
        self.out.write('<pre><font face="Lucida,Courier New">')
        try:
            tokenize.tokenize(text.readline, self) # self as handler callable
        except tokenize.TokenError, ex:
            msg = ex[0]
            line = ex[1][0]
            self.out.write("<h3>ERROR: %s</h3>%s\n" % (
                msg, self.raw[self.lines[line]:]))
        self.out.write('</font></pre>')

    def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
        """ Token handler """
        if 0: # You may enable this for debugging purposes only
            print "type", toktype, token.tok_name[toktype], "text", toktext,
            print "start", srow,scol, "end", erow,ecol, "<br>"

        # Calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)

        # Handle newlines
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.out.write('\n')
            return

        # Send the original whitespace, if needed
        if newpos > oldpos:
            self.out.write(self.raw[oldpos:newpos])

        # Skip indenting tokens
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return

        # Map token type to a color group
        if token.LPAR <= toktype <= token.OP:
            toktype = token.OP
        elif toktype == token.NAME and keyword.iskeyword(toktext):
            toktype = _KEYWORD
        color = _colors.get(toktype, _colors[_TEXT])

        style = ''
        if toktype == token.ERRORTOKEN:
            style = ' style="border: solid 1.5pt #FF0000;"'

        # Send text
        self.out.write('<font color="%s"%s>' % (color, style))
        self.out.write(cgi.escape(toktext))
        self.out.write('</font>')

if __name__ == "__main__":
    import os, sys
    print "Formatting..."

    # Open own source
    source = open('python.py').read()

    # Write colorized version to "python.html"
    Parser(source, open('python.html', 'wt')).format()

    # Load HTML page into browser
    if os.name == "nt":
        os.system("explorer python.html")
    else:
        os.system("netscape python.html &")
This code is part of MoinMoin (see http://moin.sourceforge.net/) and shows how to use the standard keyword, token, and tokenize modules to scan Python source code and re-emit it with appropriate color markup but no changes to its original formatting (“no changes” is the hard part!).
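As a small illustration of how the three modules cooperate, here is a minimal sketch written for Python 3, where the tokenizer is consumed as a generator instead of through a callback. The classify function and its "KEYWORD" label are hypothetical names introduced for this example; they are not part of MoinMoin.

```python
import io
import keyword
import token
import tokenize

def classify(source):
    """Return (category, text) pairs for the visible tokens in source.

    Hypothetical helper for illustration; assumes Python 3.
    """
    layout = (token.NEWLINE, tokenize.NL, token.INDENT,
              token.DEDENT, token.ENDMARKER)
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in layout:
            continue  # layout tokens carry no text worth coloring
        if tok.type == token.NAME and keyword.iskeyword(tok.string):
            category = "KEYWORD"  # the tokenizer itself reports plain NAME
        else:
            category = tokenize.tok_name[tok.type]
        pairs.append((category, tok.string))
    return pairs

for category, text in classify("if x: y = 42  # answer\n"):
    print(category, repr(text))
```

The tokenizer reports every identifier, keyword or not, as a NAME token, which is why the keyword module is needed to tell them apart.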
The Parser class’s constructor saves the multiline string that is the Python source to colorize and the file object, which is open for writing, where you want to output the colorized results. Then, the format method prepares a self.lines list that holds the offset (the index into the source string, self.raw) of each line’s start. The list begins with an extra 0 so that the tokenizer’s 1-based line numbers can index it directly. format then calls tokenize.tokenize, passing self as the callback. Thus, the __call__ method is invoked for each token, with arguments specifying the token type, the token text, and the starting and ending positions in the source (each expressed as line number and offset within the line). The body of the __call__ method reconstructs the exact position within the original source code string self.raw, so it can emit exactly the same whitespace that was present in the original source. It then picks a color code from the _colors dictionary (which uses HTML color coding), with help from the keyword standard module to determine if a NAME token is actually a Python keyword (to be emitted in a different color than that used for ordinary identifiers).
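The position bookkeeping is easier to see in isolation. The following is a minimal sketch of the same idea for Python 3, not the MoinMoin code: it assumes the generator form tokenize.generate_tokens and html.escape in place of the callback and cgi.escape used above, and it emits span elements instead of font tags. The names line_offsets, colorize, and COLORS are hypothetical, and error tokens are not handled.

```python
import html
import io
import keyword
import token
import tokenize

# Colors copied from the recipe's _colors table; uncolored tokens stay plain.
COLORS = {
    "NUMBER": "#0080C0", "OP": "#0000C0", "STRING": "#004080",
    "COMMENT": "#008000", "KEYWORD": "#C00000",
}

def line_offsets(source):
    """offsets[n] is the index in source where 1-based line n starts."""
    offsets = [0, 0]  # slot 0 is padding; line 1 starts at index 0
    pos = source.find("\n") + 1
    while pos:
        offsets.append(pos)
        pos = source.find("\n", pos) + 1
    offsets.append(len(source))
    return offsets

def colorize(source):
    """Return source as HTML, copying the text between tokens verbatim."""
    offsets = line_offsets(source)
    layout = (token.NEWLINE, tokenize.NL, token.INDENT,
              token.DEDENT, token.ENDMARKER)
    out, pos = [], 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in layout:
            continue  # pure whitespace; copied later as part of a gap
        # Convert (row, column) pairs into indexes into the source string.
        start = offsets[tok.start[0]] + tok.start[1]
        end = offsets[tok.end[0]] + tok.end[1]
        out.append(html.escape(source[pos:start]))  # original whitespace
        if tok.type == token.NAME and keyword.iskeyword(tok.string):
            kind = "KEYWORD"
        else:
            kind = tokenize.tok_name[tok.type]
        text = html.escape(source[start:end])
        color = COLORS.get(kind)
        if color:
            text = '<span style="color:%s">%s</span>' % (color, text)
        out.append(text)
        pos = end
    out.append(html.escape(source[pos:]))  # whatever follows the last token
    return "".join(out)

print(colorize("if x:\n    y = 1\n"))
```

Because everything between two tokens is copied straight from the source string, removing the markup from the result should give back the original text unchanged, which is the “no changes” property the recipe is after.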
The test code at the bottom of the module formats the module itself and launches a browser with the result. It does not use the standard Python module webbrowser to ensure compatibility with stone-age versions of Python. If you have no such worries, you can change the last few lines of the recipe to:
    # Load HTML page into browser
    import webbrowser
    webbrowser.open("python.html", 0, 1)
and enjoy the result in your favorite browser.
Documentation for the webbrowser, token, tokenize, and keyword modules in the Library Reference; the colorizer is available at http://purl.net/wiki/python/MoinMoinColorizer, part of MoinMoin (http://moin.sourceforge.net).