O'Reilly logo

Python Cookbook, 2nd Edition by David Ascher, Anna Ravenscroft, Alex Martelli

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

1.23. Encoding Unicode Data for XML and HTML

Credit: David Goodger, Peter Cogolo

Problem

You want to encode Unicode text for output in HTML, or some other XML application, using a limited but popular encoding such as ASCII or Latin-1.

Solution

Python provides an encoding error handler named xmlcharrefreplace, which replaces all characters outside of the chosen encoding with XML numeric character references:

def encode_for_xml(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'xmlcharrefreplace')

You could use this approach for HTML output, too, but you might prefer to use HTML's symbolic entity references instead. For this purpose, you need to define and register a customized encoding error handler. Implementing that handler is made easier by the fact that the Python Standard Library includes a module named htmlentitydefs that holds HTML entity definitions:

import codecs
from htmlentitydefs import codepoint2name
def html_replace(exc):
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        s = [ u'&%s;' % codepoint2name[ord(c)]
              for c in exc.object[exc.start:exc.end] ]
        return ''.join(s), exc.end
    else:
        raise TypeError("can't handle %s" % exc._ _name_ _) 
codecs.register_error('html_replace', html_replace)

After registering this error handler, you can optionally write a function to wrap its use:

def encode_for_html(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'html_replace')

Discussion

As with any good Python module, this module would normally proceed with an example of its use, guarded by an if _ _name_ _ == '_ _main_ _' test:

if _ _name_ _ == '_ _main_ _':
    # demo
    data = u'''\
<html>
<head>
<title>Encoding Test</title>
</head>
<body>
<p>accented characters:
<ul>
<li>\xe0 (a + grave)
<li>\xe7 (c + cedilla)
<li>\xe9 (e + acute)
</ul>
<p>symbols:
<ul>
<li>\xa3 (British pound)
<li>\u20ac (Euro)
<li>\u221e (infinity)
</ul>
</body></html>
'''
    print encode_for_xml(data)
    print encode_for_html(data)

If you run this module as a main script, you will then see such output as (from function encode_for_xml):

<li>&#224; (a + grave)
<li>&#231; (c + cedilla)
<li>&#233; (e + acute)...
<li>&#163; (British pound)
<li>&#8364; (Euro)
<li>&#8734; (infinity)

as well as (from function encode_for_html):

<li>&agrave; (a + grave)
<li>&ccedil; (c + cedilla)
<li>&eacute; (e + acute)...
<li>&pound; (British pound)
<li>&euro; (Euro)
<li>&infin; (infinity)

There is clearly a niche for each case, since encode_for_xml is more general (you can use it for any XML application, not just HTML), but encode_for_html may produce output that's easier to read—should you ever need to look at it directly, edit it further, and so on. If you feed either form to a browser, you should view it in exactly the same way. To visualize both forms of encoding in a browser, run this recipe's module as a main script, redirect the output to a disk file, and use a text editor to separate the two halves before you view them with a browser. (Alternatively, run the script twice, once commenting out the call to encode_for_xml, and once commenting out the call to encode_for_html.)

Remember that Unicode data must always be encoded before being printed or written out to a file. UTF-8 is an ideal encoding, since it can handle any Unicode character. But for many users and applications, ASCII or Latin-1 encodings are often preferred over UTF-8. When the Unicode data contains characters that are outside of the given encoding (e.g., accented characters and most symbols are not encodable in ASCII, and the "infinity" symbol is not encodable in Latin-1), these encodings cannot handle the data on their own. Python supports a built-in encoding error handler called xmlcharrefreplace, which replaces unencodable characters with XML numeric character references, such as &#8734; for the "infinity" symbol. This recipe shows how to write and register another similar error handler, html_replace, specifically for producing HTML output. html_replace replaces unencodable characters with more readable HTML symbolic entity references, such as &infin; for the "infinity" symbol. html_replace is less general than xmlcharrefreplace, since it does not support all Unicode characters and cannot be used with non-HTML applications; however, it can still be useful if you want HTML output that is as readable as possible in a "view page source" context.

Neither of these error handlers makes sense for output that is neither HTML nor some other form of XML. For example, TeX and other markup languages do not recognize XML numeric character references. However, if you know how to build an arbitrary character reference for such a markup language, you may modify the example error handler html_replace shown in this recipe's Solution to code and register your own encoding error handler.

An alternative (and very effective!) way to perform encoding of Unicode data into a file, with a given encoding and error handler of your choice, is offered by the codecs module in Python's standard library:

outfile = codecs.open('out.html', mode='w', encoding='ascii',
                       errors='html_replace')

You can now use outfile.write(unicode_data) for any arbitrary Unicode string unicode_data, and all the encoding and error handling will be taken care of transparently. When your output is finished, of course, you should call outfile.close( ).

See Also

Library Reference and Python in a Nutshell docs for modules codecs and htmlentitydefs.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required