Getting ready

We will read a file named unicode.html from our local web server at http://localhost:8080/unicode.html. This file is UTF-8 encoded and contains several sets of characters from different parts of the Unicode code space. The page looks as follows in your browser:

The Page in the Browser

Opening the file in an editor that supports UTF-8 shows how the Cyrillic characters are rendered:

The HTML in an Editor

Code for the sample is in 02/06_unicode.py.
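
To illustrate why the encoding matters before running the recipe, here is a minimal sketch. It is not the contents of 02/06_unicode.py. The helper names fetch_bytes and decode_page are hypothetical, and the sample string is a stand-in for the real page, which is assumed to be served only when the local web server is running.

```python
from urllib.request import urlopen

# URL taken from the text; it only resolves if the local web server is running.
URL = "http://localhost:8080/unicode.html"


def fetch_bytes(url: str = URL) -> bytes:
    """Return the raw, undecoded bytes of the response body (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def decode_page(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes into text using the given encoding (hypothetical helper)."""
    return raw.decode(encoding)


# Stand-in for the page content: a short Cyrillic string, not the real file.
sample = "Привет".encode("utf-8")

good = decode_page(sample)             # decoded with the correct encoding
bad = decode_page(sample, "latin-1")   # wrong encoding: one character per byte

# Each Cyrillic letter takes two bytes in UTF-8, so the byte count is double
# the character count, and the Latin-1 decoding yields twice as many characters.
print(len(sample), len(good), len(bad))  # prints: 12 6 12
```

With the server running, decode_page(fetch_bytes()) would return the page text, assuming the file really is UTF-8 as described above.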
