February 2018
Beginner to intermediate
364 pages
10h 32m
English
We will look at using urlopen and requests to handle HTML encoded in UTF-8. The two libraries handle this differently, so let's examine each in turn. Let's start by importing urlopen from urllib.request, loading the page, and examining some of its content.
In [8]: from urllib.request import urlopen
   ...: page = urlopen("http://localhost:8080/unicode.html")
   ...: content = page.read()
   ...: content[840:1280]
   ...:
Out[8]: b'><strong>Cyrillic</strong> U+0400 \xe2\x80\x93 U+04FF (1024\xe2\x80\x931279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50"> </td>\n <td class="b" width="50">\xd0\x89</td>\n <td class="b" width="50">\xd0\xa9</td>\n <td class="b" width="50">\xd1\x89</td>\n <td class="b" width="50">\xd3\x83</td>\n </tr>\n ...
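The escape sequences in that output are raw bytes, because read() returns a bytes object rather than text. As a minimal sketch (not necessarily the next step the book takes), the snippet below decodes the four Cyrillic byte pairs copied from the Out[8] result. The variable names raw, text, and code_points are illustrative, and it assumes the page really is UTF-8 encoded, as the surrounding text states.

```python
# Bytes copied from the Out[8] result: the contents of the four table cells,
# each a two-byte UTF-8 sequence. (Assumption: the page is UTF-8 encoded.)
raw = b"\xd0\x89\xd0\xa9\xd1\x89\xd3\x83"

# bytes.decode turns the raw bytes into a str of Unicode characters.
text = raw.decode("utf-8")

# Each character's code point falls in the Cyrillic block U+0400 to U+04FF.
code_points = [f"U+{ord(ch):04X}" for ch in text]

print(len(raw), len(text))  # 8 bytes become 4 characters
print(code_points)
```

The same call applied to the whole response, content.decode("utf-8"), would give the page as a str, provided every byte in it is valid UTF-8.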