May 2017
Beginner to intermediate
220 pages
5h 2m
English
To scrape web pages, we first need to download them. Here is a simple Python script that uses Python's urllib module to download a URL:
import urllib.request

def download(url):
    return urllib.request.urlopen(url).read()
When a URL is passed, this function downloads the web page and returns the HTML. The problem with this snippet is that, when downloading the web page, we might encounter errors that are beyond our control; for example, the requested page may no longer exist. In these cases, urllib raises an exception which, if left uncaught, terminates the script. Here is a more robust version that catches these exceptions:
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url): ...
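The listing above is cut off in this excerpt, so the following is only a minimal sketch of what such a function could look like, not necessarily the book's version. It assumes that a failed download should be reported with a printed message and signalled to the caller by returning None; both choices are assumptions made for illustration.

```python
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url):
    """Return the body of `url` as bytes, or None if the download fails."""
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        # HTTPError and ContentTooShortError are subclasses of URLError,
        # so listing all three is explicit rather than strictly required.
        print('Download error:', e)
        # Assumption for this sketch: signal failure with None.
        html = None
    return html
```

With this shape, the calling code can test the result for None and skip or retry the page instead of letting the whole script stop on the first bad URL.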