Chapter 4. Writing Your First Web Scraper

Once you start web scraping, you start to appreciate all the little things that browsers do for you. The web, without its layers of HTML formatting, CSS styling, JavaScript execution, and image rendering, can look a little intimidating at first. In this chapter, we’ll begin to look at how to format and interpret this bare data without the help of a web browser.

This chapter starts with the basics of sending a GET request (a request to fetch, or “get,” the content of a web page) to a web server for a specific page, reading the HTML output from that page, and doing some simple data extraction in order to isolate the content you are looking for.

Installing and Using Jupyter

The code for this book can be found at https://github.com/REMitchell/python-scraping. In most cases, code samples are in the form of Jupyter Notebook files, with an .ipynb extension.

If you haven’t used them already, Jupyter Notebooks are an excellent way to organize and work with many small but related pieces of Python code, as shown in Figure 4-1.

Figure 4-1. A Jupyter Notebook running in the browser

Each piece of code is contained in a box called a cell. The code within each cell can be run by pressing Shift + Enter, or by clicking the Run button at the top of the page.

Project Jupyter began as a spin-off project from the IPython (Interactive Python) project in 2014. These notebooks were designed to run Python code in the browser in an accessible and interactive way that would lend itself to teaching and presenting.

To install Jupyter Notebooks:

$ pip install notebook

After installation, you should have access to the jupyter command, which will allow you to start the web server. Navigate to the directory containing the downloaded exercise files for this book, and run:

$ jupyter notebook

This will start the web server on port 8888. If you have a web browser running, a new tab should open automatically. If it doesn’t, copy the URL shown in the terminal, with the provided token, to your web browser.

Connecting

In the first section of this book, we took a deep dive into how the internet sends packets of data across wires from a browser to a web server and back again. When you open a browser, type in google.com, and hit Enter, that’s exactly what’s happening—data, in the form of an HTTP request, is being transferred from your computer, and Google’s web server is responding with an HTML file that represents the data at the root of google.com.

But where, in this exchange of packets and frames, does the web browser actually come into play? Absolutely nowhere. In fact, ARPANET (the first public packet-switched network) predated the first web browser, Nexus, by at least 20 years.

Yes, the web browser is a useful application for creating these packets of information, telling your operating system to send them off and interpreting the data you get back as pretty pictures, sounds, videos, and text. However, a web browser is just code, and code can be taken apart, broken into its basic components, rewritten, reused, and made to do anything you want. A web browser can tell the processor to send data to the application that handles your wireless (or wired) interface, but you can do the same thing in Python with just three lines of code:

from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

To run this, you can use the IPython notebook for Chapter 1 in the GitHub repository, or you can save it locally as scrapetest.py and run it in your terminal by using this command:

$ python scrapetest.py

Note that if you also have Python 2.x installed on your machine and are running both versions of Python side by side, you may need to explicitly call Python 3.x by running the command this way:

$ python3 scrapetest.py

This command outputs the complete HTML code for page1 located at the URL http://pythonscraping.com/pages/page1.html. More accurately, this outputs the HTML file page1.html, found in the directory <web root>/pages, on the server located at the domain pythonscraping.com.

Why is it important to start thinking of these addresses as “files” rather than “pages”? Most modern web pages have many resource files associated with them. These could be image files, JavaScript files, CSS files, or any other content that the page you are requesting is linked to. When a web browser hits a tag such as <img src="cuteKitten.jpg">, the browser knows that it needs to make another request to the server to get the data at the location cuteKitten.jpg in order to fully render the page for the user.

Of course, your Python script doesn’t have the logic to go back and request multiple files (yet); it can read only the single HTML file that you’ve directly requested.

from urllib.request import urlopen

means what it looks like it means: it looks at the Python module request (found within the urllib library) and imports only the function urlopen.

urllib is a standard Python library (meaning you don’t have to install anything extra to run this example) and contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent. We will be using urllib extensively throughout the book, so I recommend you read the Python documentation for the library.
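For example, here is a minimal sketch of the header-changing capability mentioned above, using urllib’s Request class to attach a User-Agent before the request is sent (the User-Agent string shown is just an illustrative placeholder):

from urllib.request import Request, urlopen

# Build a Request object so headers such as the User-Agent can be set
# before the request is sent. 'MyScraper/0.1' is only a placeholder value.
req = Request(
    'http://pythonscraping.com/pages/page1.html',
    headers={'User-Agent': 'MyScraper/0.1'}
)
html = urlopen(req)
print(html.read()[:100])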

urlopen is used to open a remote object across a network and read it. Because it is a fairly generic function (it can read HTML files, image files, or any other file stream with ease), we will be using it quite frequently throughout the book.
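To get a feel for that generality, the following sketch inspects the HTTPResponse object that urlopen returns, rather than just printing its body; status, headers, and read() are all part of the standard library’s response object:

from urllib.request import urlopen

# Inspect the HTTPResponse object itself, not just the body it wraps
response = urlopen('http://pythonscraping.com/pages/page1.html')

print(response.status)                   # HTTP status code, e.g., 200
print(response.headers['Content-Type'])  # e.g., 'text/html; charset=utf-8'

raw = response.read()                    # read() returns bytes, not a str
print(raw[:50])                          # the first 50 raw bytes
print(raw.decode('utf-8')[:50])          # decoded into a Python string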

An Introduction to BeautifulSoup

Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!

The BeautifulSoup library was named after a Lewis Carroll poem of the same name in Alice’s Adventures in Wonderland. In the story, this poem is sung by a character called the Mock Turtle (itself a pun on the popular Victorian dish Mock Turtle Soup made not of turtle but of cow).

Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.

Installing BeautifulSoup

Because the BeautifulSoup library is not a default Python library, it must be installed. If you’re already experienced at installing Python libraries, please use your favorite installer and skip ahead to the next section, “Running BeautifulSoup”.

For those who have not installed Python libraries (or need a refresher), this general method will be used for installing multiple libraries throughout the book, so you may want to reference this section in the future.

We will be using the BeautifulSoup 4 library (also known as BS4) throughout this book. The complete documentation, as well as installation instructions, for BeautifulSoup 4 can be found at Crummy.com.

If you’ve spent much time writing Python, you’ve probably used the package installer for Python (pip). If you haven’t, I highly recommend that you install pip in order to install BeautifulSoup and other Python packages used throughout this book.

Depending on the Python installer you used, pip may already be installed on your computer. To check, try:

$ pip

This command should result in the pip help text being printed to your terminal. If the command isn’t recognized, you may need to install pip. Pip can be installed in a variety of ways, such as with apt-get on Linux or brew on macOS. Regardless of your operating system, you can also download the pip bootstrap file at https://bootstrap.pypa.io/get-pip.py, save this file as get-pip.py, and run it with Python:

$ python get-pip.py

Again, note that if you have both Python 2.x and 3.x installed on your machine, you might need to call python3 explicitly:

$ python3 get-pip.py

Finally, use pip to install BeautifulSoup:

$ pip install bs4

If you have two versions of Python, along with two versions of pip, you may need to call pip3 to install the Python 3.x versions of packages:

$ pip3 install bs4

And that’s it! BeautifulSoup will now be recognized as a Python library on your machine. You can test this by opening a Python terminal and importing it:

$ python
>>> from bs4 import BeautifulSoup

The import should complete without errors.

Running BeautifulSoup

The most commonly used object in the BeautifulSoup library is, appropriately, the BeautifulSoup object. Let’s take a look at it in action, modifying the example found in the beginning of this chapter:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

The output is as follows:

<h1>An Interesting Title</h1>

Note that this returns only the first instance of the h1 tag found on the page. By convention, only one h1 tag should be used on a single page, but conventions are often broken on the web, so you should be aware that this will retrieve only the first instance of the tag, and not necessarily the one that you’re looking for.
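If a page does break that convention and you want every h1 tag rather than just the first, BeautifulSoup’s find_all method (covered in more detail in Chapter 5) returns a list of all matches. A quick sketch, reusing the bs object created above:

# find_all returns a list of every matching Tag object on the page
# (page1.html contains only one h1, so this list has a single entry)
for h1 in bs.find_all('h1'):
    print(h1.get_text())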

As in previous web scraping examples, you are importing the urlopen function and calling html.read() to get the HTML content of the page. In addition to the text string, BeautifulSoup can use the file object directly returned by urlopen, without needing to call .read() first:

bs = BeautifulSoup(html, 'html.parser')

This HTML content is then transformed into a BeautifulSoup object with the following structure:

  • html → <html><head>...</head><body>...</body></html>
    • head → <head><title>A Useful Page</title></head>
      • title → <title>A Useful Page</title>
    • body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
      • h1 → <h1>An Interesting Title</h1>
      • div → <div>Lorem Ipsum dolor...</div>

Note that the h1 tag that you extract from the page is nested two layers deep into your BeautifulSoup object structure (html → body → h1). However, when you actually fetch it from the object, you call the h1 tag directly:

bs.h1

In fact, any of the following calls would produce the same output:

bs.html.body.h1
bs.body.h1
bs.html.h1

When you create a BeautifulSoup object, two arguments are passed in:

bs = BeautifulSoup(html.read(), 'html.parser')

The first is the HTML string that the object is based on, and the second specifies the parser that you want BeautifulSoup to use to create that object. In the majority of cases, it makes no difference which parser you choose.

html.parser is a parser that is included with Python 3 and requires no extra installations to use. Except where required, we will use this parser throughout the book.

Another popular parser is lxml. This can be installed through pip:

$ pip install lxml

lxml can be used with BeautifulSoup by changing the parser string provided:

bs = BeautifulSoup(html.read(), 'lxml')

lxml has some advantages over html.parser in that it is generally better at parsing “messy” or malformed HTML code. It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags.

lxml is also somewhat faster than html.parser, although speed is not necessarily an advantage in web scraping, given that the speed of the network itself will almost always be your largest bottleneck.

Avoid Over-Optimizing Web Scraping Code

Elegant algorithms are lovely to behold, but when it comes to web scraping, they may not have a practical impact. A few microseconds of processing time will likely be dwarfed by the—sometimes actual—seconds of network latency that a network request takes.

Good web scraping code generally focuses on robust and easily readable implementations, rather than clever processing optimizations.
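To put rough numbers behind this, here is a sketch that times both parsers against the same synthetic document (the absolute figures will vary by machine and library version, and lxml must be installed for the second measurement to run):

import timeit

setup = '''
from bs4 import BeautifulSoup
html = '<html><body>' + '<p>Lorem ipsum</p>' * 1000 + '</body></html>'
'''

for parser in ['html.parser', 'lxml']:
    seconds = timeit.timeit(
        f"BeautifulSoup(html, '{parser}')", setup=setup, number=100)
    print(f'{parser}: {seconds:.3f} seconds for 100 parses')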

One of the disadvantages of lxml is that it needs to be installed separately and depends on third-party C libraries to function. This can cause problems for portability and ease of use, compared to html.parser.

Another popular HTML parser is html5lib. Like lxml, html5lib is an extremely forgiving parser that takes even more initiative in correcting broken HTML. It also must be installed separately, and it is slower than both lxml and html.parser. Despite this, it may be a good choice if you are working with messy or handwritten HTML sites.

It can be used by installing and passing the string html5lib to the BeautifulSoup object:

bs = BeautifulSoup(html.read(), 'html5lib')
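To compare the three parsers directly, here is a hedged sketch that feeds the same deliberately broken fragment to each of them and prints the repaired tree (the exact output depends on the parser versions you have installed, and lxml and html5lib are optional installs):

from bs4 import BeautifulSoup, FeatureNotFound

# A deliberately messy fragment: unclosed <li> tags and no <html> or <body>
messy_html = '<ul><li>First item<li>Second item<p>A trailing paragraph'

for parser in ['html.parser', 'lxml', 'html5lib']:
    try:
        soup = BeautifulSoup(messy_html, parser)
        print(f'--- {parser} ---')
        print(soup.prettify())
    except FeatureNotFound:
        # lxml and html5lib must be installed separately; skip missing ones
        print(f'--- {parser} is not installed ---')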

I hope this small taste of BeautifulSoup has given you an idea of the power and simplicity of this library. Virtually any information can be extracted from any HTML (or XML) file, as long as it has an identifying tag surrounding it or near it. Chapter 5 delves more deeply into more-complex BeautifulSoup function calls and presents regular expressions and how they can be used with BeautifulSoup in order to extract information from websites.

Connecting Reliably and Handling Exceptions

The web is messy. Data is poorly formatted, websites go down, and closing tags go missing. One of the most frustrating experiences in web scraping is to go to sleep with a scraper running, dreaming of all the data you’ll have in your database the next day—only to find that the scraper hit an error on some unexpected data format and stopped execution shortly after you stopped looking at the screen.

In situations like these, you might be tempted to curse the name of the developer who created the website (and the oddly formatted data), but the person you should really be kicking is yourself for not anticipating the exception in the first place!

Let’s look at the first line of our scraper, after the import statements, and figure out how to handle any exceptions this might throw:

html = urlopen('http://www.pythonscraping.com/pages/page1.html') 

Two main things can go wrong in this line:

  • The page is not found on the server (or there was an error in retrieving it).
  • The server is not found at all.

In the first situation, an HTTP error will be returned. This HTTP error may be “404 Page Not Found,” “500 Internal Server Error,” and so forth. In all of these cases, the urlopen function will throw the generic exception HTTPError. You can handle this exception in the following way:

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
    # return None, break, or do some other "Plan B"
else:
    pass
    # The program continues. Note: if you return or break in the
    # exception catch, you do not need to use the "else" statement.
If an HTTP error code is returned, the program now prints the error and does not execute the rest of the program under the else statement.

If the server is not found at all (if, for example, http://www.pythonscraping.com is down, or the URL is mistyped), urlopen will throw a URLError. This indicates that no server could be reached at all; because it is the remote server that is responsible for returning HTTP status codes, an HTTPError cannot be thrown, and the more serious URLError must be caught instead. You can add a check to see whether this is the case:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')

Of course, if the page is retrieved successfully from the server, there is still the issue of the content on the page not being quite what you expected. Every time you access a tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a None object. The problem is, attempting to access a tag on a None object itself will result in an AttributeError being thrown.

The following line (where nonExistentTag is a made-up tag, not the name of a real BeautifulSoup function):

print(bs.nonExistentTag)

returns a None object. This object is perfectly reasonable to handle and check for. The trouble comes if you don’t check for it but instead go on and try to call another function on the None object, as illustrated here:

print(bs.nonExistentTag.someTag)

This returns an exception:

AttributeError: 'NoneType' object has no attribute 'someTag'

So how can you guard against these two situations? The easiest way is to explicitly check for both situations:

try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)

This checking and handling of every error does seem laborious at first, but it’s easy to add a little reorganization to this code to make it less difficult to write (and, more important, much less difficult to read). This code, for example, is our same scraper written in a slightly different way:

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

In this example, you’re creating a function, getTitle, which returns either the title of the page or a None object if there was a problem retrieving it. Inside getTitle, you check for an HTTPError, as in the previous example, and encapsulate two of the BeautifulSoup lines inside one try statement. An AttributeError might be thrown from either of these lines (for example, if the page has no body tag, bs.body will be None, and calling .h1 on it will throw an AttributeError). You could, in fact, encompass as many lines as you want inside one try statement, or call another function entirely, which can throw an AttributeError at any point.

When writing scrapers, it’s important to think about the overall pattern of your code in order to handle exceptions and make it readable at the same time. You’ll also likely want to heavily reuse code. Having generic functions such as getSiteHTML and getTitle (complete with thorough exception handling) makes it easy to quickly—and reliably—scrape the web.
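The function getSiteHTML is only mentioned by name here, so the following is just one possible sketch (not a canonical implementation) of how the scraper above could be split into a pair of reusable helpers along those lines:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getSiteHTML(url):
    # Fetch a URL and return a BeautifulSoup object, or None on any failure
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        print(e)
        return None
    return BeautifulSoup(html.read(), 'html.parser')

def getTitle(url):
    # Return the first <h1> on the page, or None if the page or tag is missing
    bs = getSiteHTML(url)
    if bs is None:
        return None
    try:
        return bs.body.h1
    except AttributeError:
        return None

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
print(title if title is not None else 'Title could not be found')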
