Chapter 4. Web Crawling Models
Writing clean and scalable code is difficult enough when you have control over your data and your inputs. Writing code for web crawlers, which may need to scrape and store a variety of data from diverse sets of websites that the programmer has no control over, often presents unique organizational challenges.
You may be asked to collect news articles or blog posts from a variety of websites, each with different templates and layouts. One website's h1 tag contains the title of the article, another's h1 tag contains the title of the website itself, and the article title is in <span id="title">.
You may need flexible control over which websites are scraped and how they're scraped, as well as a way to quickly add new websites or modify existing ones without writing more than a few lines of code.
You may be asked to scrape product prices from different websites, with the ultimate aim of comparing prices for the same product. Perhaps these prices are in different currencies, and perhaps you’ll also need to combine this with external data from some other nonweb source.
Although the applications of web crawlers are nearly endless, large scalable crawlers tend to fall into one of several patterns. By learning these patterns and recognizing the situations they apply to, you can vastly improve the maintainability and robustness of your web crawlers.
This chapter focuses primarily on web crawlers that collect a limited number of “types” of data (such as restaurant reviews, news articles, company profiles) from a variety of websites, and that store these data types as Python objects that read and write from a database.
Planning and Defining Objects
One common trap of web scraping is defining the data that you want to collect based entirely on what’s available in front of your eyes. For instance, if you want to collect product data, you may first look at a clothing store and decide that each product you scrape needs to have the following fields:
- Product name
- Price
- Description
- Sizes
- Colors
- Fabric type
- Customer rating
Looking at another website, you find that it has SKUs (stock keeping units, used to track and order items) listed on the page. You definitely want to collect that data as well, even if it doesn’t appear on the first site! You add this field:
- Item SKU
Although clothing may be a great start, you also want to make sure you can extend this crawler to other types of products. You start perusing product sections of other websites and decide you also need to collect this information:
- Hardcover/Paperback
- Matte/Glossy print
- Number of customer reviews
- Link to manufacturer
Clearly, this is an unsustainable approach. Simply adding attributes to your product type every time you see a new piece of information on a website will lead to far too many fields to keep track of. Not only that, but every time you scrape a new website, you'll be forced to perform a detailed analysis of the fields the website has and the fields you've accumulated so far, and potentially add new fields (modifying your Python object type and your database structure). The result is a messy, difficult-to-read dataset that causes problems down the line when you try to use it.
One of the best things you can do when deciding which data to collect is often to ignore the websites altogether. You don’t start a project that’s designed to be large and scalable by looking at a single website and saying, “What exists?” but by saying, “What do I need?” and then finding ways to seek the information that you need from there.
Perhaps what you really want to do is compare product prices among multiple stores and track those product prices over time. In this case, you need enough information to uniquely identify the product, and that’s it:
- Product title
- Manufacturer
- Product ID number (if available/relevant)
It's important to note that none of this information is specific to a particular store. By contrast, product reviews, ratings, price, and even description are specific to the instance of that product at a particular store, and can be stored separately.
Other information (colors the product comes in, what it's made of) is specific to the product but may be sparse; it isn't applicable to every product. It's important to take a step back and run through a checklist for each item you consider, asking yourself the following questions:
- Will this information help with the project goals? Will it be a roadblock if I don't have it, or is it just "nice to have" but won't ultimately impact anything?
- If it might help in the future, but I'm unsure, how difficult will it be to go back and collect the data at a later time?
- Is this data redundant to data I've already collected?
- Does it make logical sense to store the data within this particular object? (As mentioned before, storing a description in a product doesn't make sense if that description changes from site to site for the same product.)
If you do decide that you need to collect the data, it’s important to ask a few more questions to then decide how to store and handle it in code:
- Is this data sparse or dense? Will it be relevant and populated in every listing, or just a handful out of the set?
- How large is the data?
- Especially in the case of large data, will I need to regularly retrieve it every time I run my analysis, or only on occasion?
- How variable is this type of data? Will I regularly need to add new attributes, modify types (such as fabric patterns, which may be added frequently), or is it set in stone (shoe sizes)?
Let’s say you plan to do some meta analysis around product attributes and prices: for example, the number of pages a book has, or the type of fabric a piece of clothing is made of, and potentially other attributes in the future, correlated to price. You run through the questions and realize that this data is sparse (relatively few products have any one of the attributes), and that you may decide to add or remove attributes frequently. In this case, it may make sense to create a product type that looks like this:
- Product title
- Manufacturer
- Product ID number (if available/relevant)
- Attributes (optional list or dictionary)
And an attribute type that looks like this:
- Attribute name
- Attribute value
This allows you to flexibly add new product attributes over time, without requiring you to redesign your data schema or rewrite code. When deciding how to store these attributes in the database, you can write JSON to the attribute field, or store each attribute in a separate table with a product ID. See Chapter 6 for more information about implementing these types of database models.
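As a rough sketch of this model (using Python dataclasses; the field and attribute names here are illustrative, not prescribed by the text), the two types might look like this:

from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str
    value: str

@dataclass
class Product:
    title: str
    manufacturer: str
    productId: str = ''
    attributes: list = field(default_factory=list)

# Hypothetical usage: sparse attributes are attached only where they exist
shirt = Product('Example T-Shirt', 'Example Brand', 'SKU-123')
shirt.attributes.append(Attribute('fabric', 'cotton'))
shirt.attributes.append(Attribute('color', 'navy'))

Because the attributes live in a list, new attribute names can appear over time without any change to the Product definition itself.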
You can apply the preceding questions to the other information you’ll need to store as well. For keeping track of the prices found for each product, you’ll likely need the following:
- Product ID
- Store ID
- Price
- Date/Timestamp price was found at
But what if you have a situation in which the product’s attributes actually modify the price of the product? For instance, stores might charge more for a large shirt than a small one, because the large shirt requires more labor or materials. In this case, you may consider splitting the single shirt product into separate product listings for each size (so that each shirt product can be priced independently) or creating a new item type to store information about instances of a product, containing these fields:
- Product ID
- Instance type (the size of the shirt, in this case)
And each price would then look like this:
- Product Instance ID
- Store ID
- Price
- Date/Timestamp price was found at
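To make this concrete, here is a minimal sketch of those two record types (the field names are illustrative only):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProductInstance:
    instanceId: int
    productId: int
    instanceType: str   # the size of the shirt, in this case

@dataclass
class PriceRecord:
    productInstanceId: int
    storeId: int
    price: float
    foundAt: datetime

# Hypothetical usage: the same shirt in two sizes, priced independently
small = ProductInstance(1, 42, 'S')
large = ProductInstance(2, 42, 'L')
prices = [
    PriceRecord(small.instanceId, 7, 19.99, datetime.now()),
    PriceRecord(large.instanceId, 7, 24.99, datetime.now()),
]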
While the subject of “products and prices” may seem overly specific, the basic questions you need to ask yourself, and the logic used when designing your Python objects, apply in almost every situation.
If you’re scraping news articles, you may want basic information such as the following:
- Title
- Author
- Date
- Content
But say some articles contain a "revision date," "related articles," or a "number of social media shares." Do you need these? Are they relevant to your project? How do you efficiently and flexibly store the number of social media shares when not all news sites use all forms of social media, and social media sites may grow or wane in popularity over time?
It can be tempting, when faced with a new project, to dive in and start writing Python to scrape websites immediately. The data model, left as an afterthought, often becomes strongly influenced by the availability and format of the data on the first website you scrape.
However, the data model is the underlying foundation of all the code that uses it. A poor decision in your model can easily lead to problems writing and maintaining code down the line, or difficulty in extracting and efficiently using the resulting data. Especially when dealing with a variety of websites—both known and unknown—it becomes vital to give serious thought and planning to what, exactly, you need to collect and how you need to store it.
Dealing with Different Website Layouts
One of the most impressive feats of a search engine such as Google is that it manages to extract relevant and useful data from a variety of websites, having no upfront knowledge about the website structure itself. Although we, as humans, are able to immediately identify the title and main content of a page (barring instances of extremely poor web design), it is far more difficult to get a bot to do the same thing.
Fortunately, in most cases of web crawling, you’re not looking to collect data from sites you’ve never seen before, but from a few, or a few dozen, websites that are pre-selected by a human. This means that you don’t need to use complicated algorithms or machine learning to detect which text on the page “looks most like a title” or which is probably the “main content.” You can determine what these elements are manually.
The most obvious approach is to write a separate web crawler or page parser for each website. Each might take in a URL, string, or BeautifulSoup object, and return a Python object for the thing that was scraped.
The following is an example of a Content class (representing a piece of content on a website, such as a news article) and two scraper functions that take in a BeautifulSoup object and return an instance of Content:
import requests
from bs4 import BeautifulSoup


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


def getPage(url):
    # Fetch a page and return it as a parsed BeautifulSoup object
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')


def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find('h1').text
    lines = bs.select('div.StoryBodyCompanionColumn div p')
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)


def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find('h1').text
    body = bs.find('div', {'class': 'post-body'}).text
    return Content(url, title, body)


url = 'https://www.brookings.edu/blog/future-development/2018/01/26/'\
    'delivering-inclusive-urban-access-3-uncomfortable-truths/'
content = scrapeBrookings(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/'\
    'silicon-valley-immortality.html'
content = scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)
As you start to add scraper functions for additional news sites, you might notice a pattern forming. Every site’s parsing function does essentially the same thing:
- Selects the title element and extracts the text for the title
- Selects the main content of the article
- Selects other content items as needed
- Returns a Content object instantiated with the strings found previously
The only real site-dependent variables here are the CSS selectors used to obtain each piece of information. BeautifulSoup's find and find_all functions take in two arguments (a tag string and a dictionary of key/value attributes), so you can pass these arguments in as parameters that define the structure of the site itself and the location of the target data.
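For instance, a minimal sketch of that idea, using a pair of made-up site definitions rather than the real sites scraped later in this chapter, might look like this:

from bs4 import BeautifulSoup

# Hypothetical site definitions: a tag name plus an attribute dictionary
# for each field, passed to find() as plain data
siteDefs = {
    'siteA': {'titleTag': 'h1', 'titleAttrs': {}},
    'siteB': {'titleTag': 'span', 'titleAttrs': {'id': 'title'}},
}

def getTitle(bs, siteDef):
    elem = bs.find(siteDef['titleTag'], siteDef['titleAttrs'])
    return elem.get_text() if elem is not None else ''

html = '<html><body><span id="title">My Page Title</span></body></html>'
print(getTitle(BeautifulSoup(html, 'html.parser'), siteDefs['siteB']))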
To make things even more convenient, rather than dealing with all of these tag arguments and key/value pairs, you can use the BeautifulSoup select function with a single string CSS selector for each piece of information you want to collect, and put all of these selectors in a dictionary object:
class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        """
        Flexible printing function controls output
        """
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY:\n{}'.format(self.body))


class Website:
    """
    Contains information about website structure
    """
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag
Note that the Website class does not store information collected from the individual pages themselves, but stores instructions about how to collect that data. It doesn't store the title "My Page Title." It simply stores the string tag h1 that indicates where titles can be found. This is why the class is called Website (the information here pertains to the entire website) and not Content (which contains information from just a single page).
Using these Content and Website classes, you can then write a Crawler to scrape the title and content of any URL that is provided for a given web page from a given website:
import requests
from bs4 import BeautifulSoup


class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        """
        Utility function used to get a content string from a
        Beautiful Soup object and a selector. Returns an empty
        string if no object is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()
And here’s the code that defines the website objects and kicks off the process:
crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com',
        'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com',
        'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu',
        'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com',
        'h1', 'div.StoryBodyCompanionColumn div p']
]
websites = []
for row in siteData:
    websites.append(Website(row[0], row[1], row[2], row[3]))

crawler.parse(websites[0], 'http://shop.oreilly.com/product/'\
    '0636920028154.do')
crawler.parse(websites[1], 'http://www.reuters.com/article/'\
    'us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/'\
    'techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/'\
    '28/business/energy-environment/oil-boom.html')
At first glance, this new method might not seem remarkably simpler than writing a separate Python function for each new website, but imagine what happens when you go from a system with 4 website sources to a system with 20 or 200.
Each list of strings is relatively easy to write. It doesn't take up much space. It can be loaded from a database or a CSV file. It can be imported from a remote source or handed off to a nonprogrammer with some frontend experience, who can fill it out and add new websites without ever looking at a line of code.
Of course, the downside is that you are giving up a certain amount of flexibility. In the first example, each website gets its own free-form function to select and parse HTML however necessary, in order to get the end result. In the second example, each website needs to have a certain structure in which fields are guaranteed to exist, data must be clean coming out of the field, and each target field must have a unique and reliable CSS selector.
However, I believe that the power and relative flexibility of this approach more than makes up for its real or perceived shortcomings. The next section covers specific applications and expansions of this basic template so that you can, for example, deal with missing fields, collect different types of data, crawl only through specific parts of a website, and store more-complex information about pages.
Structuring Crawlers
Creating flexible and modifiable website layout types doesn’t do much good if you still have to locate each link you want to scrape by hand. The previous chapter showed various methods of crawling through websites and finding new pages in an automated way.
This section shows how to incorporate these methods into a well-structured and expandable website crawler that can gather links and discover data in an automated way. I present just three basic web crawler structures here, although I believe that they apply to the majority of situations that you will likely need when crawling sites in the wild, perhaps with a few modifications here and there. If you encounter an unusual situation with your own crawling problem, I also hope that you will use these structures as inspiration in order to create an elegant and robust crawler design.
Crawling Sites Through Search
One of the easiest ways to crawl a website is via the same method that humans do: using the search bar. Although the process of searching a website for a keyword or topic and collecting a list of search results may seem like a task with a lot of variability from site to site, several key points make this surprisingly trivial:
- Most sites retrieve a list of search results for a particular topic by passing that topic as a string through a parameter in the URL, for example: http://example.com?search=myTopic. The first part of this URL can be saved as a property of the Website object, and the topic can simply be appended to it.
- After searching, most sites present the resulting pages as an easily identifiable list of links, usually with a convenient surrounding tag such as <span class="result">, the exact format of which can also be stored as a property of the Website object.
- Each result link is either a relative URL (e.g., /articles/page.html) or an absolute URL (e.g., http://example.com/articles/page.html). Whether you are expecting an absolute or a relative URL can be stored as a property of the Website object.
- After you've located and normalized the URLs on the search page, you've successfully reduced the problem to the example in the previous section: extracting data from a page, given a website format.
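As a hedged alternative to the absoluteUrl flag used in the code that follows, the standard library's urljoin can handle that normalization step, resolving a relative link against the page it was found on and passing absolute links through unchanged:

from urllib.parse import urljoin

print(urljoin('http://example.com/search', '/articles/page.html'))
# http://example.com/articles/page.html
print(urljoin('http://example.com/search', 'http://example.com/articles/page.html'))
# http://example.com/articles/page.html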
Let's look at an implementation of this algorithm in code. The Content class is much the same as in previous examples, with the addition of a topic property that records the search topic the content was found under:
class Content:
    """Common base class for all articles/pages"""

    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.title = title
        self.body = body
        self.url = url

    def print(self):
        """
        Flexible printing function controls output
        """
        print('New article found for topic: {}'.format(self.topic))
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY:\n{}'.format(self.body))
The Website class has a few new properties added to it. The searchUrl defines where you should go to get search results if you append the topic you are looking for. The resultListing defines the "box" that holds information about each result, and the resultUrl defines the tag inside this box that will give you the exact URL for the result. The absoluteUrl property is a boolean that tells you whether these search results are absolute or relative URLs.
class Website:
    """Contains information about website structure"""

    def __init__(self, name, url, searchUrl, resultListing,
                 resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl
        self.resultListing = resultListing
        self.resultUrl = resultUrl
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag
crawler.py has been expanded a bit and contains our Website data, a list of topics to search for, and two loops that iterate through all the topics and all the websites. It also contains a search function that navigates to the search page for a particular website and topic, and extracts all the result URLs listed on that page.
import requests
from bs4 import BeautifulSoup


class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        childObj = pageObj.select(selector)
        if childObj is not None and len(childObj) > 0:
            return childObj[0].get_text()
        return ''

    def search(self, topic, site):
        """
        Searches a given website for a given topic and records all pages found
        """
        bs = self.getPage(site.searchUrl + topic)
        searchResults = bs.select(site.resultListing)
        for result in searchResults:
            url = result.select(site.resultUrl)[0].attrs['href']
            # Check to see whether it's a relative or an absolute URL
            if site.absoluteUrl:
                bs = self.getPage(url)
            else:
                bs = self.getPage(site.url + url)
            if bs is None:
                print('Something was wrong with that page or URL. Skipping!')
                continue
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(topic, url, title, body)
                content.print()


crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com',
        'https://ssearch.oreilly.com/?q=', 'article.product-result',
        'p.title a', True, 'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com',
        'http://www.reuters.com/search/news?blob=',
        'div.search-result-content', 'h3.search-result-title a',
        False, 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu',
        'https://www.brookings.edu/search/?s=',
        'div.list-content article', 'h4.title a',
        True, 'h1', 'div.post-body']
]
sites = []
for row in siteData:
    sites.append(Website(row[0], row[1], row[2], row[3],
                         row[4], row[5], row[6], row[7]))

topics = ['python', 'data science']
for topic in topics:
    print('GETTING INFO ABOUT: ' + topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)
This script loops through all the topics in the topics list and announces before it starts scraping for a topic:
GETTING INFO ABOUT: python
Then it loops through all of the sites in the sites list and crawls each particular site for each particular topic. Each time it successfully scrapes information about a page, it prints it to the console:
New article found for topic: python
URL: http://example.com/examplepage.html
TITLE: Page Title Here
BODY:
Body content is here
Note that it loops through all topics and then loops through all websites in the inner loop. Why not do it the other way around, collecting all topics from one website, and then all topics from the next website? Looping through all topics first is a way to more evenly distribute the load placed on any one web server. This is especially important if you have a list of hundreds of topics and dozens of websites. You’re not making tens of thousands of requests to one website at once; you’re making 10 requests, waiting a few minutes, making another 10 requests, waiting a few minutes, and so forth.
Although the number of requests is ultimately the same either way, it’s generally better to distribute these requests over time as much as is reasonable. Paying attention to how your loops are structured is an easy way to do this.
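A rough sketch of one way to make that pacing explicit, reusing the crawler, sites, and topics defined in the example above and adding an arbitrary pause after each search:

import time

for topic in topics:
    print('GETTING INFO ABOUT: ' + topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)
        time.sleep(5)  # arbitrary pause before moving on to the next site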
Crawling Sites Through Links
The previous chapter covered some ways of identifying internal and external links on web pages and then using those links to crawl across the site. In this section, you’ll combine those same basic methods into a more flexible website crawler that can follow any link matching a specific URL pattern.
This type of crawler works well for projects when you want to gather all the data from a site—not just data from a specific search result or page listing. It also works well when the site’s pages may be disorganized or widely dispersed.
These types of crawlers don't require a structured method of locating links, as in the previous section on crawling through search pages, so the attributes that describe the search page aren't required in the Website object. However, because the crawler isn't given specific instructions for the locations of the links it's looking for, you do need some rules that tell it what sorts of pages to select. You provide a targetPattern (a regular expression for the target URLs) and keep the boolean absoluteUrl variable to accomplish this:
class Website:
    def __init__(self, name, url, targetPattern, absoluteUrl,
                 titleTag, bodyTag):
        self.name = name
        self.url = url
        self.targetPattern = targetPattern
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY:\n{}'.format(self.body))
The Content class is the same one used in the first crawler example. The Crawler class is written to start from the home page of each site, locate internal links, and parse the content from each internal link found:
import re
import requests
from bs4 import BeautifulSoup


class Crawler:

    def __init__(self, site):
        self.site = site
        self.visited = []

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, url):
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, self.site.titleTag)
            body = self.safeGet(bs, self.site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

    def crawl(self):
        """
        Get pages from website home page
        """
        bs = self.getPage(self.site.url)
        targetPages = bs.find_all('a', href=re.compile(self.site.targetPattern))
        for targetPage in targetPages:
            targetPage = targetPage.attrs['href']
            if targetPage not in self.visited:
                self.visited.append(targetPage)
                if not self.site.absoluteUrl:
                    targetPage = '{}{}'.format(self.site.url, targetPage)
                self.parse(targetPage)


reuters = Website('Reuters', 'https://www.reuters.com', '^(/article/)',
    False, 'h1', 'div.StandardArticleBody_body_1gnLA')
crawler = Crawler(reuters)
crawler.crawl()
Another change here that was not used in previous examples: the Website object (in this case, the variable reuters) is a property of the Crawler object itself. This works well to store the visited pages (visited) in the crawler, but means that a new crawler must be instantiated for each website rather than reusing the same one to crawl a list of websites.
Whether you choose to make a crawler website-agnostic or choose to make the website an attribute of the crawler is a design decision that you must weigh in the context of your own specific needs. Either approach is generally fine.
Another thing to note is that this crawler will get the pages from the home page, but will not continue crawling after all those pages have been logged. You may want to write a crawler incorporating one of the patterns in Chapter 3 and have it look for more targets on each page it visits. You can even follow all the URLs on each page (not just ones matching the target pattern) to look for URLs containing the target pattern.
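As a rough sketch of that idea (a modification of the Crawler class above, not code from this chapter's examples), the crawl method could call itself on each matching page it finds, so the crawl continues beyond the links on the home page:

    def crawl(self, url=None):
        """
        Recursively follow links matching the target pattern,
        starting from the site's home page
        """
        if url is None:
            url = self.site.url
        bs = self.getPage(url)
        if bs is None:
            return
        targetPages = bs.find_all('a', href=re.compile(self.site.targetPattern))
        for targetPage in targetPages:
            targetPage = targetPage.attrs['href']
            if targetPage not in self.visited:
                self.visited.append(targetPage)
                if not self.site.absoluteUrl:
                    targetPage = '{}{}'.format(self.site.url, targetPage)
                self.parse(targetPage)
                self.crawl(targetPage)  # keep looking for targets on this page

In practice you would likely want to cap the recursion depth or the total number of pages visited.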
Crawling Multiple Page Types
Unlike crawling through a predetermined set of pages, crawling through all internal links on a website can present a challenge in that you never know exactly what you’re getting. Fortunately, there are a few basic ways to identify the page type:
- By the URL: All blog posts on a website might share a URL structure (http://example.com/blog/title-of-post, for example).
- By the presence or lack of certain fields: If a page has a date but no author name, you might categorize it as a press release. If it has a title, main image, and price, but no main content, it might be a product page.
- By the presence of certain tags on the page: You can take advantage of tags even if you're not collecting the data within them. Your crawler might look for an element such as <div id="related-products"> to identify the page as a product page, even though the crawler is not interested in the content of the related products.
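For instance, a minimal sketch of the tag-presence approach, given a BeautifulSoup object bs (the selectors here are hypothetical):

def identifyPageType(bs):
    # Classify a page by which telltale elements are present on it
    if bs.find('div', {'id': 'related-products'}) is not None:
        return 'product'
    if bs.find('time', {'class': 'published'}) is not None:
        return 'article'
    return 'unknown'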
To keep track of multiple page types, you need to have multiple types of page objects in Python. This can be done in two ways. If the pages are all similar (they all have basically the same types of content), you may want to add a pageType attribute to your existing web-page object:
class Website:
    def __init__(self, name, url, titleTag, bodyTag, pageType):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag
        self.pageType = pageType
If you're storing these pages in an SQL-like database, this type of pattern indicates that all these pages would probably be stored in the same table, and that an extra pageType column would be added.
If the pages/content you’re scraping are different enough from each other (they contain different types of fields), this may warrant creating new objects for each page type. Of course, some things will be common to all web pages—they will all have a URL, and will likely also have a name or page title. This is an ideal situation in which to use subclasses:
class Webpage:
    def __init__(self, name, url, titleTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
This is not an object that will be used directly by your crawler, but an object that will be referenced by your page types:
class Product(Webpage):
    """Contains information for scraping a product page"""
    def __init__(self, name, url, titleTag, productNumberTag, priceTag):
        Webpage.__init__(self, name, url, titleTag)
        self.productNumberTag = productNumberTag
        self.priceTag = priceTag


class Article(Webpage):
    """Contains information for scraping an article page"""
    def __init__(self, name, url, titleTag, bodyTag, dateTag):
        Webpage.__init__(self, name, url, titleTag)
        self.bodyTag = bodyTag
        self.dateTag = dateTag
The Product class extends the Webpage base class and adds the attributes productNumberTag and priceTag, which apply only to products, while the Article class adds the attributes bodyTag and dateTag, which don't apply to products.
You can use these two classes to scrape, for example, a store website that might contain blog posts or press releases in addition to products.
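As a rough illustration (the selectors and the URL test below are hypothetical, not taken from a real store), a crawler could pick which page definition to apply based on the URL pattern described earlier:

# Hypothetical page definitions for a single store website
productDef = Product('Example Store', 'http://example.com', 'h1',
                     'span.sku', 'span.price')
articleDef = Article('Example Store Blog', 'http://example.com', 'h1',
                     'div.post-body', 'time.published')

def chooseDefinition(url):
    # Blog posts live under /blog/; everything else is treated as a product
    return articleDef if '/blog/' in url else productDef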
Thinking About Web Crawler Models
Collecting information from the internet can be like drinking from a fire hose. There’s a lot of stuff out there, and it’s not always clear what you need or how you need it. The first step of any large web scraping project (and even some of the small ones) should be to answer these questions.
When collecting similar data across multiple domains or from multiple sources, your goal should almost always be to try to normalize it. Dealing with data with identical and comparable fields is much easier than dealing with data that is completely dependent on the format of its original source.
In many cases, you should build scrapers under the assumption that more sources of data will be added to them in the future, and with the goal to minimize the programming overhead required to add these new sources. Even if a website doesn’t appear to fit your model at first glance, there may be more subtle ways that it does conform. Being able to see these underlying patterns can save you time, money, and a lot of headaches in the long run.
The connections between pieces of data should also not be ignored. Are you looking for information that has properties such as “type,” “size,” or “topic” that span across data sources? How do you store, retrieve, and conceptualize these attributes?
Software architecture is a broad and important topic that can take an entire career to master. Fortunately, software architecture for web scraping is a much more finite and manageable set of skills that can be relatively easily acquired. As you continue to scrape data, you will likely find the same basic patterns occurring over and over. Creating a well-structured web scraper doesn’t require a lot of arcane knowledge, but it does require taking a moment to step back and think about your project.