Chapter 4. Web Crawling Models
Writing clean and scalable code is difficult enough when you have control over your data and your inputs. Writing code for web crawlers, which may need to scrape and store a variety of data from diverse sets of websites that the programmer has no control over, often presents unique organizational challenges.
You may be asked to collect news articles or blog posts from a variety of websites, each with different templates and layouts. One website's h1 tag contains the title of the article, another's h1 tag contains the title of the website itself, and the article title is in <span id="title">.
You may need flexible control over which websites are scraped and how they're scraped, as well as a way to quickly add new websites or modify existing ones without writing more than a few lines of code.
You may be asked to scrape product prices from different websites, with the ultimate aim of comparing prices for the same product. Perhaps these prices are in different currencies, and perhaps you’ll also need to combine this with external data from some other nonweb source.
Although the applications of web crawlers are nearly endless, large scalable crawlers tend to fall into one of several patterns. By learning these patterns and recognizing the situations they apply to, you can vastly improve the maintainability and robustness of your web crawlers.
This chapter focuses primarily on web crawlers that collect a limited number of “types” of data (such as restaurant reviews, news articles, company profiles) from a variety of websites, and that store these data types as Python objects that read and write from a database.
Planning and Defining Objects
One common trap of web scraping is defining the data that you want to collect based entirely on what’s available in front of your eyes. For instance, if you want to collect product data, you may first look at a clothing store and decide that each product you scrape needs to have the following fields:
- Product name
- Price
- Description
- Sizes
- Colors
- Fabric type
- Customer rating
Looking at another website, you find that it has SKUs (stock keeping units, used to track and order items) listed on the page. You definitely want to collect that data as well, even if it doesn’t appear on the first site! You add this field:
- Item SKU
Although clothing may be a great start, you also want to make sure you can extend this crawler to other types of products. You start perusing product sections of other websites and decide you also need to collect this information:
- Hardcover/Paperback
- Matte/Glossy print
- Number of customer reviews
- Link to manufacturer
Clearly, this is an unsustainable approach. Simply adding attributes to your product type every time you see a new piece of information on a website will lead to far too many fields to keep track of. Not only that, but every time you scrape a new website, you'll be forced to perform a detailed analysis of the fields the website has and the fields you've accumulated so far, and potentially add new fields (modifying your Python object type and your database structure). The result is a messy, difficult-to-read dataset that causes problems down the line when you try to use it.
One of the best things you can do when deciding which data to collect is often to ignore the websites altogether. You don’t start a project that’s designed to be large and scalable by looking at a single website and saying, “What exists?” but by saying, “What do I need?” and then finding ways to seek the information that you need from there.
Perhaps what you really want to do is compare product prices among multiple stores and track those product prices over time. In this case, you need enough information to uniquely identify the product, and that’s it:
- Product title
- Manufacturer
- Product ID number (if available/relevant)
It's important to note that none of this information is specific to a particular store. By contrast, product reviews, ratings, price, and even description are specific to the instance of that product at a particular store, and can be stored separately.
Other information (colors the product comes in, what it's made of) is specific to the product but may be sparse; it isn't applicable to every product. It's important to take a step back and run through a checklist for each item you consider, asking yourself the following questions:
- Will this information help with the project goals? Will it be a roadblock if I don't have it, or is it just "nice to have" but won't ultimately impact anything?
- If it might help in the future, but I'm unsure, how difficult will it be to go back and collect the data at a later time?
- Is this data redundant to data I've already collected?
- Does it make logical sense to store the data within this particular object? (As mentioned before, storing a description in a product doesn't make sense if that description changes from site to site for the same product.)
If you do decide that you need to collect the data, it’s important to ask a few more questions to then decide how to store and handle it in code:
- Is this data sparse or dense? Will it be relevant and populated in every listing, or just a handful out of the set?
- How large is the data?
- Especially in the case of large data, will I need to regularly retrieve it every time I run my analysis, or only on occasion?
- How variable is this type of data? Will I regularly need to add new attributes, modify types (such as fabric patterns, which may be added frequently), or is it set in stone (shoe sizes)?
Let’s say you plan to do some meta analysis around product attributes and prices: for example, the number of pages a book has, or the type of fabric a piece of clothing is made of, and potentially other attributes in the future, correlated to price. You run through the questions and realize that this data is sparse (relatively few products have any one of the attributes), and that you may decide to add or remove attributes frequently. In this case, it may make sense to create a product type that looks like this:
- Product title
- Manufacturer
- Product ID number (if available/relevant)
- Attributes (optional list or dictionary)
And an attribute type that looks like this:
- Attribute name
- Attribute value
This allows you to flexibly add new product attributes over time, without requiring you to redesign your data schema or rewrite code. When deciding how to store these attributes in the database, you can write JSON to the attribute field, or store each attribute in a separate table with a product ID. See Chapter 6 for more information about implementing these types of database models.
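As a rough sketch of this model (using Python dataclasses; the field and attribute names here are illustrative, not prescribed by the text), the two types might look like this:

from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str
    value: str

@dataclass
class Product:
    title: str
    manufacturer: str
    productId: str = ''
    attributes: list = field(default_factory=list)

# Hypothetical usage: sparse attributes are attached only where they exist
shirt = Product('Example T-Shirt', 'Example Brand', 'SKU-123')
shirt.attributes.append(Attribute('fabric', 'cotton'))
shirt.attributes.append(Attribute('color', 'navy'))

Because the attributes live in a list, new attribute names can appear over time without any change to the Product definition itself.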
You can apply the preceding questions to the other information you’ll need to store as well. For keeping track of the prices found for each product, you’ll likely need the following:
- Product ID
- Store ID
- Price
- Date/Timestamp price was found at
But what if you have a situation in which the product’s attributes actually modify the price of the product? For instance, stores might charge more for a large shirt than a small one, because the large shirt requires more labor or materials. In this case, you may consider splitting the single shirt product into separate product listings for each size (so that each shirt product can be priced independently) or creating a new item type to store information about instances of a product, containing these fields:
- Product ID
- Instance type (the size of the shirt, in this case)
And each price would then look like this:
- Product Instance ID
- Store ID
- Price
- Date/Timestamp price was found at
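To make this concrete, here is a minimal sketch of those two record types (the field names are illustrative only):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProductInstance:
    instanceId: int
    productId: int
    instanceType: str   # the size of the shirt, in this case

@dataclass
class PriceRecord:
    productInstanceId: int
    storeId: int
    price: float
    foundAt: datetime

# Hypothetical usage: the same shirt in two sizes, priced independently
small = ProductInstance(1, 42, 'S')
large = ProductInstance(2, 42, 'L')
prices = [
    PriceRecord(small.instanceId, 7, 19.99, datetime.now()),
    PriceRecord(large.instanceId, 7, 24.99, datetime.now()),
]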
While the subject of “products and prices” may seem overly specific, the basic questions you need to ask yourself, and the logic used when designing your Python objects, apply in almost every situation.
If you’re scraping news articles, you may want basic information such as the following:
- Title
- Author
- Date
- Content
But say some articles contain a "revision date," "related articles," or a "number of social media shares." Do you need these? Are they relevant to your project? How do you efficiently and flexibly store the number of social media shares when not all news sites use all forms of social media, and social media sites may grow or wane in popularity over time?
It can be tempting, when faced with a new project, to dive in and start writing Python to scrape websites immediately. The data model, left as an afterthought, often becomes strongly influenced by the availability and format of the data on the first website you scrape.
However, the data model is the underlying foundation of all the code that uses it. A poor decision in your model can easily lead to problems writing and maintaining code down the line, or difficulty in extracting and efficiently using the resulting data. Especially when dealing with a variety of websites—both known and unknown—it becomes vital to give serious thought and planning to what, exactly, you need to collect and how you need to store it.
Dealing with Different Website Layouts
One of the most impressive feats of a search engine such as Google is that it manages to extract relevant and useful data from a variety of websites, having no upfront knowledge about the website structure itself. Although we, as humans, are able to immediately identify the title and main content of a page (barring instances of extremely poor web design), it is far more difficult to get a bot to do the same thing.
Fortunately, in most cases of web crawling, you’re not looking to collect data from sites you’ve never seen before, but from a few, or a few dozen, websites that are pre-selected by a human. This means that you don’t need to use complicated algorithms or machine learning to detect which text on the page “looks most like a title” or which is probably the “main content.” You can determine what these elements are manually.
The most obvious approach is to write a separate web crawler or page parser for each website. Each might take in a URL, string, or BeautifulSoup object, and return a Python object for the thing that was scraped.
The following is an example of a Content class (representing a piece of content on a website, such as a news article) and two scraper functions that take in a BeautifulSoup object and return an instance of Content:
import requests
from bs4 import BeautifulSoup


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


def getPage(url):
    # Fetch a page and return it as a parsed BeautifulSoup object
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')


def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find('h1').text
    lines = bs.select('div.StoryBodyCompanionColumn div p')
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)


def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find('h1').text
    body = bs.find('div', {'class': 'post-body'}).text
    return Content(url, title, body)


url = 'https://www.brookings.edu/blog/future-development/2018/01/26/'\
    'delivering-inclusive-urban-access-3-uncomfortable-truths/'
content = scrapeBrookings(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/'\
    'silicon-valley-immortality.html'
content = scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)
As you start to add scraper functions for additional news sites, you might notice a pattern forming. Every site’s parsing function does essentially the same thing:
- Selects the title element and extracts the text for the title
- Selects the main content of the article
- Selects other content items as needed
- Returns a Content object instantiated with the strings found previously
The only real site-dependent variables here are the CSS selectors used to obtain each piece of information. BeautifulSoup's find and find_all functions take in two arguments (a tag string and a dictionary of key/value attributes), so you can pass these arguments in as parameters that define the structure of the site itself and the location of the target data.
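For instance, a minimal sketch of that idea, using a pair of made-up site definitions rather than the real sites scraped later in this chapter, might look like this:

from bs4 import BeautifulSoup

# Hypothetical site definitions: a tag name plus an attribute dictionary
# for each field, passed to find() as plain data
siteDefs = {
    'siteA': {'titleTag': 'h1', 'titleAttrs': {}},
    'siteB': {'titleTag': 'span', 'titleAttrs': {'id': 'title'}},
}

def getTitle(bs, siteDef):
    elem = bs.find(siteDef['titleTag'], siteDef['titleAttrs'])
    return elem.get_text() if elem is not None else ''

html = '<html><body><span id="title">My Page Title</span></body></html>'
print(getTitle(BeautifulSoup(html, 'html.parser'), siteDefs['siteB']))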
To make things even more convenient, rather than dealing with all of these tag arguments and key/value pairs, you can use the BeautifulSoup select function with a single string CSS selector for each piece of information you want to collect, and put all of these selectors in a dictionary object:
class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        """
        Flexible printing function controls output
        """
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY:\n{}'.format(self.body))


class Website:
    """
    Contains information about website structure
    """
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag
Note that the Website class does not store information collected from the individual pages themselves, but stores instructions about how to collect that data. It doesn't store the title "My Page Title." It simply stores the string tag h1 that indicates where titles can be found. This is why the class is called Website (the information here pertains to the entire website) and not Content (which contains information from just a single page).
Using these Content and Website classes, you can then write a Crawler to scrape the title and content of any URL that is provided for a given web page from a given website:
import requests
from bs4 import BeautifulSoup


class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        """
        Utility function used to get a content string from a
        Beautiful Soup object and a selector. Returns an empty
        string if no object is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()
And here’s the code that defines the website objects and kicks off the process:
crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com',
        'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com',
        'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu',
        'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com',
        'h1', 'div.StoryBodyCompanionColumn div p']
]
websites = []
for row in siteData:
    websites.append(Website(row[0], row[1], row[2], row[3]))

crawler.parse(websites[0], 'http://shop.oreilly.com/product/'\
    '0636920028154.do')
crawler.parse(websites[1], 'http://www.reuters.com/article/'\
    'us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/'\
    'techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/'\
    '28/business/energy-environment/oil-boom.html')
At first glance, this new method might not seem remarkably simpler than writing a separate Python function for each new website, but imagine what happens when you go from a system with 4 website sources to a system with 20 or 200.
Each list of strings is relatively easy to write. It doesn't take up much space. It can be loaded from a database or a CSV file. It can be imported from a remote source or handed off to a nonprogrammer with some frontend experience, who can fill it out and add new websites without ever looking at a line of code.
Of course, the downside is that you are giving up a certain amount of flexibility. In the first example, each website gets its own free-form function to select and parse HTML however necessary, in order to get the end result. In the second example, each website needs to have a certain structure in which fields are guaranteed to exist, data must be clean coming out of the field, and each target field must have a unique and reliable CSS selector.
However, I believe that the power and relative flexibility of this approach more than makes up for its real or perceived shortcomings. The next section covers specific applications and expansions of this basic template so that you can, for example, deal with missing fields, collect different types of data, crawl only through specific parts of a website, and store more-complex information about pages.
Structuring Crawlers
Creating flexible and modifiable website layout types doesn’t do much good if you still have to locate each link you want to scrape by hand. The previous chapter showed various methods of crawling through websites and finding new pages in an automated way.
This section shows how to incorporate these methods into a well-structured and expandable website crawler that can gather links and discover data in an automated way. I present just three basic web crawler structures here, although I believe that they apply to the majority of situations that you will likely need when crawling sites in the wild, perhaps with a few modifications here and there. If you encounter an unusual situation with your own crawling problem, I also hope that you will use these structures as inspiration in order to create an elegant and robust crawler design.
Crawling Sites Through Search
One of the easiest ways to crawl a website is via the same method that humans do: using the search bar. Although the process of searching a website for a keyword or topic and collecting a list of search results may seem like a task with a lot of variability from site to site, several key points make this surprisingly trivial:
- Most sites retrieve a list of search results for a particular topic by passing that topic as a string through a parameter in the URL, for example: http://example.com?search=myTopic. The first part of this URL can be saved as a property of the Website object, and the topic can simply be appended to it.
- After searching, most sites present the resulting pages as an easily identifiable list of links, usually with a convenient surrounding tag such as <span class="result">, the exact format of which can also be stored as a property of the Website object.
- Each result link is either a relative URL (e.g., /articles/page.html) or an absolute URL (e.g., http://example.com/articles/page.html). Whether you are expecting an absolute or a relative URL can be stored as a property of the Website object.
- After you've located and normalized the URLs on the search page, you've successfully reduced the problem to the example in the previous section: extracting data from a page, given a website format.
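As a hedged alternative to the absoluteUrl flag used in the code that follows, the standard library's urljoin can handle that normalization step, resolving a relative link against the page it was found on and passing absolute links through unchanged:

from urllib.parse import urljoin

print(urljoin('http://example.com/search', '/articles/page.html'))
# http://example.com/articles/page.html
print(urljoin('http://example.com/search', 'http://example.com/articles/page.html'))
# http://example.com/articles/page.html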
Let's look at an implementation of this algorithm in code. The Content class is much the same as in previous examples, with the addition of a topic property that records the search topic the content was found under:
class Content:
    """Common base class for all articles/pages"""

    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.title = title
        self.body = body
        self.url = url

    def print(self):
        """
        Flexible printing function controls output
        """
        print('New article found for topic: {}'.format(self.topic))
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY:\n{}'.format(self.body))
The Website class has a few new properties added to it. The searchUrl defines where you should go to get search results if you append the topic you are looking for. The resultListing defines the "box" that holds information about each result, and the resultUrl defines the tag inside this box that will give you the exact URL for the result. The absoluteUrl property is a boolean that tells you whether these search results are absolute or relative URLs.
class Website:
    """Contains information about website structure"""

    def __init__(self, name, url, searchUrl, resultListing,
                 resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl
        self.resultListing = resultListing
        self.resultUrl = resultUrl
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag
crawler.py has been expanded a bit and contains our Website data, a list of topics to search for, and two loops that iterate through all the topics and all the websites. It also contains a search function that navigates to the search page for a particular website and topic, and extracts all the result URLs listed on that page.
import requests
from bs4 import BeautifulSoup


class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        childObj = pageObj.select(selector)
        if childObj is not None and len(childObj) > 0:
            return childObj[0].get_text()
        return ''

    def search(self, topic, site):
        """
        Searches a given website for a given topic and records all pages found
        """
        bs = self.getPage(site.searchUrl + topic)
        searchResults = bs.select(site.resultListing)
        for result in searchResults:
            url = result.select(site.resultUrl)[0].attrs['href']
            # Check to see whether it's a relative or an absolute URL
            if site.absoluteUrl:
                bs = self.getPage(url)
            else:
                bs = self.getPage(site.url + url)
            if bs is None:
                print('Something was wrong with that page or URL. Skipping!')
                continue
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(topic, url, title, body)
                content.print()


crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com',
        'https://ssearch.oreilly.com/?q=', 'article.product-result',
        'p.title a', True, 'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com',
        'http://www.reuters.com/search/news?blob=',
        'div.search-result-content', 'h3.search-result-title a',
        False, 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu',
        'https://www.brookings.edu/search/?s=',
        'div.list-content article', 'h4.title a',
        True, 'h1', 'div.post-body']
]
sites = []
for row in siteData:
    sites.append(Website(row[0], row[1], row[2], row[3],
                         row[4], row[5], row[6], row[7]))

topics = ['python', 'data science']
for topic in topics:
    print('GETTING INFO ABOUT: ' + topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)
This script loops through all the topics in the topics list and announces before it starts scraping for a topic:
GETTING INFO ABOUT: python
Then it loops through all of the sites in the sites list and crawls each particular site for each particular topic. Each time it successfully scrapes information about a page, it prints it to the console:
New article found for topic: python
URL: http://example.com/examplepage.html
TITLE: Page Title Here
BODY:
Body content is here
Note that it loops through all topics and then loops through all websites in the inner loop. Why not do it the other way around, collecting all topics from one website, and then all topics from the next website? Looping through all topics first is a way to more evenly distribute the load placed on any one web server. This is especially important if you have a list of hundreds of topics and dozens of websites. You’re not making tens of thousands of requests to one website at once; you’re making 10 requests, waiting a few minutes, making another 10 requests, waiting a few minutes, and so forth.
Although the number of requests is ultimately the same either way, it’s generally better to distribute these requests over time as much as is reasonable. Paying attention to how your loops are structured is an easy way to do this.
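A rough sketch of one way to make that pacing explicit, reusing the crawler, sites, and topics defined in the example above and adding an arbitrary pause after each search:

import time

for topic in topics:
    print('GETTING INFO ABOUT: ' + topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)
        time.sleep(5)  # arbitrary pause before moving on to the next site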
Crawling Sites Through Links
The previous chapter covered some ways of identifying internal and external links on web pages and then using those links to crawl across the site. In this section, you’ll combine those same basic methods into a more flexible website crawler that can follow any link matching a specific URL pattern.
This type of crawler works well for projects when you want to gather all the data from a site—not just data from a specific search result or page listing. It also works well when the site’s pages may be disorganized or widely dispersed.
These types of crawlers don't require a structured method of locating links, as in the previous section on crawling through search pages, so the attributes that describe the search page aren't required in the Website object. However, because the crawler isn't given specific instructions for the locations of the links it's looking for, you do need some rules that tell it what sorts of pages to select. You provide a targetPattern (a regular expression for the target URLs) and keep the boolean absoluteUrl variable to accomplish this:
class Website:
    def __init__(self, name, url, targetPattern, absoluteUrl,
                 titleTag, bodyTag):
        self.name = name
        self.url = url
        self.targetPattern = targetPattern
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY:\n{}'.format(self.body))
The Content class is the same one used in the first crawler example. The Crawler class is written to start from the home page of each site, locate internal links, and parse the content from each internal link found:
import re
import requests
from bs4 import BeautifulSoup


class Crawler:

    def __init__(self, site):
        self.site = site
        self.visited = []

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, url):
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, self.site.titleTag)
            body = self.safeGet(bs, self.site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

    def crawl(self):
        """
        Get pages from website home page
        """
        bs = self.getPage(self.site.url)
        targetPages = bs.find_all('a', href=re.compile(self.site.targetPattern))
        for targetPage in targetPages:
            targetPage = targetPage.attrs['href']
            if targetPage not in self.visited:
                self.visited.append(targetPage)
                if not self.site.absoluteUrl:
                    targetPage = '{}{}'.format(self.site.url, targetPage)
                self.parse(targetPage)


reuters = Website('Reuters', 'https://www.reuters.com', '^(/article/)',
    False, 'h1', 'div.StandardArticleBody_body_1gnLA')
crawler = Crawler(reuters)
crawler.crawl()
Another change here that was not used in previous examples: the Website object (in this case, the variable reuters) is a property of the Crawler object itself. This works well to store the visited pages (visited) in the crawler, but means that a new crawler must be instantiated for each website rather than reusing the same one to crawl a list of websites.
Whether you choose to make a crawler website-agnostic or choose to make the website an attribute of the crawler is a design decision that you must weigh in the context of your own specific needs. Either approach is generally fine.
Another thing to note is that this crawler will get the pages from the home page, but will not continue crawling after all those pages have been logged. You may want to write a crawler incorporating one of the patterns in Chapter 3 and have it look for more targets on each page it visits. You can even follow all the URLs on each page (not just ones matching the target pattern) to look for URLs containing the target pattern.
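As a rough sketch of that idea (a modification of the Crawler class above, not code from this chapter's examples), the crawl method could call itself on each matching page it finds, so the crawl continues beyond the links on the home page:

    def crawl(self, url=None):
        """
        Recursively follow links matching the target pattern,
        starting from the site's home page
        """
        if url is None:
            url = self.site.url
        bs = self.getPage(url)
        if bs is None:
            return
        targetPages = bs.find_all('a', href=re.compile(self.site.targetPattern))
        for targetPage in targetPages:
            targetPage = targetPage.attrs['href']
            if targetPage not in self.visited:
                self.visited.append(targetPage)
                if not self.site.absoluteUrl:
                    targetPage = '{}{}'.format(self.site.url, targetPage)
                self.parse(targetPage)
                self.crawl(targetPage)  # keep looking for targets on this page

In practice you would likely want to cap the recursion depth or the total number of pages visited.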
Crawling Multiple Page Types
Unlike crawling through a predetermined set of pages, crawling through all internal links on a website can present a challenge in that you never know exactly what you’re getting. Fortunately, there are a few basic ways to identify the page type:
- By the URL: All blog posts on a website might share a URL structure (http://example.com/blog/title-of-post, for example).
- By the presence or lack of certain fields: If a page has a date but no author name, you might categorize it as a press release. If it has a title, main image, and price, but no main content, it might be a product page.
- By the presence of certain tags on the page: You can take advantage of tags even if you're not collecting the data within them. Your crawler might look for an element such as <div id="related-products"> to identify the page as a product page, even though the crawler is not interested in the content of the related products.
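For instance, a minimal sketch of the tag-presence approach, given a BeautifulSoup object bs (the selectors here are hypothetical):

def identifyPageType(bs):
    # Classify a page by which telltale elements are present on it
    if bs.find('div', {'id': 'related-products'}) is not None:
        return 'product'
    if bs.find('time', {'class': 'published'}) is not None:
        return 'article'
    return 'unknown'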
To keep track of multiple page types, you need to have multiple types of page objects in Python. This can be done in two ways. If the pages are all similar (they all have basically the same types of content), you may want to add a pageType attribute to your existing web-page object:
class Website:
    def __init__(self, name, url, titleTag, bodyTag, pageType):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag
        self.pageType = pageType
If you're storing these pages in an SQL-like database, this type of pattern indicates that all these pages would probably be stored in the same table, and that an extra pageType column would be added.
If the pages/content you’re scraping are different enough from each other (they contain different types of fields), this may warrant creating new objects for each page type. Of course, some things will be common to all web pages—they will all have a URL, and will likely also have a name or page title. This is an ideal situation in which to use subclasses:
class Webpage:
    def __init__(self, name, url, titleTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
This is not an object that will be used directly by your crawler, but an object that will be referenced by your page types:
class Product(Webpage):
    """Contains information for scraping a product page"""
    def __init__(self, name, url, titleTag, productNumberTag, priceTag):
        Webpage.__init__(self, name, url, titleTag)
        self.productNumberTag = productNumberTag
        self.priceTag = priceTag


class Article(Webpage):
    """Contains information for scraping an article page"""
    def __init__(self, name, url, titleTag, bodyTag, dateTag):
        Webpage.__init__(self, name, url, titleTag)
        self.bodyTag = bodyTag
        self.dateTag = dateTag
The Product class extends the Webpage base class and adds the attributes productNumberTag and priceTag, which apply only to products, while the Article class adds the attributes bodyTag and dateTag, which don't apply to products.
You can use these two classes to scrape, for example, a store website that might contain blog posts or press releases in addition to products.
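As a rough illustration (the selectors and the URL test below are hypothetical, not taken from a real store), a crawler could pick which page definition to apply based on the URL pattern described earlier:

# Hypothetical page definitions for a single store website
productDef = Product('Example Store', 'http://example.com', 'h1',
                     'span.sku', 'span.price')
articleDef = Article('Example Store Blog', 'http://example.com', 'h1',
                     'div.post-body', 'time.published')

def chooseDefinition(url):
    # Blog posts live under /blog/; everything else is treated as a product
    return articleDef if '/blog/' in url else productDef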
Thinking About Web Crawler Models
Collecting information from the internet can be like drinking from a fire hose. There’s a lot of stuff out there, and it’s not always clear what you need or how you need it. The first step of any large web scraping project (and even some of the small ones) should be to answer these questions.
When collecting similar data across multiple domains or from multiple sources, your goal should almost always be to try to normalize it. Dealing with data with identical and comparable fields is much easier than dealing with data that is completely dependent on the format of its original source.
In many cases, you should build scrapers under the assumption that more sources of data will be added to them in the future, and with the goal to minimize the programming overhead required to add these new sources. Even if a website doesn’t appear to fit your model at first glance, there may be more subtle ways that it does conform. Being able to see these underlying patterns can save you time, money, and a lot of headaches in the long run.
The connections between pieces of data should also not be ignored. Are you looking for information that has properties such as “type,” “size,” or “topic” that span across data sources? How do you store, retrieve, and conceptualize these attributes?
Software architecture is a broad and important topic that can take an entire career to master. Fortunately, software architecture for web scraping is a much more finite and manageable set of skills that can be relatively easily acquired. As you continue to scrape data, you will likely find the same basic patterns occurring over and over. Creating a well-structured web scraper doesn’t require a lot of arcane knowledge, but it does require taking a moment to step back and think about your project.