To those who have not developed the skill, computer programming can seem like a kind of magic. If programming is magic, then web scraping is wizardry; that is, the application of magic for particularly impressive and useful—yet surprisingly effortless—feats.
In fact, in my years as a software engineer, I’ve found that few programming practices capture the excitement of programmers and laymen alike quite like web scraping. The ability to write a simple bot that collects data and streams it down a terminal or stores it in a database, while not difficult, never fails to provide a certain thrill and sense of possibility, no matter how many times you might have done it before.
This book seeks to put an end to many common questions and misconceptions about web scraping, while providing a comprehensive guide to the most common web-scraping tasks.
Beginning in Chapter 1, I’ll provide code samples periodically to demonstrate concepts. These code samples are in the public domain, and can be used with or without attribution (although acknowledgment is always appreciated). All code samples will also be available on the website for viewing and downloading.
The automated gathering of data from the Internet is nearly as old as the Internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I’ll use throughout the book, although I will occasionally refer to the web-scraping programs themselves as bots.
In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of the HTML and other files that comprise web pages), and then parses that data to extract needed information.
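The pattern described above can be sketched in a few lines of Python. This is only an illustrative sketch, not one of the book’s own code samples: the static HTML snippet stands in for a real server response, and the tag being extracted is an arbitrary choice.

```python
# A minimal sketch of the scraping pattern: fetch HTML from a server,
# then parse it to extract the needed information. In a real scraper the
# HTML would come from a live request, e.g.
#   urllib.request.urlopen("http://example.com").read().decode("utf-8")
# Here a static snippet stands in for that response.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the page's <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = "<html><head><title>Example Page</title></head><body>Hi</body></html>"
parser = TitleParser()
parser.feed(html)
print(parser.title)  # Example Page
```

Real-world scrapers typically lean on dedicated parsing libraries rather than hand-rolled parsers like this one; the chapters ahead introduce those tools.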
In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis and information security. This book will cover the basics of web scraping and crawling (Part I), and delve into some of the advanced topics in Part II.
In addition, web scrapers can go places that traditional search engines cannot. A Google search for “cheapest flights to Boston” will result in a slew of advertisements and popular flight search sites. Google only knows what these websites say on their content pages, not the exact results of various queries entered into a flight search application. However, a well-developed web scraper can chart the cost of a flight to Boston over time, across a variety of websites, and tell you the best time to buy your ticket.
You might be asking: “Isn’t data gathering what APIs are for?” (If you’re unfamiliar with APIs, see Chapter 4.) Well, APIs can be fantastic, if you find one that suits your purposes. They can provide a convenient stream of well-formatted data from one server to another. You can find an API for many different types of data you might want to use, such as Twitter posts or Wikipedia pages. In general, it is preferable to use an API (if one exists), rather than build a bot to get the same data. However, there are several reasons why an API might not exist.
Even when an API does exist, request volume and rate limits, the types of data, or the format of data that it provides might be insufficient for your purposes.
This is where web scraping steps in. With few exceptions, if you can view it in your browser, you can access it via a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do virtually anything with that data.
There are obviously many extremely practical applications of having access to nearly unlimited data: market forecasting, machine translation, and even medical diagnostics have benefited tremendously from the ability to retrieve and analyze data from news sites, translated texts, and health forums, respectively.
Even in the art world, web scraping has opened up new frontiers for creation. The 2006 project “We Feel Fine” by Jonathan Harris and Sep Kamvar scraped a variety of English-language blog sites for phrases starting with “I feel” or “I am feeling.” This led to a popular data visualization describing how the world was feeling day by day and minute by minute.
This book is designed to serve not only as an introduction to web scraping, but as a comprehensive guide to scraping almost every type of data from the modern Web. Although it uses the Python programming language, and covers many Python basics, it should not be used as an introduction to the language.
If you are not an expert programmer and don’t know any Python at all, this book might be a bit of a challenge. If, however, you are an experienced programmer, you should find the material easy to pick up. Appendix A covers installing and working with Python 3.x, which is used throughout this book. If you have only used Python 2.x, or do not have 3.x installed, you might want to review Appendix A.
If you’re looking for a more comprehensive Python resource, the book Introducing Python by Bill Lubanovic is a very good, if lengthy, guide. For those with shorter attention spans, the video series Introduction to Python by Jessica McKellar is an excellent resource.
Appendix C includes case studies, as well as a breakdown of key issues that might affect how you can legally run scrapers in the United States and use the data that they produce.
Technical books are often able to focus on a single language or technology, but web scraping is a relatively disparate subject, with practices that require the use of databases, web servers, HTTP, HTML, Internet security, image processing, data science, and other tools. This book attempts to cover all of these to an extent for the purpose of gathering data from remote sources across the Internet.
Part I covers the subject of web scraping and web crawling in depth, with a strong focus on a small handful of libraries used throughout the book. Part I can easily be used as a comprehensive reference for these libraries and techniques (with certain exceptions, where additional references will be provided).
Part II covers additional subjects that the reader might find useful when writing web scrapers. These subjects are, unfortunately, too broad to be neatly wrapped up in a single chapter. Because of this, frequent references will be made to other resources for additional information.
This book is arranged to make it easy to jump among chapters to find only the web-scraping technique or information that you are looking for. When a concept or piece of code builds on one mentioned in a previous chapter, I will explicitly reference the section in which it was addressed.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Supplemental material (code examples, exercises, etc.) is available for download at http://pythonscraping.com/code/.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Web Scraping with Python by Ryan Mitchell (O’Reilly). Copyright 2015 Ryan Mitchell, 978-1-491-91029-0.”
If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at email@example.com.
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/1ePG2Uj.
To comment or ask technical questions about this book, send email to firstname.lastname@example.org.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Just as some of the best products arise out of a sea of user feedback, this book could never have existed in any useful form without the help of many collaborators, cheerleaders, and editors. Thank you to the O’Reilly staff and their amazing support for this somewhat unconventional subject; to my friends and family who have offered advice and put up with impromptu readings; and to my coworkers at LinkeDrive, to whom I now likely owe many hours of work.
Thank you, in particular, to Allyson MacDonald, Brian Anderson, Miguel Grinberg, and Eric VanWyk for their feedback, guidance, and occasional tough love. Quite a few sections and code samples were written as a direct result of their inspirational suggestions.
Thank you to Yale Specht for his limitless patience throughout the past nine months, providing the initial encouragement to pursue this project, and stylistic feedback during the writing process. Without him, this book would have been written in half the time but would not be nearly as useful.
Finally, thanks to Jim Waldo, who really started this whole thing many years ago when he mailed a Linux box and The Art and Science of C to a young and impressionable teenager.