Data Wrangling with Python

by Jacqueline Kazil, Katharine Jarmul
February 2016
Beginner to intermediate
508 pages
12h 27m
English
O'Reilly Media, Inc.

Chapter 11. Web Scraping: Acquiring and Storing Data from the Web

Web scraping is an essential part of data mining in today’s world, as you can find nearly everything on the Web. With web scraping, you can use Python libraries to explore web pages, search for information, and collect it for your reporting. Web scraping lets you crawl sites and find information not easily accessible without robotic assistance.

This technique gives you access to data not contained in an API or a file. Imagine a script to log into your email account, download files, run analysis, and send an aggregated report. Imagine testing your site to make sure it’s fully functional without ever touching a browser. Imagine grabbing data from a series of tables on a regularly updated website. These examples show how web scraping can assist with your data wrangling needs.

Whether you need to scrape local websites, public websites, or XML documents, you can use many of the same tools. Most websites hold their data in the HTML of their pages. HTML is a markup language that, like the XML example in Chapter 3, uses angle-bracketed tags to hold data. In this chapter, we will use libraries that know how to read and parse markup languages like HTML and XML.
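To make the idea of reading data out of markup concrete, here is a minimal sketch that pulls the rows out of an HTML table. It rests on a few assumptions: the markup is a made-up sample rather than a real page, the names `SAMPLE_HTML`, `TableParser`, and `parse_table` are hypothetical, and Python's standard-library `html.parser` stands in for whichever parsing libraries the chapter goes on to use.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; a real scraper would download the page first.
SAMPLE_HTML = """
<html><body>
  <table>
    <tr><th>City</th><th>Population</th></tr>
    <tr><td>Springfield</td><td>30000</td></tr>
    <tr><td>Shelbyville</td><td>25000</td></tr>
  </table>
</body></html>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text that appears inside a cell is kept.
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

The parser reacts to opening tags, closing tags, and the text between them, which is the essence of what a markup-aware library does on your behalf. Dedicated scraping libraries offer far more convenient ways to search a document, so treat this only as an illustration of the underlying mechanics.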

Many sites use internal APIs and embedded JavaScript to control the content on their pages. Because of these newer ways of building the Web, not all information can be found using page-reading scrapers. We’ll also learn how to use some ...
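The limitation of page-reading scrapers can be illustrated with a small sketch. It assumes a made-up page (`SAMPLE_PAGE`) in which one element is filled in by a script, and it uses a hypothetical `VisibleTextParser` built on the standard-library `html.parser`. Because the parser only reads markup and never runs JavaScript, the text the script would insert is never seen.

```python
from html.parser import HTMLParser

# Hypothetical page: the second element is empty until a browser runs the script.
SAMPLE_PAGE = """
<html><body>
  <p id="static">Loaded with the page</p>
  <div id="dynamic"></div>
  <script>
    document.getElementById("dynamic").textContent = "Loaded by JavaScript";
  </script>
</body></html>
"""


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and not self._skip:
            self.text.append(stripped)


parser = VisibleTextParser()
parser.feed(SAMPLE_PAGE)
parser.close()
visible = parser.text

# Only the static paragraph is found; the script was read as text, never executed.
print(visible)
```

A browser would show both pieces of text, but a scraper that only parses the downloaded HTML finds just the static one. Content of this kind typically requires a tool that can execute the page's JavaScript or a direct request to the API that supplies the data.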


ISBN: 9781491948804