Blueprints for Text Analytics Using Python
by Jens Albrecht, Sidharth Ramachandran, Christian Winkler
Chapter 3. Scraping Websites and Extracting Data
You will often visit a website and find its content interesting. If there are only a few pages, you can read everything yourself. But once there is a considerable amount of content, reading it all manually is no longer feasible.
To use the powerful text analytics blueprints described in this book, you have to acquire the content first. Most websites won’t have a “download all content” button, so we have to find a clever way to download (“scrape”) the pages.
Usually we are interested mainly in the content of each individual web page and less in navigation and other surrounding elements. Once the data is available locally, we can use powerful extraction techniques to dissect the pages into elements such as title, content, and meta-information (publication date, author, and so on).
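To make the idea of dissecting a page concrete, here is a minimal sketch that uses only Python's standard library `html.parser` module. It is not the approach developed in this chapter; the sample HTML, the meta tag names (`author`, `date`), and the names `PageParser` and `extract_page` are assumptions made purely for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page; real pages are far messier.
SAMPLE_HTML = """
<html><head><title>Example headline</title>
<meta name="author" content="Jane Doe">
<meta name="date" content="2020-01-01">
</head><body><nav><a href="/">Home</a></nav>
<p>First paragraph.</p><p>Second <b>paragraph</b>.</p></body></html>
"""


class PageParser(HTMLParser):
    """Collects the title, named meta tags, and paragraph text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name and "content" in attributes:
                self.meta[name] = attributes["content"]
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Text outside <title> and <p> (e.g. navigation) is ignored.
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def extract_page(html):
    """Return title, meta-information, and content of one HTML page."""
    parser = PageParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "content": parser.paragraphs,
    }


page = extract_page(SAMPLE_HTML)
print(page["title"])
print(page["meta"])
print(page["content"])
```

The navigation link is dropped because only text inside `<title>` and `<p>` is collected. Real-world pages usually need more robust selection rules than this.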
What You’ll Learn and What We’ll Build
In this chapter, we will show you how to acquire HTML data from websites and use powerful tools to extract the content from these HTML files. We will show this with content from one specific data source, the Reuters news archive.
In the first step, we will download single HTML files and extract data from each one with different methods.
Normally, you will not be interested in single pages. Therefore, we will build a blueprint solution. We will download and analyze a news archive page (which contains links to all articles). After completing this, we know the URLs of the referred ...
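As a rough illustration of the archive step, the following sketch collects the article URLs linked from an archive page. It assumes the page has already been downloaded into a string; the sample HTML, the base URL `https://example.com/archive/`, and the names `LinkCollector` and `collect_links` are hypothetical and do not reflect the structure of the actual Reuters archive.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical archive page with relative, absolute-path, and duplicate links.
ARCHIVE_HTML = """
<ul>
  <li><a href="/article/one">One</a></li>
  <li><a href="two.html">Two</a></li>
  <li><a href="/article/one">One again</a></li>
  <li><a>No link target</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Gathers the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the archive page's own URL.
        url = urljoin(self.base_url, href)
        if url not in self.links:  # keep first occurrence, skip duplicates
            self.links.append(url)


def collect_links(html, base_url):
    """Return the unique absolute URLs linked from an HTML page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


links = collect_links(ARCHIVE_HTML, "https://example.com/archive/")
for link in links:
    print(link)
```

In practice an archive page also links to navigation, advertising, and other non-article targets, so the resulting list would typically need filtering by URL pattern before downloading.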