4. Collecting Text Data with Web Scraping and APIs

Overview

This chapter introduces you to the concept of web scraping. You will first learn how to extract data (such as text, images, lists, and tables) from pages that are written using HTML. You will then learn about the various types of semi-structured data used to create web pages (such as JSON and XML) and extract data from them. Finally, you will use APIs for data extraction from Twitter, using the tweepy package.

Introduction

In the last chapter, we developed a simple classifier using feature extraction methods. We also covered different algorithms that fall under supervised and unsupervised learning. In this chapter, you will learn how to collect text data by scraping web pages, ...

Get The Natural Language Processing Workshop now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.