Applied Text Analysis with Python

by Benjamin Bengfort, Rebecca Bilbro, Tony Ojeda
June 2018
Beginner to intermediate
330 pages
English
O'Reilly Media, Inc.

Chapter 3. Corpus Preprocessing and Wrangling

In the previous chapter, we learned how to build and structure a custom, domain-specific corpus. Unfortunately, any real corpus in its raw form is completely unusable for analytics without significant preprocessing and compression. In fact, a key motivation for writing this book is the immense challenge we ourselves have encountered in our efforts to build and wrangle corpora large and rich enough to power meaningfully literate data products. Given how much of our own routine time and effort is dedicated to text preprocessing and wrangling, it is surprising how few resources exist to support (or even acknowledge!) these phases.

In this chapter, we propose a multipurpose preprocessing framework that can be used to systematically transform our raw ingested text into a form that is ready for computation and modeling. Our framework includes the five key stages shown in Figure 3-1: content extraction, paragraph blocking, sentence segmentation, word tokenization, and part-of-speech tagging. For each of these stages, we will provide functions conceived as methods under the HTMLCorpusReader class defined in the previous chapter.

Streaming preprocessing performs segmentation, tokenization, and tagging on our raw corpus.
Figure 3-1. Breakdown of document segmentation, tokenization, and tagging
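
The five stages named above can be sketched as a chain of small functions, each consuming the output of the previous one. What follows is only a minimal illustration using the Python standard library, not the chapter's implementation: the function names (paras, sents, words, preprocess), the regex-based splitting rules, and the placeholder tagger are all assumptions made for this sketch. A real pipeline would swap in trained sentence segmenters, tokenizers, and part-of-speech taggers, and would attach these functions as methods on the corpus reader.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text inside each <p> element of an HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the collected fragments and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def paras(html):
    """Stages 1-2: content extraction and paragraph blocking."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    yield from parser.paragraphs


def sents(paragraph):
    """Stage 3: naive sentence segmentation (split after ., !, or ?)."""
    for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
        if sentence:
            yield sentence


def words(sentence):
    """Stage 4: naive tokenization into words and punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)


def placeholder_tagger(tokens):
    """Stage 5 stand-in: NOT a real part-of-speech tagger.

    It only separates punctuation from other tokens so that the output
    has the (token, tag) shape a real tagger would produce.
    """
    return [(t, "TOKEN" if t[0].isalnum() else "PUNCT") for t in tokens]


def preprocess(html, tagger=placeholder_tagger):
    """Yield one paragraph at a time as a list of tagged sentences."""
    for paragraph in paras(html):
        yield [tagger(words(sentence)) for sentence in sents(paragraph)]


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello world. It works!</p></body></html>"
    for paragraph in preprocess(sample):
        print(paragraph)
```

Because preprocess is a generator, paragraphs are produced one at a time rather than being held in memory all at once, which matches the streaming approach described in the figure. Passing a different callable as tagger is the hook for plugging in a real part-of-speech tagger.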

Breaking Down Documents

In the previous chapter, we began constructing a custom HTMLCorpusReader, providing it with methods for filtering, ...

ISBN: 9781491963036