Chapter 13. Working with Text
Data can reside not just in numbers but also in words: names of dog breeds, restaurant violation descriptions, street addresses, speeches, blog posts, internet reviews, and much more. To organize and analyze information contained in text, we often need to do some of the following tasks:
- Convert text into a standard format
  This is also referred to as canonicalizing text. For example, we might need to convert characters to lowercase, use common spellings and abbreviations, or remove punctuation and blank spaces.
- Extract a piece of text to create a feature
  As an example, a string might contain a date embedded in it, and we want to pull it out from the string to create a date feature.
- Transform text into features
  We might want to encode particular words or phrases as 0-1 features to indicate their presence in a string.
- Analyze text
  In order to compare entire documents at once, we can transform a document into a vector of word counts.
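To make these four tasks concrete, here is a minimal sketch in Python. The function names (`canonicalize`, `extract_date`, `has_word`, `word_counts`) and the example strings are hypothetical, made up for illustration, and we assume dates appear in `YYYY-MM-DD` form. Real text usually needs more care than this.

```python
import re
from collections import Counter

def canonicalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation -> space
    return " ".join(text.split())          # collapse runs of whitespace

def extract_date(text):
    """Return the first YYYY-MM-DD substring (assumed format), or None."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return match.group() if match else None

def has_word(text, word):
    """0-1 feature: 1 if `word` appears as a whole word in the text, else 0."""
    return int(word in canonicalize(text).split())

def word_counts(text):
    """Represent a document as counts of its canonicalized words."""
    return Counter(canonicalize(text).split())

# Hypothetical example strings, not taken from any real dataset.
description = "  Unclean Surfaces;  VERMIN, unclean floors. "
note = "Inspected on 2016-03-04 by district staff"

clean = canonicalize(description)
date = extract_date(note)
vermin_flag = has_word(description, "vermin")
counts = word_counts(description)
```

Here `clean` is a lowercase string with single spaces and no punctuation, `date` is the embedded date as a string, `vermin_flag` is 1, and `counts` maps each word to the number of times it occurs.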
This chapter introduces common techniques for working with text data. We show how simple string manipulation tools are often all we need to put text in a standard form or extract portions of strings. We also introduce regular expressions for more general and robust pattern matching. To demonstrate these text operations, we use several examples. We first introduce these examples and describe the work needed to prepare the text for analysis.
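As a small illustration of why regular expressions can be more robust than plain string methods, consider expanding the abbreviation "st" to "street". The address string below is made up for this sketch.

```python
import re

address = "1 first st, east st"   # hypothetical, already lowercased

# Plain replacement rewrites every "st", including the ones inside
# "first" and "east".
naive = address.replace("st", "street")

# \b matches a word boundary, so only the standalone word "st" is rewritten.
robust = re.sub(r"\bst\b", "street", address)

print(naive)    # 1 firstreet street, eastreet street
print(robust)   # 1 first street, east street
```

The string method is simpler to read and is enough when the target cannot appear inside other words. The pattern with word boundaries handles the general case.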
Examples of Text and Tasks
For each type of task just introduced, ...