Chapter 6. Handling Text
6.0 Introduction
Unstructured text data, like the contents of a book or a tweet, is both one of the most interesting sources of features and one of the most complex to handle. In this chapter, we will cover strategies for transforming text into information-rich features and use some out-of-the-box features (termed embeddings) that have become increasingly ubiquitous in tasks that involve natural language processing (NLP).
This is not to say that the recipes covered here are comprehensive. Entire academic disciplines focus on handling unstructured data such as text. In this chapter, we will cover some commonly used techniques; knowledge of these will add valuable tools to our preprocessing toolbox. In addition to many generic text processing recipes, we’ll also demonstrate how you can import and leverage some pretrained machine learning models to generate richer text features.
6.1 Cleaning Text
Problem
You have some unstructured text data and want to complete some basic cleaning.
Solution
In the following example, we look at the text for three books and clean it by using Python’s core string operations, in particular strip, replace, and split:
# Create text
text_data = [" Interrobang. By Aishwarya Henriette ",
             "Parking And Going. By Karl Gautier",
             " Today Is The night. By Jarek Prakash "]

# Strip whitespaces
strip_whitespace = [string.strip() for string in text_data]

# Show text
strip_whitespace
['Interrobang. By Aishwarya Henriette', 'Parking And ...
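The example above shows only strip. As a minimal sketch of how the other two operations named in the solution, replace and split, might be applied to the same strings, consider the following. The variable names remove_periods and title_author, and the choice of " By " as the separator, are assumptions made for illustration and are not taken from the original recipe:

```python
# Same three titles as in the example above
text_data = [" Interrobang. By Aishwarya Henriette ",
             "Parking And Going. By Karl Gautier",
             " Today Is The night. By Jarek Prakash "]

# Strip leading and trailing whitespace
strip_whitespace = [string.strip() for string in text_data]

# Remove periods with str.replace
remove_periods = [string.replace(".", "") for string in strip_whitespace]

# Split each string into [title, author] on " By "
# (assumed separator; it happens to fit this sample data)
title_author = [string.split(" By ") for string in remove_periods]

# Show the result
print(title_author)
```

Each step builds a new list and leaves the previous one unchanged, which makes it easy to inspect the intermediate results when a cleaning step does something unexpected.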