Chapter 6. Handling Text
6.0 Introduction
Unstructured text data, like the contents of a book or a tweet, is both one of the most interesting sources of features and one of the most complex to handle. In this chapter, we will cover strategies for transforming text into information-rich features and use some out-of-the-box features (termed embeddings) that have become increasingly ubiquitous in tasks that involve natural language processing (NLP).
This is not to say that the recipes covered here are comprehensive. Entire academic disciplines focus on handling unstructured data such as text. In this chapter, we will cover some commonly used techniques; knowledge of these will add valuable tools to our preprocessing toolbox. In addition to many generic text processing recipes, we’ll also demonstrate how you can import and leverage some pretrained machine learning models to generate richer text features.
6.1 Cleaning Text
Problem
You have some unstructured text data and want to complete some basic cleaning.
Solution
In the following example, we look at the text for three books and clean it by using Python’s core string operations, in particular strip, replace, and split:
# Create text
text_data = [" Interrobang. By Aishwarya Henriette ",
             "Parking And Going. By Karl Gautier",
             " Today Is The night. By Jarek Prakash "]
# Strip whitespaces
strip_whitespace = [string.strip() for string in text_data]
# Show text
strip_whitespace
['Interrobang. By Aishwarya Henriette', 'Parking And ...
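The three string methods named in the solution can be combined in one pass. The following is a minimal sketch, not the book's own code: the helper name clean_record and the choice of " By " as the separator between title and author are assumptions made for illustration.

```python
# Illustrative sketch only: clean_record and the " By " separator are
# assumptions for this example, not part of the original recipe.
text_data = [" Interrobang. By Aishwarya Henriette ",
             "Parking And Going. By Karl Gautier",
             " Today Is The night. By Jarek Prakash "]


def clean_record(raw):
    """Return a (title, author) tuple from a 'Title. By Author' string."""
    stripped = raw.strip()                        # drop leading/trailing whitespace
    no_periods = stripped.replace(".", "")        # remove every period
    title, author = no_periods.split(" By ", 1)   # split once at the separator
    return title, author


cleaned = [clean_record(string) for string in text_data]
print(cleaned)
```

Splitting with a maximum of one split keeps any later " By " inside the author part rather than raising an unpacking error. A string with no " By " at all would still fail, so real data may need a guard.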