Machine Learning with Python Cookbook

by Chris Albon
March 2018
Content level: Intermediate to advanced
364 pages
7h 12m
English
O'Reilly Media, Inc.

Chapter 6. Handling Text

6.0 Introduction

Unstructured text data, like the contents of a book or a tweet, is both one of the most interesting sources of features and one of the most complex to handle. In this chapter, we will cover strategies for transforming text into information-rich features. This is not to say that the recipes covered here are comprehensive. There exist entire academic disciplines focused on handling this and similar types of data, and the contents of all their techniques would fill a small library. Despite this, there are some commonly used techniques, and a knowledge of these will add valuable tools to our preprocessing toolbox.

6.1 Cleaning Text

Problem

You have some unstructured text data and want to complete some basic cleaning.

Solution

Most basic text cleaning operations require only Python’s core string methods, in particular strip, replace, and split:

# Create text
text_data = ["   Interrobang. By Aishwarya Henriette     ",
             "Parking And Going. By Karl Gautier",
             "    Today Is The night. By Jarek Prakash   "]

# Strip leading and trailing whitespace
strip_whitespace = [string.strip() for string in text_data]

# Show text
strip_whitespace
['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']
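
Called with no arguments, strip removes whitespace from both ends of a string. The related lstrip and rstrip methods trim only one side, and all three accept a string of characters to trim in place of whitespace. A brief sketch, using a made-up sample string and variable names that are not part of the recipe:

```python
# Hypothetical sample string (not from the recipe's data)
sample = "   Interrobang. By Aishwarya Henriette.   "

# Trim whitespace from the left side only
left_only = sample.lstrip()

# Trim whitespace from the right side only
right_only = sample.rstrip()

# Trim any mix of spaces and periods from both ends;
# the period inside the string is left untouched
no_edge_periods = sample.strip(" .")
```

Note that the argument to strip is treated as a set of characters, not as a prefix or suffix, so the order of the characters in " ." does not matter.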

# Remove periods
remove_periods = [string.replace(".", "") for string in strip_whitespace]

# Show text
remove_periods
['Interrobang By Aishwarya Henriette', 'Parking And Going By Karl Gautier', 'Today Is The night By ...
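
The solution names split alongside strip and replace. As a minimal sketch of how it might be used here, we can separate each title from its author, assuming every string follows the "Title. By Author" pattern of the sample data. The ". By " separator and the title_author name are our own assumptions for illustration:

```python
# Same sample strings as in the recipe
text_data = ["   Interrobang. By Aishwarya Henriette     ",
             "Parking And Going. By Karl Gautier",
             "    Today Is The night. By Jarek Prakash   "]

# Strip surrounding whitespace, then split once on ". By "
# (assumes this separator appears in every string)
title_author = [string.strip().split(". By ", 1) for string in text_data]

# Unpack each [title, author] pair into two parallel lists
titles = [pair[0] for pair in title_author]
authors = [pair[1] for pair in title_author]
```

Passing 1 as the second argument limits the split to the first occurrence of the separator, so a title that itself contained ". By " would cut at the first match and leave the remainder in the author field.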

ISBN: 9781491989371