Introduction to Machine Learning with Python

by Andreas C. Müller, Sarah Guido
October 2016
Beginner to intermediate
400 pages
10h 25m
English
O'Reilly Media, Inc.

Chapter 7. Working with Text Data

In Chapter 4, we talked about two kinds of features that can represent properties of the data: continuous features that describe a quantity, and categorical features that are items from a fixed list. There is a third kind of feature that can be found in many applications, which is text. For example, if we want to classify an email message as either a legitimate email or spam, the content of the email will certainly contain important information for this classification task. Or maybe we want to learn about the opinion of a politician on the topic of immigration. Here, that individual’s speeches or tweets might provide useful information. In customer service, we often want to find out if a message is a complaint or an inquiry. We can use the subject line and content of a message to automatically determine the customer’s intent, which allows us to send the message to the appropriate department, or even send a fully automatic reply.

Text data is usually represented as strings, made up of characters. In any of the examples just given, the length of the text data will vary. This kind of feature is clearly very different from the numeric features that we’ve discussed so far, and we will need to process the data before we can apply our machine learning algorithms to it.
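
To make the point about varying length concrete, here is a minimal sketch in plain Python. The example messages and the names `messages`, `vocabulary`, and `to_counts` are hypothetical and do not come from the book. Simple word counting is used here only as one possible way to turn strings into fixed-length numeric rows, not necessarily the approach this chapter goes on to use.

```python
from collections import Counter

# Hypothetical example messages (assumed for illustration, not from the book).
messages = [
    "win a free prize now",
    "are we still meeting for lunch tomorrow",
    "free free prize",
]

# The raw strings have different lengths, so they cannot be used
# directly as rows of a fixed-width numeric feature matrix.
lengths = [len(message) for message in messages]

# Collect every distinct word across all messages, in a stable order.
vocabulary = sorted({word for message in messages for word in message.split()})


def to_counts(message):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]


# Every row now has the same length: one entry per vocabulary word.
features = [to_counts(message) for message in messages]

for message, row in zip(messages, features):
    print(f"{message!r} -> {row}")
```

Each row of `features` has exactly `len(vocabulary)` entries regardless of how long the original message was, which is the kind of fixed-size numeric input that the algorithms discussed so far expect. Splitting on whitespace is a deliberate simplification; real text usually needs more careful handling of punctuation and capitalization.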

7.1 Types of Data Represented as Strings

Before we dive into the processing steps that go into representing text data for machine learning, we want to briefly discuss different kinds of text ...

ISBN: 9781449369880