Chapter 4. Text Classification
One of the more novel uses for binary classification is sentiment analysis, which examines a sample of text such as a product review, a tweet, or a comment left on a website and scores it on a scale of 0.0 to 1.0. The score is the probability that the text expresses positive sentiment: 0.0 represents negative sentiment and 1.0 represents positive sentiment. A review such as “great product at a great price” might score 0.9, while “overpriced product that barely works” might score 0.1. Sentiment analysis models are difficult to build from hand-coded rules but are relatively easy to craft with machine learning. For examples of how sentiment analysis is used in business today, see the article “8 Sentiment Analysis Real-World Use Cases” by Nicholas Bianchi.
Sentiment analysis is one example of a task that involves classifying textual data rather than numerical data. Because machine learning works with numbers, you must convert text to numbers before training a sentiment analysis model, a model that identifies spam emails, or any other model that classifies text. A common approach is to build a table of word frequencies called a bag of words, in which each row represents one text sample and each column counts the occurrences of one word. Scikit-Learn provides classes that build such tables. It also supports normalizing text so that, for example, “awesome” and “Awesome” don’t count as two different words.
This chapter begins by describing how to prepare text for use in classification models. After building a sentiment analysis model, you’ll learn about ...