Chapter 4. Text Classification
One of the more novel uses for binary classification is sentiment analysis, which examines a sample of text such as a product review, a tweet, or a comment left on a website and scores it on a scale of 0.0 to 1.0, where 0.0 represents negative sentiment and 1.0 represents positive sentiment. A review such as “great product at a great price” might score 0.9, while “overpriced product that barely works” might score 0.1. The score is the probability that the text expresses positive sentiment. Sentiment analysis models are difficult to build algorithmically but are relatively easy to craft with machine learning. For examples of how sentiment analysis is used in business today, see the article “8 Sentiment Analysis Real-World Use Cases” by Nicholas Bianchi.
Sentiment analysis is one example of a task that involves classifying textual data rather than numerical data. Because machine learning works with numbers, you must convert text to numbers before training a sentiment analysis model, a model that identifies spam emails, or any other model that classifies text. A common approach is to build a table of word frequencies called a bag of words. Scikit-Learn provides classes to help. It also includes support for normalizing text so that, for example, “awesome” and “Awesome” don’t count as two different words.
This chapter begins by describing how to prepare text for use in classification models. After building a sentiment analysis model, you’ll learn about another popular learning algorithm called Naive Bayes that works particularly well with text and use it to build a model that distinguishes between legitimate emails and spam emails. Finally, you’ll learn about a mathematical technique for measuring the similarity of two text samples and use it to build an app that recommends movies based on other movies you enjoy.
Preparing Text for Classification
Before you train a model to classify text, you must convert the text into numbers, a process known as vectorization. Chapter 1 presented the illustration reproduced in Figure 4-1, which demonstrates a common technique for vectorizing text. Each row represents a text sample such as a tweet or a movie review, and each column represents a word in the training text. The numbers in the rows are word counts, and the final number in each row is a label: 0 for negative and 1 for positive.
Text is typically cleaned before it’s vectorized. Examples of cleaning include converting characters to lowercase (so, for example, “Excellent” is equivalent to “excellent”), removing punctuation symbols, and optionally removing stop words—common words such as the and and that are likely to have little impact on the outcome. Once cleaned, sentences are divided into individual words (tokenized) and the words are used to produce datasets like the one in Figure 4-1.
Scikit-Learn has three classes that handle the bulk of the work of cleaning and vectorizing text:
CountVectorizer
- Creates a dictionary (vocabulary) from the corpus of words in the training text and generates a matrix of word counts like the one in Figure 4-1
HashingVectorizer
- Uses word hashes rather than an in-memory vocabulary to produce word counts and is therefore more memory efficient
TfidfVectorizer
- Creates a dictionary from words provided to it and generates a matrix similar to the one in Figure 4-1, but rather than containing integer word counts, the matrix contains term frequency-inverse document frequency (TFIDF) values between 0.0 and 1.0 reflecting the relative importance of individual words
All three classes are capable of converting text to lowercase, removing punctuation symbols, removing stop words, splitting sentences into individual words, and more. They also support n-grams, which are combinations of two or more consecutive words (you specify the number n) that should be treated as a single word. The idea is that words such as credit and score might be more meaningful if they appear next to each other in a sentence than if they appear far apart. Without n-grams, the relative proximity of words is ignored. The downside to using n-grams is that it increases memory consumption and training time. Used judiciously, however, it can make text classification models more accurate.
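To make n-grams concrete, here’s a minimal sketch of my own (the sample sentence is invented; ngram_range is the actual Scikit parameter, and you’ll see it again later in this chapter):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Treat single words (1-grams) and consecutive word pairs (2-grams) as terms
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(['your credit score matters'])
print(vectorizer.get_feature_names_out())
# ['credit' 'credit score' 'matters' 'score' 'score matters' 'your' 'your credit']
```

Notice that credit score and score matters appear in the vocabulary as single terms alongside the individual words.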
Note
Neural networks have other, more powerful ways of taking word order into account that don’t require related words to occur next to each other. A conventional machine learning model can’t connect the words blue and sky in the sentence “I like blue, for it is the color of the sky,” but a neural network can. I will shed more light on this in Chapter 13.
Here’s an example demonstrating what CountVectorizer does and how it’s used:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lines = [
    'Four score and 7 years ago our fathers brought forth,',
    '... a new NATION, conceived in liberty $$$,',
    'and dedicated to the PrOpOsItIoN that all men are created equal',
    'One nation\'s freedom equals #freedom for another $nation!'
]

# Vectorize the lines
vectorizer = CountVectorizer(stop_words='english')
word_matrix = vectorizer.fit_transform(lines)

# Show the resulting word matrix
feature_names = vectorizer.get_feature_names_out()
line_names = [f'Line {(i + 1):d}' for i, _ in enumerate(word_matrix)]

df = pd.DataFrame(data=word_matrix.toarray(), index=line_names,
                  columns=feature_names)
df.head()
```
Here’s the output:
The corpus of text in this case is four strings in a Python list. CountVectorizer broke the strings into words, removed stop words and symbols, and converted all remaining words to lowercase. Those words comprise the columns in the dataset, and the numbers in the rows show how many times a given word appears in each string. The stop_words='english' parameter tells CountVectorizer to remove stop words using a built-in dictionary of more than 300 English-language stop words. If you prefer, you can provide your own list of stop words in a Python list. (Or you can leave the stop words in there; it often doesn’t matter.) And if you’re training with text written in another language, you can get lists of multilanguage stop words from other Python libraries such as the Natural Language Toolkit (NLTK) and Stop-words.
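Here’s a quick sketch of both options; the custom stop word list is a made-up example, and the NLTK call assumes you’ve downloaded NLTK’s stopwords corpus with nltk.download('stopwords'):

```python
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Option 1: a custom (hypothetical) stop word list
vectorizer = CountVectorizer(stop_words=['the', 'and', 'a', 'an'])

# Option 2: a stop word list borrowed from NLTK
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
```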
Observe from the output that equal and equals count as separate words, even though they have similar meaning. Data scientists sometimes go a step further when preparing text for machine learning by stemming or lemmatizing words. If the preceding text were stemmed, all occurrences of equals would be converted to equal. Scikit lacks support for stemming and lemmatization, but you can get it from other libraries such as NLTK.
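Here’s one way that might look. This sketch of my own combines NLTK’s PorterStemmer with the analyzer that CountVectorizer builds internally; the stemmed_analyzer helper is hypothetical, not part of either library:

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
analyzer = CountVectorizer(stop_words='english').build_analyzer()

def stemmed_analyzer(text):
    # Tokenize and clean the text with CountVectorizer's own analyzer,
    # then reduce each token to its stem
    return [stemmer.stem(word) for word in analyzer(text)]

vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
```

With an analyzer like this in place, equal and equals land in the same column of the word matrix.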
CountVectorizer removes punctuation symbols, but it doesn’t remove numbers. It ignored the 7 in line 1 because it ignores single characters. But if you changed 7 to 777, the term 777 would appear in the vocabulary. One way to fix that is to define a function that removes numbers and pass it to CountVectorizer via the preprocessor parameter:
```python
import re

def preprocess_text(text):
    # Strip digits, then convert the text to lowercase
    return re.sub(r'\d+', '', text).lower()

vectorizer = CountVectorizer(stop_words='english',
                             preprocessor=preprocess_text)
word_matrix = vectorizer.fit_transform(lines)
```
Note the call to lower to convert the text to lowercase. CountVectorizer doesn’t convert text to lowercase if you provide a preprocessing function, so the preprocessing function must convert it itself. It still removes punctuation characters, however.
Another useful parameter to CountVectorizer is min_df, which ignores words whose document frequency falls below a specified threshold. It can be an integer specifying the minimum number of samples a word must appear in (for example, ignore words that appear in fewer than five samples, or min_df=5), or it can be a floating-point value from 0.0 to 1.0 specifying the minimum percentage of samples in which a word must appear—for example, ignore words that appear in less than 10% of the samples (min_df=0.1). It’s great for filtering out words that probably aren’t meaningful anyway, and it reduces memory consumption and training time by decreasing the size of the vocabulary. CountVectorizer also supports a max_df parameter for eliminating words that appear too frequently.
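Here’s a quick sketch showing both parameters together; the values are arbitrary examples, not recommendations:

```python
# Ignore words that appear in fewer than 5 samples, as well as words
# that appear in more than half of all samples
vectorizer = CountVectorizer(min_df=5, max_df=0.5)
```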
The preceding examples use CountVectorizer, which probably leaves you wondering when (and why) you would use HashingVectorizer or TfidfVectorizer instead. HashingVectorizer is useful when dealing with large datasets. Rather than store words in memory, it hashes each word and uses the hash as an index into an array of word counts. It can therefore do more with less memory and is very useful for reducing the size of vectorizers when serializing them so that you can restore them later—a topic I’ll say more about in Chapter 7. The downside to HashingVectorizer is that it doesn’t let you work backward from vectorized text to the original text. CountVectorizer does, and it provides an inverse_transform method for that purpose.
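Here’s a minimal sketch of HashingVectorizer that reuses the lines list from the earlier example; n_features controls the size of the hash space and defaults to 2**20:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# No vocabulary is stored in memory; each word is hashed to one of
# n_features columns in the output matrix
vectorizer = HashingVectorizer(stop_words='english')
word_matrix = vectorizer.fit_transform(lines)
word_matrix.shape  # (4, 1048576)
```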
TfidfVectorizer is frequently used to perform keyword extraction: examining a document or set of documents and extracting keywords that characterize their content. It assigns words numerical weights reflecting their importance, and it uses two factors to determine the weights: how often a word appears in individual documents, and how often it appears in the overall document set. Words that appear more frequently in individual documents but occur in fewer documents receive higher weights. I won’t go further into it here, but if you’re curious to learn more, this book’s GitHub repo contains a notebook that uses TfidfVectorizer to extract keywords from the manuscript of Chapter 1.
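To give you a feel for how that works, here’s a minimal keyword-extraction sketch of my own; the three sample documents are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    'Machine learning models learn from training data',
    'Deep learning is a branch of machine learning',
    'Data pipelines deliver clean data to models'
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(documents)

# Show the three highest-weighted words in the first document
row = tfidf[0].toarray().ravel()
top = np.argsort(row)[::-1][:3]
print(vectorizer.get_feature_names_out()[top])
```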
Sentiment Analysis
To train a sentiment analysis model, you need a labeled dataset. Several such datasets are available in the public domain. One of those is the IMDB movie review dataset, which contains 25,000 samples of negative reviews and 25,000 samples of positive reviews posted on the Internet Movie Database website. Each review is meticulously labeled with a 0 for negative sentiment or a 1 for positive sentiment. To demonstrate how sentiment analysis works, let’s build a binary classification model and train it with this dataset. We’ll use logistic regression as the learning algorithm. A sentiment analysis score yielded by this model is simply the probability that the input expresses positive sentiment, which is easily retrieved by calling LogisticRegression’s predict_proba method.
Start by downloading the dataset and copying it to the Data subdirectory of the directory that hosts your Jupyter notebooks. Then run the following code in a notebook to load the dataset and show the first five rows:
```python
import pandas as pd

df = pd.read_csv('Data/reviews.csv', encoding='ISO-8859-1')
df.head()
```
The encoding parameter is necessary because the CSV file uses ISO-8859-1 character encoding rather than UTF-8. The output is as follows:
Find out how many rows the dataset contains and confirm that there are no missing values:
```python
df.info()
```
Use the following statement to see how many instances there are of each class (0 for negative and 1 for positive):
```python
df.groupby('Sentiment').describe()
```
Here is the output:
There is an even number of positive and negative samples, but in each case, the number of unique samples is less than the number of samples for that class. That means the dataset has duplicate rows, and duplicate rows could bias a machine learning model. Use the following statements to delete the duplicate rows and check for balance again:
```python
df = df.drop_duplicates()
df.groupby('Sentiment').describe()
```
Now there are no duplicate rows, and the number of positive and negative samples is roughly equal.
Next, use CountVectorizer to prepare and vectorize the text in the Text column. Set min_df to 20 to ignore words that appear infrequently in the training text. This reduces the likelihood of out-of-memory errors and will probably make the model more accurate as well. Also use the ngram_range parameter to allow CountVectorizer to include word pairs as well as individual words:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english',
                             min_df=20)
x = vectorizer.fit_transform(df['Text'])
y = df['Sentiment']
```
Now split the dataset for training and testing. We’ll use a 50/50 split since there are almost 50,000 samples in total:
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5,
                                                    random_state=0)
```
The next step is to train a classifier. We’ll use Scikit’s LogisticRegression class, which uses logistic regression to fit a model to the data:
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000, random_state=0)
model.fit(x_train, y_train)
```
Validate the trained model with the 50% of the dataset set aside for testing and show the results in a confusion matrix:
```python
%matplotlib inline
from sklearn.metrics import ConfusionMatrixDisplay as cmd

cmd.from_estimator(model, x_test, y_test,
                   display_labels=['Negative', 'Positive'],
                   cmap='Blues', xticks_rotation='vertical')
```
The confusion matrix reveals that the model correctly identified 10,795 negative reviews while misclassifying 1,574 of them. It correctly identified 10,966 positive reviews and got it wrong 1,456 times:
Now comes the fun part: analyzing text for sentiment. Use the following statements to produce a sentiment score for the sentence “The long lines and poor customer service really turned me off”:
```python
text = 'The long lines and poor customer service really turned me off'
model.predict_proba(vectorizer.transform([text]))[0][1]
```
Here’s the output:
```
0.09183447847778639
```
Now do the same for “The food was great and the service was excellent!”:
```python
text = 'The food was great and the service was excellent!'
model.predict_proba(vectorizer.transform([text]))[0][1]
```
If you expected a higher score for this one, you won’t be disappointed:
```
0.8536277207125618
```
Feel free to try sentences of your own and see if you agree with the sentiment scores the model predicts. It’s not perfect, but it’s good enough that if you run hundreds of reviews or comments through it, you should get a reliable indication of the sentiment expressed in the text.
Note
Sometimes CountVectorizer’s built-in list of stop words lowers the accuracy of a model because the list is so broad. As an experiment, remove stop_words='english' from CountVectorizer and run the code again. Check the confusion matrix. Does the accuracy increase or decrease? Feel free to vary other parameters such as min_df and ngram_range too. In the real world, data scientists often try many different parameter combinations to determine which one produces the best results.
Naive Bayes
Logistic regression is a go-to algorithm for classification models and is often very effective at classifying text. But in scenarios involving text classification, data scientists often turn to another learning algorithm called Naive Bayes. It’s a classification algorithm based on Bayes’ theorem, which provides a means for calculating conditional probabilities. Mathematically, Bayes’ theorem is stated this way:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
This says the probability that A is true given that B is true is equal to the probability that B is true given that A is true multiplied by the probability that A is true divided by the probability that B is true. That’s a mouthful, and while accurate, it doesn’t explain why Naive Bayes is so useful for classifying text—or how you apply it, for example, to a collection of emails to determine which ones are spam.
Let’s start with a simple example. Suppose 10% of all the emails you receive are spam. That’s P(A). Analysis reveals that 5% of the spam emails you receive contain the word congratulations, but just 1% of all your emails contain the same word. Therefore, P(B|A) is 0.05 and P(B) is 0.01. The probability of an email being spam if it contains the word congratulations is P(A|B), which is (0.05 x 0.10) / 0.01, or 0.50.
Of course, a spam filter must consider all the words in an email, not just one. It turns out that if you make some simple (naive) assumptions—that the order of the words in an email doesn’t matter, and that every word has equal weight—you can write Bayes’ equation this way for a spam classifier:

$$P(S \mid \textit{message}) \propto P(S) \times \prod_{i=1}^{n} P(\textit{word}_i \mid S)$$
In plain English, the probability that a message is spam is proportional to the product of:
- The probability that any message in the dataset is spam, or P(S)
- The probability that each word in the message appears in a spam message, or P(word|S)
P(S) can be calculated easily enough: it’s simply the fraction of the messages in the dataset that are spam messages. If you train a machine learning model with 1,000 messages and 500 of them are spam, then P(S) = 0.5. For a given word, P(word|S) is simply the number of times the word appears in spam messages divided by the number of words in all the spam messages. The entire problem reduces to word counts. You can do a similar calculation to compute the probability that the message is not spam, and then use the higher of the two probabilities to make a prediction.
Here’s an example involving four sample emails. The emails are:
Text | Spam |
---|---|
Raise your credit score in minutes | 1 |
Here are the minutes from yesterday’s meeting | 0 |
Meeting tomorrow to review yesterday’s scores | 0 |
Score tomorrow’s meds at yesterday’s prices | 1 |
If you remove stop words, convert characters to lowercase, and stem the words such that tomorrow’s becomes tomorrow, you’re left with this:
Text | Spam |
---|---|
raise credit score minute | 1 |
minute yesterday meeting | 0 |
meeting tomorrow review yesterday score | 0 |
score tomorrow med yesterday price | 1 |
Because two of the four messages are spam and two are not, the probability that any message is spam (P(S)) is 0.5. The same goes for the probability that any message is not spam (P(N) = 0.5). In addition, the spam messages contain a total of nine words, while the nonspam messages contain a total of eight.
The next step is to build the following table of word frequencies. Take the word yesterday as an example. It appears once in a message that’s labeled as spam, so P(yesterday|S) is 1/9, or 0.111. It appears twice in nonspam messages, so P(yesterday|N) is 2/8, or 0.250:
Word | P(word\|S) | P(word\|N) |
---|---|---|
raise | 1/9 = 0.111 | 0/8 = 0.000 |
credit | 1/9 = 0.111 | 0/8 = 0.000 |
score | 2/9 = 0.222 | 1/8 = 0.125 |
minute | 1/9 = 0.111 | 1/8 = 0.125 |
yesterday | 1/9 = 0.111 | 2/8 = 0.250 |
meeting | 0/9 = 0.000 | 2/8 = 0.250 |
tomorrow | 1/9 = 0.111 | 1/8 = 0.125 |
review | 0/9 = 0.000 | 1/8 = 0.125 |
med | 1/9 = 0.111 | 0/8 = 0.000 |
price | 1/9 = 0.111 | 0/8 = 0.000 |
This works up to a point, but the zeros in the table are a problem. Let’s say you want to determine whether “Scores must be reviewed by tomorrow” is spam. Removing stop words leaves you with “score review tomorrow.” You can compute the probability that the message is spam this way:

$$P(S \mid \textit{message}) \propto P(S) \times P(\textit{score} \mid S) \times P(\textit{review} \mid S) \times P(\textit{tomorrow} \mid S) = 0.5 \times 0.222 \times 0.000 \times 0.111 = 0$$
The result is 0 because review doesn’t appear in a spam message, and 0 times anything is 0. The algorithm simply can’t assign a spam probability to “Scores must be reviewed by tomorrow.”
A common way to resolve this is to apply Laplace smoothing, also known as additive smoothing. Typically, this involves adding 1 to each numerator and the number of unique words in the dataset (in this case, 10) to each denominator. Now, P(review|S) evaluates to (0 + 1) / (9 + 10), which equals 0.053. It’s not much, but it’s better than nothing (literally). Here are the word frequencies again, this time revised with Laplace smoothing:
Word | P(word\|S) | P(word\|N) |
---|---|---|
raise | (1 + 1) / (9 + 10) = 0.105 | (0 + 1) / (8 + 10) = 0.056 |
credit | (1 + 1) / (9 + 10) = 0.105 | (0 + 1) / (8 + 10) = 0.056 |
score | (2 + 1) / (9 + 10) = 0.158 | (1 + 1) / (8 + 10) = 0.111 |
minute | (1 + 1) / (9 + 10) = 0.105 | (1 + 1) / (8 + 10) = 0.111 |
yesterday | (1 + 1) / (9 + 10) = 0.105 | (2 + 1) / (8 + 10) = 0.167 |
meeting | (0 + 1) / (9 + 10) = 0.053 | (2 + 1) / (8 + 10) = 0.167 |
tomorrow | (1 + 1) / (9 + 10) = 0.105 | (1 + 1) / (8 + 10) = 0.111 |
review | (0 + 1) / (9 + 10) = 0.053 | (1 + 1) / (8 + 10) = 0.111 |
med | (1 + 1) / (9 + 10) = 0.105 | (0 + 1) / (8 + 10) = 0.056 |
price | (1 + 1) / (9 + 10) = 0.105 | (0 + 1) / (8 + 10) = 0.056 |
Now you can determine whether “Scores must be reviewed by tomorrow” is spam by performing two simple calculations:

$$P(S \mid \textit{message}) \propto 0.5 \times 0.158 \times 0.053 \times 0.105 \approx 0.00044$$

$$P(N \mid \textit{message}) \propto 0.5 \times 0.111 \times 0.111 \times 0.111 \approx 0.00068$$
By this measure, “Scores must be reviewed by tomorrow” is likely not to be spam. The probabilities are relative, but you could normalize them and conclude there’s about a 40% chance the message is spam and a 60% chance it’s not based on the emails the model was trained with.
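If you’d like to check the arithmetic, here’s a short snippet of my own that reproduces the hand calculation with the values from the smoothed table (it doesn’t use Scikit at all):

```python
# Relative probabilities: P(S) and P(N) times the smoothed probabilities
# of score, review, and tomorrow for each class
p_spam = 0.5 * 0.158 * 0.053 * 0.105
p_not_spam = 0.5 * 0.111 * 0.111 * 0.111

# Normalize so the two relative probabilities sum to 1
total = p_spam + p_not_spam
print(f'Spam: {p_spam / total:.2f}, Not spam: {p_not_spam / total:.2f}')
```

The output is approximately “Spam: 0.39, Not spam: 0.61,” which matches the 40/60 split described above.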
Fortunately, you don’t have to do these computations by hand. Scikit-Learn provides several classes to help out, including the MultinomialNB class, which works great with tables of word counts produced by CountVectorizer.
Spam Filtering
It’s no coincidence that modern spam filters are remarkably adept at identifying spam. Virtually all of them rely on machine learning. Such models are difficult to implement algorithmically because an algorithm that uses keywords such as credit and score to determine whether an email is spam is easily fooled. Machine learning, by contrast, looks at a body of emails and uses what it learns to classify the next email. Such models often achieve more than 99% accuracy. And they get smarter over time as they’re trained with more and more emails.
The previous example used logistic regression to predict whether text input to it expresses positive or negative sentiment. It used the probability that the text expresses positive sentiment as a sentiment score, and you saw that expressions such as “The long lines and poor customer service really turned me off” score close to 0.0, while expressions such as “The food was great and the service was excellent” score close to 1.0. Now let’s build a binary classification model that classifies emails as spam or not spam and use Naive Bayes to fit the model to the training data.
There are several spam classification datasets available in the public domain. Each contains a collection of emails with samples labeled with 1s for spam and 0s for not spam. We’ll use a relatively small dataset containing 1,000 samples. Begin by downloading the dataset and copying it into your notebooks’ Data subdirectory. Then load the data and display the first five rows:
```python
import pandas as pd

df = pd.read_csv('Data/ham-spam.csv')
df.head()
```
Now check for duplicate rows in the dataset:
```python
df.groupby('IsSpam').describe()
```
The dataset contains one duplicate row. Let’s remove it and check for balance:
```python
df = df.drop_duplicates()
df.groupby('IsSpam').describe()
```
The dataset now contains 499 samples that are not spam, and 500 that are. The next step is to use CountVectorizer to vectorize the emails. Once more, we’ll allow CountVectorizer to consider word pairs as well as individual words and remove stop words using Scikit’s built-in dictionary of English stop words:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
x = vectorizer.fit_transform(df['Text'])
y = df['IsSpam']
```
Split the dataset so that 80% can be used for training and 20% for testing:
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=0)
```
The next step is to train a Naive Bayes classifier using Scikit’s MultinomialNB class:
```python
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x_train, y_train)
```
Validate the trained model with the 20% of the dataset set aside for testing using a confusion matrix:
```python
%matplotlib inline
from sklearn.metrics import ConfusionMatrixDisplay as cmd

cmd.from_estimator(model, x_test, y_test,
                   display_labels=['Not Spam', 'Spam'],
                   cmap='Blues', xticks_rotation='vertical')
```
The model correctly identified 101 of 102 legitimate emails as not spam, and 95 of 98 spam emails as spam:
Use the score method to get a rough measure of the model’s accuracy:
```python
model.score(x_test, y_test)
```
Now use Scikit’s RocCurveDisplay class to visualize the ROC curve:
```python
from sklearn.metrics import RocCurveDisplay as rcd
import seaborn as sns

sns.set()
rcd.from_estimator(model, x_test, y_test)
```
The results are encouraging. Trained with just 999 samples, the area under the ROC curve (AUC) indicates the model is more than 99.9% accurate at classifying emails as spam or not spam:
Let’s see how the model classifies a few emails that it hasn’t seen before, starting with one that isn’t spam. The model’s predict method predicts a class—0 for not spam, or 1 for spam:
```python
msg = 'Can you attend a code review on Tuesday to make sure the logic is solid?'
input = vectorizer.transform([msg])
model.predict(input)[0]
```
The model says this message is not spam, but what’s the probability that it’s not spam? You can get that from predict_proba, which returns an array containing two values: the probability that the predicted class is 0, and the probability that the predicted class is 1, in that order:
```python
model.predict_proba(input)[0][0]
```
The model seems very sure that this email is legitimate:
```
0.9999497111473539
```
Now test the model with a spam message:
```python
msg = 'Why pay more for expensive meds when you can order them online ' \
      'and save $$$?'
input = vectorizer.transform([msg])
model.predict(input)[0]
```
What is the probability that the message is not spam?
```python
model.predict_proba(input)[0][0]
```
]
The answer is:
```
0.00021423891260677753
```
What is the probability that the message is spam?
```python
model.predict_proba(input)[0][1]
```
And the answer is:
```
0.9997857610873945
```
Observe that predict and predict_proba accept a list of inputs. Based on that, could you classify an entire batch of emails with one call to either method? How would you get the results for each email?
Recommender Systems
Another branch of machine learning that has proven its mettle in recent years is recommender systems—systems that recommend products or services to customers. Amazon’s recommender system reportedly drives 35% of its sales. The good news is that you don’t have to be Amazon to benefit from a recommender system, nor do you have to have Amazon’s resources to build one. They’re relatively simple to create once you learn a few basic principles.
Recommender systems come in many forms. Popularity-based systems present options to customers based on what products and services are popular at the time—for example, “Here are this week’s bestsellers.” Collaborative systems make recommendations based on what others have selected, as in “People who bought this book also bought these books.” Neither of these systems requires machine learning.
Content-based systems, by contrast, benefit greatly from machine learning. An example of a content-based system is one that says “if you bought this book, you might like these books also.” These systems require a means for quantifying similarity between items. If you liked the movie Die Hard, you might or might not like Monty Python and the Holy Grail. If you liked Toy Story, you’ll probably like A Bug’s Life too. But how do you make that determination algorithmically?
Content-based recommenders require two ingredients: a way to vectorize—convert to numbers—the attributes that characterize a service or product, and a means for calculating similarity between the resulting vectors. The first one is easy. CountVectorizer converts text into tables of word counts. All you need is a way to measure similarity between rows of word counts and you can build a recommender system. And one of the simplest and most effective ways to do that is a technique called cosine similarity.
Cosine Similarity
Cosine similarity is a mathematical means for computing the similarity between pairs of vectors (or rows of numbers treated as vectors). The basic idea is to take each value in a sample—for example, word counts in a row of vectorized text—and use them as endpoint coordinates for a vector, with the other endpoint at the origin of the coordinate system. Do that for two samples, and then compute the cosine between vectors in m-dimensional space, where m is the number of values in each sample. Because the cosine of 0 is 1, two identical vectors have a similarity of 1. The more dissimilar the vectors, the closer the cosine will be to 0.
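Expressed as a formula (this is the standard definition, not anything specific to Scikit), the cosine similarity of two m-dimensional vectors A and B is their dot product divided by the product of their magnitudes:

$$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert} = \frac{\sum_{i=1}^{m} A_i B_i}{\sqrt{\sum_{i=1}^{m} A_i^2}\,\sqrt{\sum_{i=1}^{m} B_i^2}}$$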
Here’s an example in two-dimensional space to illustrate. Suppose you have three rows containing two values each:
1 | 2 |
2 | 3 |
3 | 1 |
You want to determine whether row 2 is more similar to row 1 or row 3. It’s hard to tell just by looking at the numbers, and in real life, there are many more numbers. If you simply added the numbers in each row and compared the sums, you would conclude that row 2 is more similar to row 3. But what if you treated each row as a vector, as shown in Figure 4-2?
- Row 1: (0, 0) → (1, 2)
- Row 2: (0, 0) → (2, 3)
- Row 3: (0, 0) → (3, 1)
Now you can plot each row as a vector, compute the cosines of the angles formed by 1 and 2 and 2 and 3, and determine that row 2 is more like row 1 than row 3. That’s cosine similarity in a nutshell.
Cosine similarity isn’t limited to two dimensions; it works in higher-dimensional space as well. To help compute cosine similarities regardless of the number of dimensions, Scikit offers the cosine_similarity function. The following code computes the cosine similarities of the three samples in the preceding example:
```python
from sklearn.metrics.pairwise import cosine_similarity

data = [[1, 2], [2, 3], [3, 1]]
cosine_similarity(data)
```
The return value is a similarity matrix containing the cosines of every vector pair. The width and height of the matrix equals the number of samples:
```
array([[1.        , 0.99227788, 0.70710678],
       [0.99227788, 1.        , 0.78935222],
       [0.70710678, 0.78935222, 1.        ]])
```
From this, you can see that the similarity of rows 1 and 2 is 0.992, while the similarity of rows 2 and 3 is 0.789. In other words, row 2 is more similar to row 1 than it is to row 3. There is also more similarity between rows 2 and 3 (0.789) than there is between rows 1 and 3 (0.707).
Building a Movie Recommendation System
Let’s put cosine similarity to work building a content-based recommender system for movies. Start by downloading the dataset, which is one of several movie datasets available from Kaggle.com. This one has information for about 4,800 movies, including title, budget, genres, keywords, cast, and more. Place the CSV file in your Jupyter notebooks’ Data subdirectory. Then load the dataset and peruse its contents:
```python
import pandas as pd

df = pd.read_csv('Data/movies.csv')
df.head()
```
The dataset contains 24 columns, only a few of which are needed to describe a movie. Use the following statements to extract key columns such as title and genres and fill missing values with empty strings:
```python
df = df[['title', 'genres', 'keywords', 'cast', 'director']]
df = df.fillna('')  # Fill missing values with empty strings
df.head()
```
Next, add a column named features that combines all the words in the other columns:
```python
df['features'] = df['title'] + ' ' + df['genres'] + ' ' + \
                 df['keywords'] + ' ' + df['cast'] + ' ' + \
                 df['director']
```
Use CountVectorizer to vectorize the text in the features column:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english', min_df=20)
word_matrix = vectorizer.fit_transform(df['features'])
word_matrix.shape
```
The table of word counts contains 4,803 rows—one for each movie—and 918 columns. The next task is to compute cosine similarities for each row pair:
```python
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(word_matrix)
```
Ultimately, the goal of this system is to input a movie title and identify the n movies that are most similar to that movie. To that end, define a function named get_recommendations that accepts a movie title, a DataFrame containing information about all the movies, a similarity matrix, and the number of movie titles to return:
```python
def get_recommendations(title, df, sim, count=10):
    # Get the row index of the specified title in the DataFrame
    index = df.index[df['title'].str.lower() == title.lower()]

    # Return an empty list if there is no entry for the specified title
    if (len(index) == 0):
        return []

    # Get the corresponding row in the similarity matrix
    similarities = list(enumerate(sim[index[0]]))

    # Sort the similarity scores in that row in descending order
    recommendations = sorted(similarities, key=lambda x: x[1], reverse=True)

    # Get the top n recommendations, ignoring the first entry in the list since
    # it corresponds to the title itself (and thus has a similarity of 1.0)
    top_recs = recommendations[1:count + 1]

    # Generate a list of titles from the indexes in top_recs
    titles = []
    for i in range(len(top_recs)):
        title = df.iloc[top_recs[i][0]]['title']
        titles.append(title)

    return titles
```
This function sorts the cosine similarities in descending order to identify the count movies most like the one identified by the title parameter. Then it returns the titles of those movies.
Now use get_recommendations to search the database for similar movies. First ask for the 10 movies that are most similar to the James Bond thriller Skyfall:
```python
get_recommendations('Skyfall', df, sim)
```
Here is the output:
```
['Spectre', 'Quantum of Solace', 'Johnny English Reborn', 'Clash of the Titans',
 'Die Another Day', 'Diamonds Are Forever', 'Wrath of the Titans', 'I Spy',
 'Sanctum', 'Blackthorn']
```
Call get_recommendations again to list movies that are like Mulan:
```python
get_recommendations('Mulan', df, sim)
```
Feel free to try other movies as well. Note that you can only input movie titles that are in the dataset. Use the following statements to print a complete list of titles:
```python
pd.set_option('display.max_rows', None)
df['title']
```
I think you’ll agree that the system does a pretty credible job of picking similar movies. Not bad for about 20 lines of code!
Summary
Machine learning models that classify text are common and see a variety of uses in industry and in everyday life. What rational human being doesn’t wish for a magic wand that eradicates all spam emails, for example?
Text used to train a text classification model must be prepared and vectorized prior to training. Preparation includes converting characters to lowercase and removing punctuation characters, and may include removing stop words, removing numbers, and stemming or lemmatizing. Once prepared, text is vectorized by converting it into a table of word frequencies. Scikit’s CountVectorizer class makes short work of the vectorization process and handles some of the preparation duties too.
Logistic regression and other popular classification algorithms can be used to classify text once it’s converted to numerical form. For text classification tasks, however, the Naive Bayes learning algorithm frequently outperforms other algorithms. By making a few “naive” assumptions such as that the order in which words appear in a text sample doesn’t matter, Naive Bayes reduces to a process of word counting. Scikit’s MultinomialNB class provides a handy Naive Bayes implementation.
Cosine similarity is a mathematical means for computing the similarity between two rows of numbers. One use for it is building systems that recommend products or services based on other products or services that a customer has purchased. Word frequency tables produced from textual descriptions by CountVectorizer can be combined with cosine similarity to create intelligent recommender systems intended to supplement a company’s bottom line.
Feel free to use this chapter’s examples as a starting point for experiments of your own. For instance, see if you can tweak the parameters passed to CountVectorizer in any of the examples and increase the accuracy of the resulting model. Data scientists call the search for the optimum parameter combination hyperparameter tuning, and it’s a subject you’ll learn about in the next chapter.