Chapter 13. Naive Bayes

It is well for the heart to be naive and for the mind not to be.

Anatole France

A social network isn’t much good if people can’t network. Accordingly, DataSciencester has a popular feature that allows members to send messages to other members. And while most of your members are responsible citizens who send only well-received “how’s it going?” messages, a few miscreants persistently spam other members about get-rich schemes, no-prescription-required pharmaceuticals, and for-profit data science credentialing programs. Your users have begun to complain, and so the VP of Messaging has asked you to use data science to figure out a way to filter out these spam messages.

A Really Dumb Spam Filter

Imagine a “universe” that consists of receiving a message chosen randomly from all possible messages. Let S be the event “the message is spam” and V be the event “the message contains the word viagra.” Then Bayes’s Theorem tells us that the probability that the message is spam conditional on containing the word viagra is:

$$P(S \mid V) = \frac{P(V \mid S)\,P(S)}{P(V \mid S)\,P(S) + P(V \mid \neg S)\,P(\neg S)}$$

The numerator is the probability that a message is spam and contains viagra, while the denominator is just the probability that a message contains viagra. Hence you can think of this calculation as simply representing the proportion of viagra messages that are spam.
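As a rough illustration of this calculation, here is a minimal Python sketch. The function names and every number in it are hypothetical, chosen only for illustration and not taken from any real message data. It computes the probability of spam given the word in two ways: once from the conditional probabilities, and once as a plain proportion of counts, to show that the two agree.

```python
def spam_probability_given_word(p_word_given_spam: float,
                                p_word_given_ham: float,
                                p_spam: float = 0.5) -> float:
    """P(spam | word) from Bayes's theorem.

    The default prior p_spam=0.5 is an assumption made for this sketch.
    """
    p_ham = 1 - p_spam
    # P(message is spam AND contains the word)
    numerator = p_word_given_spam * p_spam
    # P(message contains the word), spam or not
    denominator = numerator + p_word_given_ham * p_ham
    if denominator == 0:
        raise ValueError("the word never appears, so P(spam | word) is undefined")
    return numerator / denominator


def proportion_spam_among_word_messages(spam_with_word: int,
                                        ham_with_word: int) -> float:
    """Fraction of the messages containing the word that are spam."""
    return spam_with_word / (spam_with_word + ham_with_word)


# Hypothetical inputs: the word appears in 50% of spam and 1% of non-spam.
from_probabilities = spam_probability_given_word(0.5, 0.01)

# The same situation as counts: 500 spam and 500 non-spam messages,
# of which 250 spam and 5 non-spam messages contain the word.
from_counts = proportion_spam_among_word_messages(250, 5)

print(round(from_probabilities, 2))  # 0.98
print(round(from_counts, 2))         # 0.98
```

Under these made-up inputs, both routes give about 98%, which matches the reading of the formula as the proportion of viagra messages that are spam.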

If we have a large collection of messages we know are spam, and a large collection ...
