Chapter 13. Naive Bayes
It is well for the heart to be naive and for the mind not to be.
Anatole France
A social network isn’t much good if people can’t network. Accordingly, DataSciencester has a popular feature that allows members to send messages to other members. And while most members are responsible citizens who send only well-received “how’s it going?” messages, a few miscreants persistently spam other members about get-rich schemes, no-prescription-required pharmaceuticals, and for-profit data science credentialing programs. Your users have begun to complain, and so the VP of Messaging has asked you to use data science to figure out a way to filter out these spam messages.
A Really Dumb Spam Filter
Imagine a “universe” that consists of receiving a message chosen randomly from all possible messages. Let S be the event “the message is spam” and B be the event “the message contains the word bitcoin.” Bayes’s theorem tells us that the probability that the message is spam conditional on containing the word bitcoin is:

$$P(S \mid B) = \frac{P(B \mid S)\,P(S)}{P(B \mid S)\,P(S) + P(B \mid \neg S)\,P(\neg S)}$$
The numerator is the probability that a message is spam and contains bitcoin, while the denominator is just the probability that a message contains bitcoin. Hence, you can think of this calculation as simply representing the proportion of bitcoin messages that are spam.
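To make the "proportion of bitcoin messages that are spam" reading concrete, here is a minimal Python sketch. The function name `spam_probability_given_word` and all of the numbers are hypothetical, chosen only for illustration; they do not come from any real message data.

```python
import math

def spam_probability_given_word(p_word_given_spam: float,
                                p_word_given_ham: float,
                                p_spam: float) -> float:
    """P(spam | word) via Bayes's theorem (hypothetical helper)."""
    # Numerator: probability a message is spam AND contains the word.
    numerator = p_word_given_spam * p_spam
    # Denominator: total probability a message contains the word,
    # whether it is spam or not.
    denominator = numerator + p_word_given_ham * (1 - p_spam)
    return numerator / denominator

# Made-up inputs: half of all messages are spam, 50% of spam messages
# contain "bitcoin", and 1% of non-spam messages do.
p = spam_probability_given_word(0.5, 0.01, 0.5)

# The same answer from raw counts: out of 1,000 messages, 500 are spam,
# 250 of the spam messages contain "bitcoin", and 5 of the 500 non-spam
# messages contain it.
spam_with_word = 250
ham_with_word = 5
proportion = spam_with_word / (spam_with_word + ham_with_word)

print(math.isclose(p, proportion))
```

Under these assumed numbers, both routes give 250/255, or roughly 98%: the Bayes's theorem calculation and the direct count of spam among bitcoin messages agree.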
If we have a large collection of messages we know are spam, and a large collection of messages ...