It is well for the heart to be naive and for the mind not to be.

Anatole France

A social network isn’t much good if people can’t network. Accordingly, DataSciencester has a popular feature that allows members to send messages to other members. And while most of your members are responsible citizens who send only well-received “how’s it going?” messages, a few miscreants persistently spam other members about get-rich schemes, no-prescription-required pharmaceuticals, and for-profit data science credentialing programs. Your users have begun to complain, and so the VP of Messaging has asked you to use data science to figure out a way to filter out these spam messages.

Imagine a “universe” that consists of receiving a message chosen randomly from all possible messages. Let *S* be the event “the message is spam” and *V* be the event “the message contains the word *viagra*.” Then Bayes’s Theorem tells us that the probability that the message is spam conditional on containing the word *viagra* is:

$$P(S \mid V) = \frac{P(V \mid S)\,P(S)}{P(V \mid S)\,P(S) + P(V \mid \neg S)\,P(\neg S)}$$

The numerator is the probability that a message is spam *and* contains *viagra*, while the denominator is just the probability that a message contains *viagra*. Hence you can think of this calculation as simply representing the proportion of *viagra* messages that are spam.
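As a minimal sketch of this calculation, the function below computes the probability of spam given a word from three inputs: the probability that a spam message contains the word, the probability that a nonspam message contains it, and the prior probability of spam. The function name, its parameter names, and the numbers in the usage example are illustrative assumptions, not values taken from real message data.

```python
def spam_probability_given_word(p_word_given_spam: float,
                                p_word_given_ham: float,
                                p_spam: float) -> float:
    """Bayes's theorem: P(S | V) from P(V | S), P(V | not S), and P(S).

    All names here are hypothetical and chosen for illustration.
    """
    # Numerator: probability that a message is spam AND contains the word
    numerator = p_word_given_spam * p_spam
    # Denominator: total probability that a message contains the word,
    # summed over the spam and nonspam cases
    denominator = numerator + p_word_given_ham * (1 - p_spam)
    return numerator / denominator


# Made-up inputs: half of spam and 1% of nonspam contain the word,
# and a message is equally likely to be spam or not.
p = spam_probability_given_word(0.5, 0.01, 0.5)
print(round(p, 4))  # 0.9804
```

With these assumed inputs, about 98% of messages containing the word would be spam, which matches the reading of the result as the proportion of *viagra* messages that are spam.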

If we have a large collection of messages we know are spam, and a large collection ...
