It is well for the heart to be naive and for the mind not to be.

Anatole France

A social network isn’t much good if people can’t network. Accordingly, DataSciencester has a popular feature that allows members to send messages to other members. And while most members are responsible citizens who send only well-received “how’s it going?” messages, a few miscreants persistently spam other members about get-rich schemes, no-prescription-required pharmaceuticals, and for-profit data science credentialing programs. Your users have begun to complain, and so the VP of Messaging has asked you to use data science to figure out a way to filter out these spam messages.

Imagine a “universe” that consists of receiving a message chosen randomly from all possible messages. Let *S* be the event “the message is spam” and *B* be the event “the message contains the word *bitcoin*.” Bayes’s theorem tells us that the probability that the message is spam conditional on containing the word *bitcoin* is:

$$P(S \mid B) = \frac{P(B \mid S)\,P(S)}{P(B \mid S)\,P(S) + P(B \mid \neg S)\,P(\neg S)}$$

The numerator is the probability that a message is spam *and* contains *bitcoin*, while the denominator is just the probability that a message contains *bitcoin*. Hence, you can think of this calculation as simply representing the proportion of *bitcoin* messages that are spam.
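To make the calculation concrete, here is a minimal sketch in Python. The function name `p_spam_given_word` and the input probabilities are hypothetical, chosen only for illustration; they are not estimates from any real collection of messages.

```python
def p_spam_given_word(p_word_given_spam: float,
                      p_word_given_not_spam: float,
                      p_spam: float) -> float:
    """Apply Bayes's theorem to compute P(S | B)."""
    # Numerator: probability a message is spam *and* contains the word
    numerator = p_word_given_spam * p_spam
    # Denominator: total probability a message contains the word
    denominator = numerator + p_word_given_not_spam * (1 - p_spam)
    return numerator / denominator

# Hypothetical inputs: half of all messages are spam, half of spam
# messages contain "bitcoin", and 1% of non-spam messages do.
posterior = p_spam_given_word(p_word_given_spam=0.5,
                              p_word_given_not_spam=0.01,
                              p_spam=0.5)
print(f"P(spam | bitcoin) = {posterior:.3f}")
```

Under these made-up numbers, nearly all *bitcoin* messages are spam, because the word is so much more common in spam than in non-spam.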

If we have a large collection of messages we know are spam, and a large collection of messages ...
