3. Naïve Bayes and the Incredible Lightness of Being an Idiot

In this chapter, we're going to talk about naïve Bayes, a common data science technique often used for document classification. Naïve Bayes shows up in all sorts of textual analytics, including spam detection and tweet classification. For a model like this, you supply the training data: prelabeled examples that show the training algorithm what you're looking for. From there, the trained model classifies new documents into the predefined categories.
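To make that workflow concrete, here's a minimal sketch in Python using scikit-learn. The tiny training documents, their labels, and the new message are all invented for illustration, and scikit-learn is just one convenient way to run the train-then-classify loop; nothing in this chapter assumes it.

# A toy naïve Bayes classifier; the documents and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "win cash now click here",     # spam
    "limited offer win a prize",   # spam
    "lunch at noon tomorrow",      # not spam ("ham")
    "here are the meeting notes",  # not spam ("ham")
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each document into word counts (a "bag of words").
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)

# Train on the prelabeled examples...
model = MultinomialNB()
model.fit(X, labels)

# ...then classify a new, unseen document.
new_doc = vectorizer.transform(["click here to win a prize"])
print(model.predict(new_doc))  # prints ['spam']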

You'll quickly learn that naïve Bayes is easy. Stupid easy, in fact. But wrapping our minds around naïve Bayes requires that we understand probability and Bayes' rule. What you'll find is that once we establish the rules of dependence in probability, naïve Bayes tosses those rules aside, and it's oddly still very effective anyway.
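For reference, here's Bayes' rule in standard notation, along with the "naïve" independence assumption the model piles on top. This is just a preview (the chapter builds both up from scratch), and the class and word symbols are generic placeholders:

p(\text{class} \mid \text{words}) = \frac{p(\text{words} \mid \text{class}) \, p(\text{class})}{p(\text{words})}

p(\text{words} \mid \text{class}) \approx p(\text{word}_1 \mid \text{class}) \times p(\text{word}_2 \mid \text{class}) \times \cdots

Let's start with a super quick intro to probability.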

The World's Fastest Intro to Probability Theory

Before we move forward, let's talk probability. Probability is the likelihood that something will happen, measured between 0 and 1 and often reported as a percentage. When we talk about probability, we use the notation p(). For instance:

p(Jordan eats hot dogs) = 1 (or 100%)
p(Jordan eats ketchup) = 0.0000001

I love hot dogs, so of course there's a 100 percent chance I will eat them. But I don't like ketchup, and you'll ...
