3. Naïve Bayes and the Incredible Lightness of Being an Idiot

In this chapter, we're going to talk about naïve Bayes, a common data science technique often used for document classification. Naïve Bayes shows up in all sorts of textual analytics, including spam detection and tweet classification. For a model like this, you supply the training data: prelabeled examples that show the training algorithm what you're looking for. From there, the trained model classifies new documents into the predefined categories.
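To make that workflow concrete, here's a minimal sketch in Python using scikit-learn. The tiny training documents, their labels, and the new message are all invented for illustration, and scikit-learn is just one convenient way to run the train-then-classify loop; nothing in this chapter assumes it.

# A toy naïve Bayes classifier; the documents and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "win cash now click here",     # spam
    "limited offer win a prize",   # spam
    "lunch at noon tomorrow",      # not spam ("ham")
    "here are the meeting notes",  # not spam ("ham")
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each document into word counts (a "bag of words").
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)

# Train on the prelabeled examples...
model = MultinomialNB()
model.fit(X, labels)

# ...then classify a new, unseen document.
new_doc = vectorizer.transform(["click here to win a prize"])
print(model.predict(new_doc))  # prints ['spam']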

You'll quickly learn that naïve Bayes is easy. Stupid easy, in fact. But wrapping our minds around naïve Bayes requires that we understand probability and Bayes' rule. What you'll find is that once we establish the rules of dependence in probability, naïve Bayes tosses those rules aside, and it's oddly still very effective anyway.
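For reference, here's Bayes' rule in standard notation, along with the "naïve" independence assumption the model piles on top. This is just a preview (the chapter builds both up from scratch), and the class and word symbols are generic placeholders:

p(\text{class} \mid \text{words}) = \frac{p(\text{words} \mid \text{class}) \, p(\text{class})}{p(\text{words})}

p(\text{words} \mid \text{class}) \approx p(\text{word}_1 \mid \text{class}) \times p(\text{word}_2 \mid \text{class}) \times \cdots

Let's start with a super quick intro to probability.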

The World's Fastest Intro to Probability Theory

Before we move forward, let's talk probability. Probability is the likelihood that something will happen, measured between 0 and 1 and often reported as a percentage. When we talk about probability, we use the notation p(). For instance:

p(Jordan eats hot dogs) = 1 (or 100%)
p(Jordan eats ketchup) = 0.0000001

I love hot dogs, so of course there's a 100 percent chance I will eat them. But I don't like ketchup, and you'll ...
