Data Algorithms by Mahmoud Parsian

Chapter 14. Naive Bayes

In data mining and machine learning, there are many classification algorithms. One of the simplest but most effective is the Naive Bayes classifier (NBC). The main focus of this chapter is to present a distributed MapReduce implementation (using Spark) of the NBC, which combines a supervised learning method with a probabilistic classifier. Naive Bayes is a linear classifier; to understand it, we need to understand some basic and conditional probability. When dealing with numeric data, it is often better to use clustering techniques (such as the K-Means or k-Nearest Neighbors algorithms), but for classifying names, symbols, emails, and text, a probabilistic method such as the NBC may be a better choice. In some cases, the NBC is used to classify numeric data as well. In the following section, you will see examples of both symbolic and numeric data.

The NBC is a probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between features. In a nutshell, an NBC assigns an input to one of k classes {C1, C2, ..., Ck} based on some properties (features) of that input. NBCs have applications such as email spam filtering and document classification.
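To make the classification rule concrete: under the naive independence assumption, the classifier picks the class Ck that maximizes P(Ck) multiplied by the product of P(xi | Ck) over all features xi. The following is a minimal pure-Python sketch of that rule, not the book's Spark implementation; the feature name, priors, and likelihood values are hypothetical.

```python
def naive_bayes_classify(sample, priors, likelihoods):
    """Return the class Ck maximizing P(Ck) * product of P(xi | Ck).

    sample:      dict mapping feature name -> observed value
    priors:      dict mapping class -> P(class)
    likelihoods: dict mapping class -> {(feature, value): P(feature=value | class)}
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for feature, value in sample.items():
            # Unseen (feature, value) pairs get probability 0 in this sketch;
            # real implementations apply smoothing (e.g., Laplace) instead.
            score *= likelihoods[c].get((feature, value), 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical spam-filter data: two classes, one boolean feature.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {("contains_offer", True): 0.8, ("contains_offer", False): 0.2},
    "ham":  {("contains_offer", True): 0.1, ("contains_offer", False): 0.9},
}

print(naive_bayes_classify({"contains_offer": True}, priors, likelihoods))
# spam: 0.4 * 0.8 = 0.32 beats ham: 0.6 * 0.1 = 0.06, so this prints "spam"
```

Because only the largest score matters, production implementations typically sum log-probabilities instead of multiplying raw probabilities, which avoids floating-point underflow when there are many features.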

For example, a spam filter using a Naive Bayes classifier will assign each email to one of two classes: spam or not spam. Since Naive Bayes is a supervised learning method, it has two distinct stages:

Stage 1: Training (see Figure 14-1)
This stage ...
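In general terms, the training stage of a Naive Bayes classifier estimates the prior P(C) for each class and the conditional probability P(feature = value | C) for each feature, by counting frequencies in the labeled training data. The sketch below shows that counting in plain Python (the distributed Spark version in this chapter performs the same counts with MapReduce-style operations); the training records are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records):
    """Estimate P(C) and P(feature=value | C) from labeled records.

    records: list of (features_dict, class_label) pairs.
    Returns (priors, likelihoods) in the same shapes a classifier would consume.
    """
    class_counts = Counter(label for _, label in records)
    total = len(records)
    priors = {c: n / total for c, n in class_counts.items()}

    # Count how often each (feature, value) pair occurs within each class.
    cond_counts = defaultdict(Counter)
    for features, label in records:
        for fv in features.items():
            cond_counts[label][fv] += 1

    likelihoods = {
        c: {fv: n / class_counts[c] for fv, n in counts.items()}
        for c, counts in cond_counts.items()
    }
    return priors, likelihoods

# Hypothetical labeled training set with one boolean feature.
data = [
    ({"contains_offer": True}, "spam"),
    ({"contains_offer": True}, "spam"),
    ({"contains_offer": False}, "ham"),
    ({"contains_offer": False}, "ham"),
    ({"contains_offer": True}, "ham"),
]
priors, likelihoods = train_naive_bayes(data)
print(priors["spam"])  # 2 of 5 records are spam, so this prints 0.4
```

This sketch uses raw relative frequencies; a robust trainer would add smoothing so that feature values never seen with a class do not receive probability zero.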
