Skip to Main Content
Data Algorithms
book

Data Algorithms

by Mahmoud Parsian
July 2015
Intermediate to advanced content levelIntermediate to advanced
778 pages
17h 9m
English
O'Reilly Media, Inc.
Content preview from Data Algorithms

Chapter 14. Naive Bayes

In data mining and machine learning, there are many classification algorithms. One of the simplest but most effective is the Naive Bayes classifier (NBC). The main focus of this chapter is to present a distributed MapReduce implementation (using Spark) of the NBC that is a combination of a supervised learning method and probabilistic classifier. Naive Bayes is a linear classifier. To understand it, we need to understand some basic and conditional probabilities. When we are dealing with numeric data, it is better to use clustering techniques (such as K-Means and k-Nearest Neighbors methods and algorithms), but for classification of names, symbols, emails, and texts, it may be better to use a probabilistic method such as the NBC. In some cases, the NBC is used to classify numeric data as well. In the following section, you will see examples of both symbolic and numeric data.

The NBC is a probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. In a nutshell, an NBC assigns inputs into one of the k classes {C1, C2, ..., Ck} based on some properties (features) of the inputs. NBCs have applications such as email spam filtering and document classification.

For example, a spam filter using a Naive Bayes classifier will assign each email to one of two clusters: spam mail or not a spam mail. Since Naive Bayes is a supervised learning method, it has two distinct stages:

Stage 1: Training (see Figure 14-1)
This stage ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Data Algorithms with Spark

Data Algorithms with Spark

Mahmoud Parsian
Algorithms and Data Structures for Massive Datasets

Algorithms and Data Structures for Massive Datasets

Dzejla Medjedovic, Emin Tahirovic, Ines Schweigert
Data Mesh

Data Mesh

Zhamak Dehghani
Learning Algorithms

Learning Algorithms

George Heineman

Publisher Resources

ISBN: 9781491906170Errata PageSupplemental Content