Skip to Main Content
Data Algorithms
book

Data Algorithms

by Mahmoud Parsian
July 2015
Intermediate to advanced content levelIntermediate to advanced
778 pages
17h 9m
English
O'Reilly Media, Inc.
Content preview from Data Algorithms

Chapter 13. k-Nearest Neighbors

This chapter focuses on an important machine learning algorithm called k-Nearest Neighbors (kNN), where k is an integer greater than 0. The kNN classification problem is to find the k nearest data points in a data set to a given query data point. This operation is also known as a kNN join, and can be defined as: given two data sets R and S, we want to find the k nearest Neighbor from S for every object in R. In data mining, R and S are called the query and training data sets, respectively. The training data set (S) refers to data that has already been classified, while the query data set (R) refers to data that is going to be classified using S’s classifications.

The objective of this chapter is to present a MapReduce solution using Spark for the kNN join algorithm. The kNN algorithm has been discussed extensively in many machine learning books, such as Machine Learning for Hackers[8] and Machine Learning in Action[10]. kNN is an important clustering algorithm that has many applications in data mining (such as pattern recognition) and bioinformatics (such as the breast cancer diagnosis problem[23], the weather generating model, and product recommendation systems). The kNN algorithm can be fairly expensive, especially when we have a large training set, which is why MapReduce is well suited for it.

So what exactly is kNN? Nearest Neighbor analysis, or Nearest Neighbor search, is an algorithm for classifying n-dimensional objects1 based on their similarity ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Data Algorithms with Spark

Data Algorithms with Spark

Mahmoud Parsian
Algorithms and Data Structures for Massive Datasets

Algorithms and Data Structures for Massive Datasets

Dzejla Medjedovic, Emin Tahirovic, Ines Schweigert
Data Mesh

Data Mesh

Zhamak Dehghani
Learning Algorithms

Learning Algorithms

George Heineman

Publisher Resources

ISBN: 9781491906170Errata PageSupplemental Content