This chapter focuses on an important machine learning algorithm called k-Nearest Neighbors (kNN), where *k* is an integer greater than 0. The kNN classification problem is to find the *k* nearest data points in a data set to a given query data point. This operation is also known as a *kNN join*, and can be defined as: given two data sets *R* and *S*, we want to find the *k* nearest Neighbor from *S* for every object in *R*. In data mining, *R* and *S* are called the *query* and *training* data sets, respectively. The training data set (*S*) refers to data that has already been classified, while the query data set (*R*) refers to data that is going to be classified using *S*’s classifications.

The objective of this chapter is to present a MapReduce solution using Spark for the kNN join algorithm. The kNN algorithm has been discussed extensively in many machine learning books, such as *Machine Learning for Hackers*[8] and *Machine Learning in Action*[10]. kNN is an important clustering algorithm that has many applications in data mining (such as pattern recognition) and bioinformatics (such as the breast cancer diagnosis problem[23], the weather generating model, and product recommendation systems). The kNN algorithm can be fairly expensive, especially when we have a large training set, which is why MapReduce is well suited for it.

So what exactly is kNN? Nearest Neighbor analysis, or Nearest Neighbor search, is an algorithm for classifying *n*-dimensional objects^{1} based on their similarity ...

Start Free Trial

No credit card required