CHAPTER 7
k-Nearest Neighbors (k-NN)

In this chapter, we describe the k-nearest-neighbors algorithm that can be used for classification (of a categorical outcome) or prediction (of a numerical outcome). To classify or predict a new record, the method relies on finding “similar” records in the training data. These “neighbors” are then used to derive a classification or prediction for the new record by voting (for classification) or averaging (for prediction). We explain how similarity is determined, how the number of neighbors is chosen, and how a classification or prediction is computed. k-NN is a highly automated data-driven method. We discuss the advantages and weaknesses of the k-NN method in terms of performance and practical considerations such as computational time.

Python

In this chapter, we will use pandas for data handling, scikit-learn for k-nearest neighbors models, and matplotlib for visualization.

# import required functionality for this chapter

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
import matplotlib.pylab as plt
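
To make the role of these imports concrete, here is a minimal sketch, relying on the imports above and using made-up two-predictor data (not one of the chapter's datasets), of how NearestNeighbors can locate the k most similar training records for a new record:

# minimal sketch: find the k = 3 training records nearest to a new record
# (the toy data below are made up for illustration)
train_df = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 8.0, 9.0],
                         'x2': [1.0, 1.5, 3.5, 8.0, 8.5]})
new_record = pd.DataFrame({'x1': [2.5], 'x2': [2.0]})

nn = NearestNeighbors(n_neighbors=3)
nn.fit(train_df)
distances, indices = nn.kneighbors(new_record)
print(indices)    # row positions of the 3 most similar training records
print(distances)  # their Euclidean distances to the new record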

7.1 The k-NN Classifier (Categorical Outcome)

The idea in k-nearest-neighbors methods is to identify k records in the training dataset that are similar to a new record that we wish to classify.
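
As a sketch of this idea (again with made-up data rather than the chapter's example, and relying on the imports above), KNeighborsClassifier finds the k nearest training records and assigns the new record the majority class among them:

# minimal sketch: classify a new record by majority vote among its
# k = 3 nearest neighbors (toy data made up for illustration)
train_df = pd.DataFrame({'x1': [45, 50, 62, 90, 95, 100],
                         'x2': [16, 18, 17, 22, 24, 23],
                         'outcome': ['no', 'no', 'no', 'yes', 'yes', 'yes']})

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_df[['x1', 'x2']], train_df['outcome'])

new_record = pd.DataFrame({'x1': [60], 'x2': [20]})
print(knn.predict(new_record))        # majority class among the 3 neighbors
print(knn.predict_proba(new_record))  # fraction of neighbors in each class

In practice the predictors are usually normalized first (for example, with the preprocessing module imported above) so that no single variable dominates the distance calculation.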
