CHAPTER 7
k-Nearest Neighbors (k-NN)
In this chapter, we describe the k-nearest-neighbors algorithm that can be used for classification (of a categorical outcome) or prediction (of a numerical outcome). To classify or predict a new record, the method relies on finding “similar” records in the training data. These “neighbors” are then used to derive a classification or prediction for the new record by voting (for classification) or averaging (for prediction). We explain how similarity is determined, how the number of neighbors is chosen, and how a classification or prediction is computed. k-NN is a highly automated data-driven method. We discuss the advantages and weaknesses of the k-NN method in terms of performance and practical considerations such as computational time.
Python
In this chapter, we will use pandas for data handling, scikit-learn for k-nearest neighbors models, and matplotlib for visualization.
# import required functionality for this chapter
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
import matplotlib.pylab as plt
7.1 The k-NN Classifier (Categorical Outcome)
The idea in k-nearest-neighbors methods is to identify k records in the training dataset that are similar to a new record that we wish to classify.
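To make this concrete, the following is a minimal sketch of fitting a k-NN classifier with scikit-learn's KNeighborsClassifier, imported above. The tiny two-predictor dataset and the choice k = 3 are illustrative assumptions rather than data from this chapter; in practice the predictors would first be standardized (for example, with preprocessing.StandardScaler) so that no single variable dominates the distance computation.

# illustrative sketch (assumed, made-up data): classify a new record by
# majority vote among its k = 3 nearest neighbors in the training set
import pandas as pd
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

train_df = pd.DataFrame({
    'income': [60, 85, 75, 30, 45, 95],
    'lot_size': [18, 23, 20, 15, 17, 25],
    'owner': ['owner', 'owner', 'owner', 'nonowner', 'nonowner', 'owner'],
})

# standardize the predictors so both contribute comparably to Euclidean distance
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(train_df[['income', 'lot_size']])
y = train_df['owner']

# fit a 3-nearest-neighbors classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# classify a new record by majority vote among its 3 nearest training neighbors
new_record = pd.DataFrame({'income': [70], 'lot_size': [19]})
print(knn.predict(scaler.transform(new_record)))

The same pattern carries over to prediction of a numerical outcome: swapping in KNeighborsRegressor averages the outcomes of the k nearest neighbors instead of taking a majority vote.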