Chapter 7. k-Nearest Neighbors (k-NN)
In this chapter we describe the k-nearest neighbors (k-NN) algorithm, which can be used for classification (of a categorical outcome) or prediction (of a numerical outcome). To classify or predict a new record, the method relies on finding "similar" records in the training data. These "neighbors" are then used to derive a classification or prediction for the new record: by voting (for classification) or by averaging (for prediction). We explain how similarity is determined, how the number of neighbors is chosen, and how a classification or prediction is computed. k-NN is a highly automated, data-driven method. We discuss the advantages and weaknesses of k-NN in terms of performance and practical considerations such as computational time.
k-NN Classifier (categorical outcome)
The idea in k-nearest neighbor methods is to identify k records in the training dataset that are similar to a new record that we wish to classify. We then use these similar (neighboring) records to classify the new record into a class, assigning the new record to the predominant class among these neighbors. Denote by (x1, x2, ... , xp) the values of the predictors for this new record. We look for records in our training data that are similar or "near" the record to be classified in the predictor space (i.e., records that have values close to x1, x2, ... , xp). Then, based on the classes to which those proximate records belong, we assign a class to the record that we want to classify. ...
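The procedure just described can be sketched in a few lines of code: compute the distance from the new record to every training record, keep the k closest, and take a majority vote among their classes. The sketch below uses Euclidean distance and a small made-up dataset (the variable names, labels, and values are illustrative, not from the chapter); in practice the predictors would be normalized first, as discussed later.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training records.

    Distances are Euclidean, so predictors are assumed to be on
    comparable scales (normalize them first in practice).
    """
    distances = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training record
    nearest = np.argsort(distances)[:k]                  # indices of the k closest records
    votes = Counter(y_train[i] for i in nearest)         # tally class labels among neighbors
    return votes.most_common(1)[0][0]                    # predominant class

# Hypothetical example: two predictors, two classes
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])
y_train = np.array(["owner", "nonowner", "nonowner", "nonowner", "owner"])
print(knn_classify(X_train, y_train, np.array([2.0, 2.0]), k=3))
```

Note that ties are possible when k is even and the neighbors split evenly between classes; choosing an odd k (for a two-class problem) avoids this.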