CHAPTER 12 Discriminant Analysis

In this chapter, we describe the method of discriminant analysis, which is a model-based approach to classification. We discuss the main principle, where classification is based on the distance of a record from each of the class means. We explain the underlying measure of “statistical distance,” which takes into account the correlation between predictors. A discriminant analysis procedure produces estimated “classification functions,” which are then used to compute classification scores that can be translated into classifications or propensities (probabilities of class membership). One can also integrate misclassification costs directly into the discriminant analysis setup, and we explain how this is achieved. Finally, we discuss the underlying model assumptions, the practical robustness to some assumption violations, and the advantages of discriminant analysis when the assumptions are reasonably met (e.g., a small training sample can be sufficient).
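To make the idea of classifying by distance from the class means concrete, here is a minimal sketch. It assumes that “statistical distance” refers to the squared Mahalanobis distance computed with a covariance matrix shared by the classes. The class means, the covariance matrix, the record, and the helper name `statistical_distance` are all made up for illustration and do not come from the chapter.

```python
import numpy as np

def statistical_distance(x, mean, cov):
    """Squared Mahalanobis distance of record x from a class mean (illustrative helper)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov @ z = diff instead of inverting cov explicitly.
    return float(diff @ np.linalg.solve(cov, diff))

# Hypothetical summary statistics for two classes and two correlated predictors.
class_means = {"A": np.array([0.0, 0.0]), "B": np.array([3.0, 3.0])}
shared_cov = np.array([[1.0, 0.8],
                       [0.8, 1.0]])

# Hypothetical record to classify.
record = np.array([1.0, 1.0])

# Distance of the record from each class mean.
distances = {label: statistical_distance(record, mean, shared_cov)
             for label, mean in class_means.items()}

# Assign the record to the class whose mean is closest.
predicted = min(distances, key=distances.get)
print(distances, predicted)
```

Unlike Euclidean distance, this measure rescales the difference vector by the covariance matrix, so a deviation along the direction in which the predictors vary together counts for less than an equally large deviation against it.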

Python

In this chapter, we will use pandas for data handling and scikit-learn for the models. We will also make use of the utility functions from the Python Utilities Functions Appendix. Use the following import statements for the Python code in this chapter.

 import required functionality for this chapter

import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import ...
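As a rough illustration of how these libraries fit together, the sketch below fits a discriminant model to synthetic data and obtains classifications and propensities. It assumes that scikit-learn's `LinearDiscriminantAnalysis` class is the estimator of interest; the import list above is truncated, so that is a guess rather than something the chapter confirms. The data, column names, and class labels are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data with correlated predictors (made up for illustration).
rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
n = 100
class_a = rng.multivariate_normal([0.0, 0.0], cov, size=n)
class_b = rng.multivariate_normal([3.0, 3.0], cov, size=n)

X = pd.DataFrame(np.vstack([class_a, class_b]), columns=["x1", "x2"])
y = np.array(["A"] * n + ["B"] * n)

# Fit the discriminant model on the training records.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Classify a new record and obtain its propensities (class membership probabilities).
new_record = pd.DataFrame({"x1": [1.0], "x2": [1.0]})
predicted_class = lda.predict(new_record)
propensities = lda.predict_proba(new_record)

print(predicted_class)
print(pd.DataFrame(propensities, columns=lda.classes_))
```

The columns of `predict_proba` follow the order of `lda.classes_`, which is why the sketch labels them with that attribute before printing.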
