Chapter 5. Classification
Data scientists are often tasked with automating decisions for business problems. Is an email an attempt at phishing? Is a customer likely to churn? Is the web user likely to click on an advertisement? These are all classification problems, a form of supervised learning in which we first train a model on data where the outcome is known and then apply the model to data where the outcome is not known. Classification is perhaps the most important form of prediction: the goal is to predict whether a record is a 1 or a 0 (phishing/not-phishing, click/don’t click, churn/don’t churn), or in some cases, one of several categories (for example, Gmail’s filtering of your inbox into “primary,” “social,” “promotional,” or “forums”).
Often, we need more than a simple binary classification: we want to know the predicted probability that a case belongs to a class.
Rather than simply assigning a binary classification, most algorithms can return a probability score (propensity) of belonging to the class of interest. In fact, with logistic regression, the default output from R is on the log-odds scale and must be transformed to a propensity. In Python’s scikit-learn, logistic regression, like most classification methods, provides two prediction methods: predict (which returns the class) and predict_proba (which returns the probabilities for each class). A sliding cutoff can then be used to convert the propensity score to a decision. The general approach is ...
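A minimal sketch of this distinction in scikit-learn, using synthetic data; the dataset and the cutoff of 0.3 are illustrative assumptions, not values from the text:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-outcome data (illustrative only)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# predict returns hard 0/1 class labels (implicit 0.5 cutoff)
classes = model.predict(X)

# predict_proba returns a probability per class; column 1 is the
# propensity of belonging to the class of interest (class 1)
propensities = model.predict_proba(X)[:, 1]

# A sliding cutoff converts propensities to decisions; lowering it
# below 0.5 flags more records as 1s
cutoff = 0.3
decisions = (propensities >= cutoff).astype(int)
```

Note that `predict_proba` already returns probabilities, not log-odds: scikit-learn applies the logistic transform 1/(1 + exp(-log_odds)) internally, whereas in R the analogous transformation of the default output must be done explicitly.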