11 Multiple Linear Regression

To this point, we have been concerned with only one predictor variable—salary in the baseball example, and cotton dust exposure in the pulmonary function example. Let us now turn to the more common situation where there are multiple independent or predictor variables. There are a variety of statistical and machine learning techniques that can model relationships between predictor variables and an outcome variable; we will focus on the oldest: multiple linear regression.

After completing this chapter, you should be able to:

  • Fit a multiple linear regression model
  • Test the statistical significance of coefficients
  • Establish confidence limits around coefficients
  • Distinguish traditional explanatory purposes from predictive modeling purposes, and identify the model performance metrics appropriate in each case
  • Explain the role of training and holdout datasets
  • Implement a predictive regression model and evaluate it using holdout data
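
As a preview of the first and last objectives, here is a minimal sketch of fitting a multiple linear regression on training data and scoring it on holdout data. Everything in it is an assumption made for illustration: the data are synthetic, the helper names (solve, fit, predict, rmse) are hypothetical, and plain Python is used so that the arithmetic is visible. In practice a statistics library would do this work, and it would also report the significance tests and confidence limits that this sketch omits.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(rows, y):
    """Return least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend a constant column for the intercept
    p = len(X[0])
    xtx = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)


def predict(coefs, rows):
    """Predicted outcome for each row of predictor values."""
    return [coefs[0] + sum(b * v for b, v in zip(coefs[1:], r)) for r in rows]


def rmse(actual, predicted):
    """Root mean squared error, a common predictive performance metric."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5


# Synthetic data (assumed for illustration): y = 3 + 2*x1 - 1.5*x2 + noise
rng = random.Random(0)
rows = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
y = [3 + 2 * x1 - 1.5 * x2 + rng.gauss(0, 1) for x1, x2 in rows]

# Fit on the first 150 observations; hold out the remaining 50 for evaluation.
train_rows, holdout_rows = rows[:150], rows[150:]
train_y, holdout_y = y[:150], y[150:]

coefs = fit(train_rows, train_y)
holdout_rmse = rmse(holdout_y, predict(coefs, holdout_rows))

print("coefficients (intercept, x1, x2):", [round(c, 2) for c in coefs])
print("holdout RMSE:", round(holdout_rmse, 2))
```

The estimated coefficients should land close to the values used to generate the data, and the holdout RMSE should be close to the standard deviation of the noise (1 in this sketch), because the model was scored on observations it never saw during fitting.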

11.1 Terminology

First, to minimize confusion, we will pause to review terminology. Different disciplines (statistics, computer science, IT) use different terms to refer to the variables in a regression. Figure 11.1 gives a summary.

There can also be different terms for the unit of observation, whether it is a patient in a health study, a customer, a web visitor, an insurance claim, or a tax return. The following terms all refer to that unit of observation and mean essentially the same thing:

  • Subject
  • Case
  • Observation
  • Example ...
