11 Multiple Linear Regression

Up to this point, we have considered only one predictor variable: salary in the baseball example, and cotton dust exposure in the pulmonary function example. We now turn to the more common situation in which there are multiple independent, or predictor, variables. A variety of statistical and machine learning techniques can model relationships between predictor variables and an outcome variable; we will focus on the oldest, multiple linear regression.

After completing this chapter, you should be able to:

Fit a multiple linear regression model
Test the statistical significance of coefficients
Establish confidence limits around coefficients
Distinguish traditional explanatory purposes from predictive modeling purposes, and identify the model performance metrics appropriate in each case
Explain the role of training and holdout datasets
Implement a predictive regression model and evaluate it using holdout data
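
As a rough preview of the last two objectives, the sketch below fits a multiple linear regression on a training set and evaluates it on holdout data. It is a minimal illustration under stated assumptions, not the book's own code: the data are synthetic, the helper names (fit_ols, predict, rmse) are hypothetical, and NumPy's least-squares solver stands in for whatever regression library is used later in the chapter.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients [intercept, b1, ..., bp]."""
    # Prepend a column of ones so the first coefficient is the intercept.
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to a matrix of predictor values."""
    return coef[0] + X @ coef[1:]


def rmse(y_true, y_pred):
    """Root mean squared error, a common predictive performance metric."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic data (an assumption for illustration only): two predictors,
# with true intercept 3.0, true slopes 2.0 and -1.0, plus random noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Randomly assign 150 records to training and the remaining 50 to holdout.
idx = rng.permutation(n)
train, holdout = idx[:150], idx[150:]

# Fit on the training records only, then score on the unseen holdout records.
coef = fit_ols(X[train], y[train])
holdout_rmse = rmse(y[holdout], predict(coef, X[holdout]))

print("coefficients:", np.round(coef, 2))
print("holdout RMSE:", round(holdout_rmse, 3))
```

The holdout records play no part in fitting, so the holdout error gives a less optimistic estimate of how the model would perform on new data than the error on the training records would.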

11.1 Terminology

First, to minimize confusion, we will pause to review terminology. Different disciplines (statistics, computer science, IT) use different terms to refer to the variables in a regression. Figure 11.1 gives a summary.

There are also different terms for the unit of observation, whether it is a patient in a health study, a customer, a web visitor, an insurance claim, or a tax return. The following terms all refer to that unit of observation and mean essentially the same thing:

Subject
Case
Observation
Example ...

Statistics for Data Science and Analytics by Peter C. Bruce, Peter Gedeck, Janet Dobbins
