Chapter 4. Regression and Prediction
Perhaps the most common goal in statistics is to answer the question “Is the variable X (or more likely, X1, ..., Xp) associated with a variable Y, and if so, what is the relationship and can we use it to predict Y?”
Nowhere is the nexus between statistics and data science stronger than in the realm of prediction—specifically, the prediction of an outcome (target) variable based on the values of other “predictor” variables. This process of training a model on data where the outcome is known, for subsequent application to data where the outcome is not known, is termed supervised learning. Another important connection between data science and statistics is in the area of anomaly detection, where regression diagnostics originally intended for data analysis and improving the regression model can be used to detect unusual records.
Simple Linear Regression
Simple linear regression provides a model of the relationship between the magnitude of one variable and that of a second: for example, as X increases, Y also increases, or as X increases, Y decreases. Correlation is another way to measure how two variables are related (see the section “Correlation”). The difference is that while correlation measures the strength of an association between two variables, regression quantifies the nature of the relationship.
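To make the distinction concrete, the following is a minimal sketch in Python using NumPy. The data values and the variable names (x, y, r, slope, intercept) are hypothetical and chosen only for illustration. The correlation coefficient is a single unitless number describing how strongly the two variables move together, while the least-squares fit yields a slope and an intercept that describe how much Y changes, on average, for each unit change in X.

```python
import numpy as np

# Hypothetical toy data (not from the text): y rises roughly
# two units for every one-unit increase in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Correlation: unitless strength of the linear association, in [-1, 1].
r = np.corrcoef(x, y)[0, 1]

# Regression: least-squares estimates of the slope and intercept
# of the line Y = intercept + slope * X.
x_dev = x - x.mean()
y_dev = y - y.mean()
slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept = y.mean() - slope * x.mean()

print(f"correlation r = {r:.3f}")
print(f"fitted line: Y = {intercept:.3f} + {slope:.3f} * X")
```

Rescaling Y (say, from dollars to thousands of dollars) would leave r unchanged but would rescale the slope, which is one way to see that the two quantities answer different questions.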