# Chapter 4. Correlation and Regression

Have you heard that ice cream consumption is linked to shark attacks? Apparently Jaws has a lethal appetite for mint chocolate chip. FigureÂ 4-1 visualizes this proposed relationship.

âNot so,â you may retort. âThis does not necessarily mean that shark attacks are triggered by ice cream consumption.â

âIt could be,â you reason, âthat as the outside temperature increases, more ice cream is consumed. People also spend more time near the ocean when the weather is warm, and that coincidence leads to more shark attacks.â

# âCorrelation Does Not Imply Causationâ

Youâve likely heard repeatedly that âcorrelation does not imply causation.â

In ChapterÂ 3, you learned that causation is a fraught expression in statistics. We really only reject the null hypothesis because we simply donât have all the data to claim causality for sure. That semantic difference aside, does correlation have anything to do with causation? The standard expression somewhat oversimplifies their relationship; youâll see why in this chapter using the tools of inferential statistics you picked up earlier.

This will be our final chapter conducted primarily in Excel. After that, you will have sufficiently grasped the framework of analytics to be ready to be move into R and Python.

# Introducing Correlation

So far, weâve mostly been analyzing statistics one variable at a time: weâve found the average reading score or the variance in housing prices, for example. This is known as univariate analysis.

Weâve also done a bit of bivariate analysis. For example, we compared the frequencies of two categorical variables using a two-way frequency table. We also analyzed a continuous variable when grouped by multiple levels of a categorical variable, finding descriptive statistics for each group.

We will now calculate a bivariate measure of two continuous variables using correlation. More specifically, we will use the Pearson correlation coefficient to measure the strength of the linear relationship between two variables. Without a linear relationship, the Pearson correlation is unsuitable.

So, how do we know our data is linear? There are more rigorous ways to check, but, as usual, a visualization is a great start. In particular, we will use a scatterplot to depict all observations based on their x and y coordinates.

If it appears a line could be drawn through the scatterplot that summarizes the overall pattern, then itâs a linear relationship and the Pearson correlation can be used. If you would need a curve or some other shape to summarize the pattern, then the opposite holds. FigureÂ 4-2 depicts one linear and two nonlinear relationships.

In particular, FigureÂ 4-2 gives an example of a positive linear relationship: as values on the x-axis increase, so do the values on the y-axis (at a linear rate).

Itâs also possible to have a negative correlation, where a negative line summarizes the relationship, or no correlation at all, in which a flat line summarizes it. These different types of linear relationships are shown in FigureÂ 4-3. Remember, these all must be linear relationships for correlation to apply.

Once weâve established that the data is linear, we can find the correlation coefficient. It always takes a value between â1 and 1, with â1 indicating a perfect negative linear relationship, 1 a perfect positive linear relationship, and 0 no linear relationship at all. TableÂ 4-1 shows some rules of thumb for evaluating the strength of a correlation coefficient. These are not official standards by any means, but a useful jumping-off point for interpretation.

Table 4-1. Interpretation of correlation coefficients
Correlation coefficient Interpretation

â1.0

Perfect negative linear relationship

â0.7

Strong negative relationship

â0.5

Moderate negative relationship

â0.3

Weak negative relationship

0

No linear relationship

+0.3

Weak positive relationship

+0.5

Moderate positive relationship

+0.7

Strong positive relationship

+1.0

Perfect positive linear relationship

With the basic conceptual framework for correlations in mind, letâs do some analysis in Excel. We will be using a vehicle mileage dataset; you can find mpg.xlsx in the mpg subfolder of the book repositoryâs datasets folder. This is a new dataset, so take some time to learn about it: what types of variables are we working with? Summarize and visualize them using the tools covered in ChapterÂ 1. To help with subsequent analysis, donât forget to add an index column and convert the dataset into a table, which I will call mpg.

Excel includes the `CORREL()` function to calculate the correlation coefficient between two arrays:

`CORREL(array1, array2)`

Letâs use this function to find the correlation between `weight` and `mpg` in our dataset:

`=CORREL(mpg[weight], mpg[mpg])`

This indeed returns a value between â1 and 1: â0.832. (Do you remember how to interpret this?)

A correlation matrix presents the correlations across all pairs of variables. Letâs build one using the Data Analysis ToolPak. From the ribbon, head to Data â Data Analysis â Correlation.

Remember that this is a measure of linear relationship between two continuous variables, so we should exclude categorical variables like origin and be judicious about including discrete variables like cylinders or model.year. The ToolPak insists on all variables being in a contiguous range, so I will cautiously include cylinders. FigureÂ 4-4 shows what the ToolPak source menu should look like.

This results in a correlation matrix as shown in FigureÂ 4-5.

We can see the â0.83 in cell `B6`: itâs the intersection of weight and mpg. We would also see the same value in cell `F2`, but Excel left this half of the matrix blank, as itâs redundant information. All values along the diagonal are 1, as any variable is perfectly correlated with itself.

###### Warning

The Pearson correlation coefficient is only a suitable measure when the relationship between the two variables is linear.

Weâve made a big leap in our assumptions about our variables by analyzing their correlations. Can you think of what that is? We assumed their relationship is linear. Letâs check that assumption with scatterplots. Unfortunately, there is no way in basic Excel to generate scatterplots of each pair of variables at once. For practice, consider plotting them all, but letâs try it with the weight and mpg variables. Highlight this data, then head to the ribbon and click Insert â Scatter.

I will add a custom chart title and relabel the axes to aid in interpretation. To change the chart title, double-click on it. To relabel the axes, click the perimeter of the chart and then select the plus sign that appears to expand the Chart Elements menu. (On Mac, click inside the chart and then Chart Design â Add Chart Element.) Select Axis Titles from the menu. FigureÂ 4-6 shows the resulting scatterplot. Itâs not a bad idea to include units of measurement on the axes to help outsiders make sense of the data.

FigureÂ 4-6 looks basically like a negative linear relationship, with a greater spread given lower weights and higher mileages. By default, Excel plotted the first variable in our data selection along the x-axis and the second along the y-axis. But why not the other way around? Try switching the order of these columns in your worksheet so that weight is in column `E` and mpg in column `F`, then insert a new scatterplot.

FigureÂ 4-7 shows a mirror image of the relationship. Excel is a great tool, but as with any tool, you have to tell it what to do. Excel will calculate correlations regardless of whether the relationship is linear. It will also make you a scatterplot without concern for which variable should go on which axis.

So, which scatterplot is âright?â Does it matter? By convention, the independent variable goes on the x-axis, and dependent on the y-axis. Take a moment to consider which is which. If youâre not sure, remember that the independent variable is generally the one measured first.

Our independent variable is weight because it was determined by the design and building of the car. mpg is the dependent variable because we assume itâs affected by the carâs weight. This puts weight on the x-axis and mpg on the y-axis.

In business analytics, it is uncommon to have collected data just for the sake of statistical analysis; for example, the cars in our mpg dataset were built to generate revenue, not for a research study on the impact of weight on mileage. Because there are not always clear independent and dependent variables, we need to be all the more aware of what these variables are measuring, and how they are measured. This is why having some knowledge of the domain you are studying, or at least descriptions of your variables and how your observations were collected, is so valuable.

# From Correlation to Regression

Though itâs conventional to place the independent variable on the x-axis, it makes no difference to the related correlation coefficient. However, there is a big caveat here, and it relates to the earlier idea of using a line to summarize the relationship found by the scatterplot. This practice begins to diverge from correlation, and itâs one you may have heard of: linear regression.

Correlation is agnostic to which variable you call independent and which you call dependent; that doesnât factor into its definition as âthe extent to which two variables move together linearly.â

On the other hand, linear regression is inherently affected by this relationship as âthe estimated impact of a unit change of the independent variable X on the dependent variable Y.â

You are going to see that the line we fit through our scatterplot can be expressed as an equation; unlike the correlation coefficient, this equation depends on how we define our independent and dependent variables.

Like correlation, linear regression assumes that a linear relationship exists between the two variables. Other assumptions do exist and are important to consider when modeling data. For example, we do not want to have extreme observations that may disproportionately affect the overall trend of the linear relationship.

For the purposes of our demonstration, we will overlook this and other assumptions for now. These assumptions are often difficult to test using Excel; your knowledge of statistical programming will serve you well when looking into the deeper points of linear regression.

Take a deep breath; itâs time for another equation:

##### Equation 4-1. The equation for linear regression
$upper Y equals beta 0 plus beta 1 times upper X plus epsilon$

The goal of Equation 4-1 is to predict our dependent variable Y. Thatâs the left side. You may remember from school that a line can be broken into its intercept and slope. Thatâs where $beta 0$ and $beta 1 times upper X Subscript i$, respectively, come in. In the second term, we multiply our independent variable by a slope coefficient.

Finally, there will be a part of the relationship between our independent and dependent variable that can be explained not by the model per se but by some external influence. This is known as the modelâs error and is indicated by $epsilon Subscript i$.

Earlier we used the independent samples t-test to examine a significant difference in means between two groups. Here, we are measuring the linear influence of one continuous variable on another. We will do this by examining whether the slope of the fit regression line is statistically different than zero. That means our hypothesis test will work something like this:

H0: There is no linear influence of our independent variable on our dependent variable. (The slope of the regression line equals zero.)

Ha: There is a linear influence of our independent variable on our dependent variable. (The slope of the regression line does not equal zero.)

FigureÂ 4-8 shows some examples of what significant and insignificant slopes might look like.

Remember, we donât have all the data, so we donât know what the âtrueâ slope would be for the population. Instead, we are inferring whether, given our sample, this slope would be statistically different from zero. We can use the same p-value methodology to estimate the slopeâs significance that we did to find the difference in means of two groups. We will continue to conduct two-tailed tests at the 95% confidence interval. Letâs jump into finding the results using Excel.

# Linear Regression in Excel

In this demo of linear regression on the mpg dataset in Excel, we test whether a carâs weight (weight) has a significant influence on its mileage (mpg). That means our hypotheses will be:

H0: There is no linear influence of weight on mileage.

Ha: There is a linear influence of weight on mileage.

Before getting started, itâs a good idea to write out the regression equation using the specific variables of interest, which Iâve done in Equation 4-2:

##### Equation 4-2. Our regression equation for estimating mileage
$m p g equals beta 0 plus beta 1 times w e i g h t plus epsilon$

Letâs start with visualizing the results of the regression: we already have the scatterplot from FigureÂ 4-6, now itâs just a matter of overlaying or âfittingâ the regression line onto it. Click on the perimeter of the plot to launch the âChart Elementsâ menu. Click on âTrendline,â then âMore Optionsâ to the side. Click the radio button at the bottom of the âFormat Trendlineâ screen reading âDisplay Equation on chart.â

Now letâs click on the resulting equation on the graph to add bold formatting and increase its font size to 14. Weâll make the trendline solid black and give it a 2.5-point width by clicking on it in the graph, then going to the paint bucket icon at the top of the Format Trendline menu. We now have the making of linear regression. Our scatterplot with trendline looks like FigureÂ 4-9. Excel also includes the regression equation we are looking for from Equation 4-2 to estimate a carâs mileage based on its weight.

We can place the intercept before the slope in our equation to get Equation 4-3.

##### Equation 4-3. Equation 4-3. Our fit regression equation for estimating mileage
$m p g equals 46.217 minus 0.0076 times w e i g h t$

Notice that Excel does not include the error term as part of the regression equation. Now that weâve fit the regression line, weâve quantified the difference between what values we expect from the equation and what values are found in the data. This difference is known as the residual, and weâll come back to it later in this chapter. First, weâll get back to what we set out to do: establish statistical significance.

Itâs great that Excel fit the line for us and gave us the resulting equation. But this does not give us enough information to conduct the hypothesis test: we still donât know whether the lineâs slope is statistically different than zero. To get this information, we will again use the Analysis ToolPak. From the ribbon, go to Data â Data Analysis â Regression. Youâll be asked to select your Y and X ranges; these are your dependent and independent variables, respectively. Make sure to indicate that your inputs include labels, as shown in FigureÂ 4-10.

This results in quite a lot of information, which is shown in FigureÂ 4-11. Letâs step through it.

Ignore the first section in cells `A3:B8` for now; we will return to it later. Our second section in `A10:F14` is labeled ANOVA (short for analysis of variance). This tells us whether our regression performs significantly better with the coefficient of the slope included versus one with just the intercept.

TableÂ 4-2 spells out what the competing equations are here.

Table 4-2. Intercept-only versus full regression model
Incercept-only model Model with coefficients

mpg = 46.217

mpg = 46.217 â 0.0076 Ã weight

A statistically significant result indicates that our coefficients do improve the model. We can determine the results of the test from the p-value found in cell `F12` of FigureÂ 4-11. Remember, this is scientific notation, so read the p-value as 6.01 times 10 to the power of â102: much smaller than 0.05. We can conclude that weight is worth keeping as a coefficient in the regression model.

That brings us to the third section in cells `A16:I18`; here is where we find what we were originally looking for. This range contains a lot of information, so letâs go column by column starting with the coefficients in cells `B17:B18`. These should look familiar as the intercept and slope of the line that were given in Equation 4-3.

Next, the standard error in `C17:C18`. We talked about this in ChapterÂ 3: itâs a measure of variability across repeated samples and in this case can be thought of as a measure of our coefficientsâ precision.

We then have what Excel calls the ât Stat,â otherwise known as the t-statistic or test statistic, in `D17:D18`; this can be derived by dividing the coefficient by the standard error. We can compare it to our critical value of 1.96 to establish statistical significance at 95% confidence.

Itâs more common, however, to interpret and report on the p-value, which gives the same information. We have two p-values to interpret. First, the interceptâs coefficient in `E17`. This tells us whether the intercept is significantly different than zero. The significance of the intercept is not part of our hypothesis test, so this information is irrelevant. (This is another good example of why we canât always take Excelâs output at face value.)

###### Warning

While most statistical packages (including Excel) report the p-value of the intercept, itâs usually not relevant information.

Instead, we want the p-value of weight in cell `E18`: this is related to the lineâs slope. The p-value is well under 0.05, so we reject the null and conclude that weight does likely influence mileage. In other words, the lineâs slope is significantly different than zero. Just like with our earlier hypothesis tests, we will shy away from concluding that weâve âprovenâ a relationship, or that more weight causes lower mileage. Again, we are making inferences about a population based on a sample, so uncertainty is inherent.

The output also gives us the 95% confidence interval for our intercept and slope in cells `F17:I18`. By default, this is stated twice: had we asked for a different confidence interval in the input menu, weâd have received both here.

Now that youâre getting the hang of interpreting the regression output, letâs try making a point estimate based on the equation line: what would we expect the mileage to be for a car weighing 3,021 pounds? Letâs plug it into our regression equation in Equation 4-4:

##### Equation 4-4. Equation 4-4. Making a point estimate based on our equation
$m p g equals 46.217 minus 0.0076 times 3021$

Based on Equation 4-4, we expect a car weighing 3,021 pounds to get 23.26 miles per gallon. Take a look at the source dataset: there is an observation weighing 3,021 pounds (Ford Maverick, row `101` in the dataset) and it gets 18 miles per gallon, not 23.26. What gives?

This discrepancy is the residual that was mentioned earlier: itâs the difference between the values we estimated in the regression equation and those that are found in the actual data. Iâve included this and some other observations in FigureÂ 4-12. The scatterpoints represent what values are actually found in the dataset, and the line represents what values we predicted with the regression.

It stands to reason that weâd be motivated to minimize the difference between these values. Excel and most regression applications use ordinary least squares (OLS) to do this. Our goal in OLS is to minimize residuals, specifically, the sum of squared residuals, so that both negative and positive residuals are measured equally. The lower the sum of squared residuals, the less of a difference there is between our actual and expected values, and the better our regression equation is at making estimates.

We learned from the p-value of our slope that there is a significant relationship between independent and dependent variables. But this does not tell us how much of the variability in our dependent variable is explained by our independent variable.

Remember that variability is at the heart of what we study as analysts; variables vary, and we want to study why they vary. Experiments let us do that, by understanding the relationship between an independent and dependent variable. But we wonât be able to explain everything about our dependent variable with our independent variable. There will always be some unexplained error.

R-squared, or the coefficient of determination (which Excel refers to as R-square), expresses as a percentage how much variability in the dependent variable is explained by our regression model. For example, an R-squared of 0.4 indicates that 40% of variability in Y can be explained by the model. This means that 1 minus R-squared is what variability canât be explained by the model. If R-squared is 0.4, then 60% of Yâs variability is unaccounted for.

Excel calculates R-squared for us in the first box of regression output; take a look back to cell `B5` in FigureÂ 4-11. The square root of R-squared is multiple R, which is also seen in cell `B4` of the output. Adjusted R-square (cell `B6`) is used as a more conservative estimate of R-squared for a model with multiple independent variables. This measure is of interest when conducting multiple linear regression, which is beyond the scope of this book.

There are other ways than R-squared to measure the performance of regression: Excel includes one of them, the standard error of the regression, in its output (cell `B7` in FigureÂ 4-11). This measure tells us the average distance that observed values deviate from the regression line. Some analysts prefer this or other measures to R-squared for evaluating regression models, although R-squared remains a dominant choice. Regardless of preferences, the best evaluation often comes from evaluating multiple figures in their proper context, so thereâs no need to swear by or swear off any one measure.

Congratulations: you conducted and interpreted a complete regression analysis.

# Rethinking Our Results: Spurious Relationships

Based on their temporal ordering and our own logic, itâs nearly absolute in our mileage example that weight should be the independent variable and mpg the dependent. But what happens if we fit the regression line with these variables reversed? Go ahead and give it a try using the ToolPak. The resulting regression equation is shown in Equation 4-5.

##### Equation 4-5. Equation 4-5. A regression equation to estimate weight based on mileage
$w e i g h t equals 5101.1 minus 90.571 times m p g$

We can flip our independent and dependent variables and get the same correlation coefficient. But when we change them for regression, our coefficients change.

Were we to find out that mpg and weight were both influenced simultaneously by some outside variable, then neither of these models would be correct. And this is the same scenario that weâre faced with in ice cream consumption and shark attacks. Itâs silly to say that ice cream consumption has any influence on shark attacks, because both of these are influenced by temperature, as FigureÂ 4-13 depicts.

This is called a spurious relationship. Itâs frequently found in data, and it may not be as obvious as this example. Having some domain knowledge of the data you are studying can be invaluable for detecting spurious relationships.

###### Warning

Variables can be correlated; there could even be evidence of a causal relationship. But the relationship might be driven by some variable youâve not even accounted for.

# Conclusion

Remember this old phrase?

Correlation doesnât imply causation.

Analytics is highly incremental: we usually layer one concept on top of the next to build increasingly complex analyses. For example, weâll always start with descriptive statistics of the sample before attemping to infer parameters of the population. While correlation may not imply causation, causation is built on the foundations of correlation. That means a better way to summarize the relationship might be:

Correlation is a necessary but not sufficient condition for causation.

Weâve just scratched the surface of inferential statistics in this and previous chapters. A whole world of tests exists, but all of them use the same framework of hypothesis testing that weâve used here. Get this process down, and youâll be able to test for all sorts of different data relationships.

I hope youâve seen and agree that Excel is a fantastic tool for learning statistics and analytics. You got a hands-on look at the statistical principles that power much of this work, and learned how to explore and test for relationships in real datasets.

That said, Excel can have diminishing returns when it comes to more advanced analytics. For example, weâve been checking for properties like normality and linearity using visualizations; this is a good start, but there are more robust ways to test them (often, in fact, using statistical inference). These techniques often rely on matrix algebra and other computationally intensive operations that can be tedious to derive in Excel. While add-ins are available to make up for these shortcomings, they can be expensive and lack particular features. On the other hand, as open source tools R and Python are free, and include many app-like features called packages that serve nearly any use case. This environment will allow you to focus on the conceptual analysis of your data rather than raw computation, but you will need to learn how to program. These tools, and the analytics toolkit in general, will be the focus of ChapterÂ 5.

# Exercises

Practice your correlation and regression chops by analyzing the ais dataset found in the book repositoryâs datasets folder. This dataset includes height, weight, and blood readings from male and female Australian athletes of different sports.

With the dataset, try the following:

1. Produce a correlation matrix of the relevant variables in this dataset.

2. Visualize the relationship of ht and wt. Is this a linear relationship? If so, it is negative or positive?

3. Of ht and wt, which do you presume is the independent and dependent variable?

• Is there a significant influence of the independent variable on the dependent variable?

• What is the slope of your fit regression line?

• What percentage of the variance in the dependent variable is explained by the independent variable?

4. This dataset contains a variable for body mass index, bmi. If you are not familiar with this metric, take a moment to research how itâs calculated. Knowing this, would you want to analyze the relationship between ht and bmi? Donât hesitate to lean on common sense here rather than just statistical reasoning.

Get Advancing into Analytics now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.