# Chapter 2. Behaviors, Causality, and Prediction

We’ve seen in chapter 1 that the core question addressed by this book is “what drives behavior?” and that regression will be our workhorse. However, simply running a linear regression and looking at the coefficient for our variable of interest is not enough, because regression can be subject to biases.

In this chapter, I’ll walk you through a simple example of a linear regression gone awry because of a hidden joint cause. Such variables, which we’ll call “confounders,” are the main obstacle to measuring what drives behavior. I’ll then introduce a new tool, causal diagrams, which allow us to identify and cancel confounders and which we’ll use again and again in the rest of the book.

# Confound it! The Hidden Dangers of Letting Regression Sort It Out

Earlier we saw the example of a laboratory experiment where cognitive load caused unhealthy snack choices. In that case, correlation was causation, and it was legitimate to interpret the regression coefficient as measuring the causal effect of cognitive load. In this section, we’re going to explore in more detail what happens when correlation is not causation and how that affects the interpretation of regression coefficients. We will identify several cases where attempting to answer a business question with a regression leads to improper conclusions. But worry not! In the following section I will give you tools that will enable you to draw the correct conclusions when we revisit these examples later in the chapter.

## Why Correlation Is Not Causation: a Confounder in Action

C-Mart has an ice-cream stand in each store. It is the company’s belief that the weather influences daily sales – or, to cast it in causality jargon, that the weather is a cause of sales. This belief is supported by a strong correlation in historical data between temperature and sales, as shown in figure 2.1 (the corresponding data and code are on the book’s GitHub).

The workhorse of data analysis is the linear regression. Running a regression of ice-cream sales on temperature takes two lines of code in R:

```r
## R
> model <- lm(icecream_sales ~ temps)
> summary(model)

Call:
lm(formula = icecream_sales ~ temps)

Residuals:
   Min     1Q Median     3Q    Max
-29703  -5102   -388   3583  37567

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -4526.304    457.869  -9.886   <2e-16 ***
temps        1147.145      7.916 144.923   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8696 on 2398 degrees of freedom
Multiple R-squared:  0.8975,	Adjusted R-squared:  0.8975
F-statistic: 2.1e+04 on 1 and 2398 DF,  p-value: < 2.2e-16
```
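Since the book works in both R and Python, here is a rough Python equivalent using numpy’s least-squares solver. The data are simulated stand-ins: the true coefficient of about 1,000 dollars per degree and the noise level are my illustrative assumptions, not the book’s dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2400

# Simulated stand-in for the book's data: sales rise roughly linearly
# with temperature (the ~1,000 $/degree coefficient is an assumption).
temps = rng.uniform(0, 40, n)
icecream_sales = 1000 * temps + rng.normal(0, 2000, n)

# Regress icecream_sales on temps with an intercept, as R's lm() does.
X = np.column_stack([np.ones(n), temps])
(intercept, slope), *_ = np.linalg.lstsq(X, icecream_sales, rcond=None)

print(f"intercept: {intercept:.1f}, slope: {slope:.1f}")
```

The fitted slope recovers the simulated coefficient, just as `lm()` recovers it from the book’s data.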

Now, let’s imagine that we’re at the end of a particularly warm week of October, and based on the predictions of the model, the company had increased the stock of the ice cream stands ahead of time. However, the weekly sales, while higher than usual for this week of October, have fallen quite short of the quantity predicted by the model. Oops. What happened? Should the data analyst be fired?

What happened is that the model doesn’t account for a crucial fact: most ice-cream sales take place in the summer months, when kids are out of school. The regression model made its best prediction with the data available, but part of the cause of ice-cream sales (summer break for students) was misattributed to temperature, because summer months are positively correlated with temperature. The increase in temperature in October did not suddenly make it summer break (sorry kids!), so we did not see sales as high as on most days at that temperature, which occur during the summer months.

In technical terms, the month of the year is a confounder of our relationship between temperature and sales. A confounder is a variable that introduces bias in a regression; when a confounder is present in the situation you’re analyzing, it means that interpreting the regression coefficient as causal will lead to improper conclusions.

###### Note

Confounder: a variable that biases the regression coefficient for the main variable when it’s not accounted for. A confounder obscures the true relationship between independent and dependent variables.

Let’s think of a place like Chicago, which has a continental climate: winter is very cold and summer is very hot. When comparing sales on a random hot day with sales on a random cold day without accounting for their respective month of the year, you’re very likely to actually be comparing sales on a hot day of summer, when kids are out, with sales on a cold day of winter, when kids are in school; this inflates the apparent relationship between temperature and sales.

In this example, we might also expect to see consistent under-prediction of sales in colder weather. In truth, there is a paradigm shift in the summer months, and when that shift has to be captured exclusively through temperature in a linear regression, the resulting predictions will invariably run too high at warmer temperatures and too low at colder ones.
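To make the bias concrete, here is a small simulation of the situation (all numbers are illustrative assumptions: roughly two summer months out of twelve, summer days about 12 degrees warmer, and summer break adding $15,000 of daily sales). Omitting the summer indicator inflates the temperature coefficient, exactly as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2400

# Assumed data-generating process: summer is correlated with temperature,
# and both summer and temperature cause sales.
summer = rng.binomial(1, 2 / 12, n)           # ~2 summer months out of 12
temps = rng.normal(15, 8, n) + 12 * summer    # summer adds ~12 degrees
sales = 500 * temps + 15000 * summer + rng.normal(0, 3000, n)

def ols(y, *cols):
    """Least-squares coefficients with an intercept (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols(sales, temps)             # omits the confounder
adjusted = ols(sales, temps, summer)  # accounts for it

print(f"naive temperature coefficient:    {naive[1]:.0f}")     # inflated above the true 500
print(f"adjusted temperature coefficient: {adjusted[1]:.0f}")  # close to the true 500
```

Controlling for the summer indicator removes the bias; this is the logic we’ll formalize with causal diagrams later in the chapter.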

## Too Many Variables Can Spoil the Broth

If a variable that was not included was a confounder, why not just throw every variable we have into the regression so that it can account for everything? This “everything and the kitchen sink” mindset still has proponents among statisticians. In *The Book of Why*, Judea Pearl and Dana Mackenzie mention that “a leading statistician even recently wrote, ‘to avoid conditioning on some observed covariates… is nonscientific ad hockery’.”1 It is also quite common among data scientists. To be fair, if your goal is only to predict a variable, your model is carefully designed to avoid overfitting, and you don’t care about why the predicted variable takes a certain value, then that’s a perfectly valid stance. But it does not work if your goal is to understand causal relationships so that you can act upon them. Because of this, just adding as many variables as you can to your model is not only inefficient, it can be downright counterproductive and misleading.

To demonstrate this, let’s add both Summer Month (a binary 1/0 value that indicates whether the month is July or August) and Iced Coffee Sales to our regression. The latter variable is obviously correlated with Temperature, and we’ll assume that it’s not correlated with Summer Month.

Now, let’s look at what happens to our regression if we add iced coffee sales:

```r
## R
> model <- lm(icecream_sales ~ iced_coffee_sales + temps + summer_months)
> summary(model)

Call:
lm(formula = icecream_sales ~ iced_coffee_sales + temps + summer_months)

Residuals:
     Min       1Q   Median       3Q      Max
-25763.1  -3044.9     41.3   3075.1  27518.1

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         -25.114    313.562  -0.080    0.936
iced_coffee_sales    -1.754      2.086  -0.841    0.401
temps              2755.418   2086.185   1.321    0.187
summer_months     19563.873    352.430  55.511   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5751 on 2396 degrees of freedom
Multiple R-squared:  0.9552,	Adjusted R-squared:  0.9552
F-statistic: 1.704e+04 on 3 and 2396 DF,  p-value: < 2.2e-16
```

In this particular simulation, we see that the coefficient for iced coffee sales is not statistically significant, and the coefficient for Temperature has shifted dramatically from our prior example. Trying to use this model to plan stock, marketing, or promotions based on weather projections would result in a vast oversensitivity to temperature, even though the R² value is much higher. In an extreme case, a model like this could erroneously lead to the removal of iced coffee from the menu in order to promote the more profitable ice-cream sales, only to find no additional cones sold. How is this possible?

The truth behind the data (which is knowable, since I made up the relationships and randomized the data around them) is that when it is hot out, people are more likely to buy iced coffee. On hot days, people are also more likely to buy more ice-cream. But a purchase of iced coffee itself does not make customers any more or less likely to buy ice-cream. Summer months are also not correlated with iced coffee purchases, since school children are not a significant factor in the demand for iced coffee (see the sidebar for the details of the math at hand).

Figure 2.2 shows a positive correlation between iced coffee sales and ice-cream sales, since both increase when it is warmer out; but any increase in iced coffee sales during summer months is explained by their shared dependence on temperature, while summer months have a strong impact only on ice-cream sales. When the regression algorithm tries to explain ice-cream sales using the three variables at hand, some of the explanatory power of temperature leaks between the two nearly collinear variables: the coefficient for temperature gets inflated, and the coefficient for iced coffee sales turns slightly negative to compensate. Even though the iced coffee coefficient is not statistically significant and is relatively small, dollars of sales are on a much larger scale than degrees of temperature, so ultimately iced coffee sales cancel out the inflation of the temperature coefficient.
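These mechanics can be reproduced in a short Python simulation. All numbers and the `ols_with_se` helper are my illustrative assumptions, not the book’s data; the point is that the standard error of the temperature coefficient explodes once the nearly collinear iced coffee variable enters the regression, just as it did in the R output above (7.9 vs. 2,086).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2400

summer = rng.binomial(1, 2 / 12, n)
temps = rng.normal(15, 8, n) + 12 * summer

# Assumed ground truth for this sketch: temperature drives both drinks,
# iced coffee has zero causal effect on ice cream, and summer does not
# affect iced coffee.
iced_coffee = 1000 * temps + rng.normal(0, 500, n)
icecream = 1000 * temps + 20000 * summer + rng.normal(0, 5000, n)

def ols_with_se(y, *cols):
    """Return OLS coefficients and their standard errors (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

b_small, se_small = ols_with_se(icecream, temps, summer)
b_big, se_big = ols_with_se(icecream, iced_coffee, temps, summer)

# The temperature coefficient's standard error blows up once the nearly
# collinear iced coffee variable joins the regression.
print(f"se(temps) without coffee: {se_small[1]:.0f}")
print(f"se(temps) with coffee:    {se_big[2]:.0f}")
```

Note that the coefficient for Summer Month remains well estimated in both models, because summer is not collinear with the other regressors.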

## Including the Wrong Variable Can Create Spurious Correlations

In the previous example, adding the iced coffee sales variable to the regression muddled the relationship between temperature and ice-cream sales. Unfortunately, the reverse can also be true: including the wrong variable in a regression can create the illusion of a relationship where there is none.

Sticking with our ice-cream example at C-Mart, let’s say that the category manager is interested in understanding customer tastes, so he asks an employee to stand outside the store and survey people walking by, asking them how much they like vanilla ice-cream and how much they like chocolate ice-cream, as well as whether they have ever purchased ice-cream from the stand. To keep things simple, we’ll assume that the stand only sells chocolate and vanilla ice-cream.

Let’s assume for the sake of the example that taste for vanilla ice-cream and taste for chocolate ice-cream are entirely uncorrelated. Some people like one but not the other, some people like both, etc. But both these tastes impact whether someone buys from the stand, a binary (Yes/No) variable.

Because the variable Shopped is binary, we would use a logistic regression if we wanted to measure the impact of either of the Taste variables on shopping behavior. Since the two Taste variables are uncorrelated, we would see a regular cloud with no apparent correlation if we were to plot them against each other; however, they each impact the probability of shopping at the ice-cream stand (Figure 1-28.3).

In the first graph, I added a line of best fit, which is almost perfectly flat, reflecting the lack of correlation between the variables (the correlation coefficient is equal to 0.004, reflecting sampling error). On the second and third graphs, we can see that tastes for vanilla and chocolate are higher on average for customers (Shopped = 1) than for non-customers, which makes sense.

So far, so good. Let’s say that once you get the survey data, your business partner tells you that they are considering introducing a coupon incentive for the ice-cream stand: when you purchase an ice-cream, you get a coupon for future visits. This loyalty incentive won’t affect respondents who have never shopped at the stand, so the relevant population is the people who have shopped there. The business partner is considering using flavor restrictions on the coupons to balance stock, but doesn’t know how much flavor choices can be influenced. If someone who purchased vanilla ice-cream were given a coupon for 50% off chocolate ice-cream, would it do anything beyond adding more paper to a landfill somewhere? How favorably do people who like vanilla ice-cream view chocolate ice-cream anyway?

You plot the same graph again, this time restricting the data to people who have answered “Yes” to the shopping question (figure 2.4).

There is now a strong negative correlation between the two variables (the correlation coefficient is equal to -0.39). What happened? Do vanilla lovers who come to your stand turn into chocolate haters, and vice versa? Of course not. This correlation was artificially created when you restricted the data to customers.

Let’s get back to our true causal relationships: the stronger someone’s taste for vanilla, the more likely they are to shop at your stand, and similarly for chocolate. This means that there is a cumulative effect of these two variables. If someone has a weak taste for both vanilla and chocolate ice-creams, they are very unlikely to shop at your stand; in other words, most of the people with a weak taste for vanilla among your customers have a strong taste for chocolate. On the other hand, if someone has a strong taste for vanilla, they might shop at your stand even if they don’t have a strong taste for chocolate. You can see it reflected in the graph above: for high values for vanilla (say above 15), there are data points with lower values for chocolate (below 15), whereas for low values of vanilla (below 5), the only data points in the graph have a high value for chocolate (above 17). No one’s preferences have changed, but people with weak taste for both vanilla and chocolate are excluded from your dataset.
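A minimal simulation makes the effect visible (the 0–20 taste scale and the logistic selection rule are illustrative assumptions, not the book’s survey data): two variables that are independent in the full population become negatively correlated once we keep only the people likely to have shopped.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10000

# Independent tastes on a 0-20 scale (illustrative units).
vanilla = rng.uniform(0, 20, n)
chocolate = rng.uniform(0, 20, n)

# Assumed selection rule: the stronger someone's combined taste, the
# likelier they are to have shopped at the stand.
p_shop = 1 / (1 + np.exp(-(vanilla + chocolate - 25) / 3))
shopped = rng.random(n) < p_shop

corr_all = np.corrcoef(vanilla, chocolate)[0, 1]
corr_shoppers = np.corrcoef(vanilla[shopped], chocolate[shopped])[0, 1]

print(f"correlation, everyone: {corr_all:.3f}")       # about zero
print(f"correlation, shoppers: {corr_shoppers:.3f}")  # clearly negative
```

No individual preference changed; the negative correlation comes entirely from who is left in the restricted sample.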

The technical name of this phenomenon is the Berkson paradox2, but Judea Pearl and Dana Mackenzie call it more intuitively the “explain away effect”. If one of your customers has a strong taste for vanilla, this “explains” completely why they are shopping at your stand and they don’t “need” to have a strong taste for chocolate. On the other hand, if one of your customers has a weak taste for vanilla, this can’t “explain” why they are shopping at your stand and they must have a stronger than average taste for chocolate.

The Berkson paradox is counterintuitive and hard to understand at first; we’ll explore it more in depth in the chapter on data collection and preparation. We’ll see that this can cause biases in your data depending on how it was collected, even before you start any analysis. A classic example of how this situation can create artificial correlations is that some diseases show a higher degree of correlation when looking at the population of hospital patients compared to the general population. In that case, the underlying intuition is that either disease is not enough for someone to go to a hospital; someone’s health status gets bad enough to justify hospitalization only when they are both present3.

I only introduced the Berkson paradox here to show that controlling for the wrong thing can create the appearance of correlation between variables when there is no true correlation in the population.

## A New Respect For Variable Selection

Hopefully these three examples have convinced you that you can’t just throw a bunch of variables into a linear or logistic regression and hope for the best, along the lines of “include them all, God will recognize His own”. You may still wonder, though, about other types of models and algorithms. Are Gradient Boosting or Deep Learning models somehow immune to confounders, multicollinearity, and spurious correlations? Unfortunately, the answer is no. If anything, the “black box” nature of these models means that confounders can be harder to catch.

One example of this sort of situation that generated a lot of buzz online was an article4 purporting to distinguish between “normal” and “criminal” faces; one criticism of the approach was that the “criminal” faces in the dataset were less smiling and more tense than the “normal” faces, and that the model was simply picking up on these cues. Here, Anger could be a confounder correlated both with criminal behaviors and with facial micro-expressions. The very nature of the model makes it difficult to determine whether this accusation is valid; we’ll revisit the question and see some potential solutions when discussing data sources and sampling.

In which cases then can you trust a regression to accurately reflect the causal impact of one variable on another? To answer this question, we need to leave aside regressions for a moment and dig deeper instead into causality and how to represent it. In the next section, I will therefore introduce a tool to do that, causal diagrams.

# Causal Diagrams to The Rescue

Causal diagrams (CDs) are visual representations of causal relationships. When used correctly, they allow you to bypass the issues we ran into when taking a naive approach to regression in the hope of estimating causal effects. In this section, I will explain how to read and draw causal diagrams and the different types of relationships that can be documented in CDs.

Causal diagrams have two fundamental building blocks:

• Boxes, which represent variables

• Arrows going from one box to another, which indicate causal relationships. An arrow going from box A to box B indicates that A causes B.

Going back to our C-Mart ice-cream sales example, we recall that an increase (or decrease) in temperature causes an increase (or decrease) in iced coffee sales. This can be simplified to the statement that temperature causes iced coffee sales. Figure 2.5 shows the corresponding causal diagram.

Each rectangle represents a variable we can observe (a variable we have in our dataset), and the arrow between them represents the existence and direction of a causal relationship. Here, the arrow between Temperature and Iced coffee sales indicates that temperature is a causal factor of iced coffee sales.

Sometimes however, we won’t be able to observe a variable but we might still want to show it in a causal diagram. In that case, we’ll represent it with an oval.

In figure 2.6, Customers’ sweet tooth is a cause of Iced coffee sales, meaning that customers with a stronger sweet tooth buy more iced coffee. However, we can’t observe how much of a sweet tooth a customer has. We’ll discuss later the importance of unobserved confounders and more generally unobserved variables in causal analysis. For the time being, we’ll treat unobserved variables in causal diagrams as if they were observable and simply represent their unobservability with an oval box.

## Understanding Causal Diagrams

Depending on who you ask, CDs can mean a lot of things; they can be a purely qualitative tool for a discussion of causality, or they can be used as a modeling tool in its own right for statistics (in that case they’re called “probabilistic graphical models”). In this book, we’ll treat CDs as models that link data to the real world (figure 2.7).

There are two loops in this representation, one connecting reality to the causal diagram and one connecting the causal diagram to data. Switching from one perspective to another will require you to do some mental gymnastics at first, akin to that drawing that can be perceived either as a duck or as a rabbit5, but by the end of this book it should be pretty much effortless. And it will pay big dividends by giving you the ability to analyze complex situations effectively and confidently.

### Causal Diagrams Represent Our View of Reality

The first way of looking at causal diagrams is to treat them as representations of causal relationships in reality as we see them (Figure 2.8). From this perspective, the elements of CDs represent real “things” that exist and have effects on each other. An analogy from the physical sciences would be a magnet, a bar of iron, and the magnetic field around the magnet. You can’t see the magnetic field but it exists nonetheless, and it affects the iron bar. You may not have any data on the magnetic field and you may never have seen the equations describing it, but you can sense it as you move the bar and you can develop intuitions as to what it does.

The same perspective applies when we want to understand what drives behaviors. We intuitively understand that human beings have habits, preferences and emotions, and we treat these as causes even though we often don’t have any numeric data about them. When we say “Joe bought peanuts because he was hungry”, we are relying on our knowledge, experience and beliefs about humans in general and Joe in particular.

Here, we’re making a causal statement about reality; we’re saying that had Joe not been hungry he would not have bought peanuts. Because we’re talking about one specific event, we can’t use data to understand it, and we can never be certain about what would have happened if Joe had not been hungry. Therefore, our statement is really just an intuition or an opinion. But that doesn’t mean that we can’t or shouldn’t draw the conclusion we did. Common sense and expertise are subject to a variety of cognitive biases, but more often than not they can still be useful, especially in complex situations where data are missing or it’s not clear which data would be relevant.

However, using CDs to represent intuitions and beliefs about the world introduces subjectivity, and that’s okay. CDs are tools for thinking and analysis; they don’t have to be “true”. You and I might have different ideas as to why Joe bought peanuts, which means we would draw different CDs. Even if we fully agreed on what causes what, we couldn’t represent every variable and every relationship in one diagram; there is judgment involved in determining which variables and relationships to include or exclude. In some cases, data will help: we’ll be able to reject a CD because the data at hand are incompatible with it. But in other cases, radically different CDs will be equally compatible with the data and we won’t be able to choose between them, especially if we don’t have experimental data.

This subjectivity might look like a (possibly fatal) flaw of CDs, but it’s actually a feature, not a bug. Our world is uncertain and CDs are just reflecting that uncertainty, not creating it. If there are several possible interpretations of the situation at hand that appear equally valid, you should make it explicit. The alternative would be to let people have different mental models in their head and each believe that they know the truth. At least, putting the uncertainty in the open will allow a principled discussion and guide your analysis.

### Causal Diagrams Represent Data

Now that you’ve seen the duck in the picture, let’s look at the rabbit. In this second perspective, we’ll assume that CDs represent data (Figure 2.9), and that arrows represent linear relationships between variables. This means we’ll be able to use our data to reject certain CDs, and conversely to use our CDs to guide our analysis of data.

From this perspective, the causal diagram from figure 2.5 connecting temperature to iced coffee sales would mean that

IcedCoffeeSales = β * Temperature + ϵ

This linear regression means that if temperature were to increase by one degree, “keeping everything else equal”, then sales of iced coffee would increase by β dollars. Each box in the causal diagram represents a column of data, as with the simulated data in table 2.1.

Table 2.1. Simulated data illustrating the relationship in our causal diagram

| Date | Temperature | Iced Coffee Sales | β · Temperature | ε = IcedCoffeeSales – β · Temperature |
| --- | --- | --- | --- | --- |
| 6/1/2019 | 71 | \$70,945 | \$71,000 | -\$55 |
| 6/2/2019 | 57 | \$56,969 | \$57,000 | -\$31 |
| 6/3/2019 | 79 | \$78,651 | \$79,000 | -\$349 |

Translating the causal diagram that also includes Summer Month into mathematical terms yields the following equation:

IceCreamSales = βT · Temperature + βS · SummerMonth + ε

Obviously, this equation is a standard multiple linear regression, but the fact that it is based on a CD changes its interpretation. Outside of a causal framework, the only conclusion we would be able to draw from it is “an increase of one degree of temperature is associated with an increase of βT dollars in ice-cream sales”. Because correlation is not causation, it would be illegitimate to infer anything further. On the other hand, based on our CD, we can now say “assuming that the causal relationships represented in our CD are correct, then an increase of one degree of temperature will cause an increase of βT dollars in ice-cream sales”, which is what the business cares about.

Because data analysts tend to be more comfortable with quantitative approaches, I wouldn’t be surprised if this approach makes more sense to you and you’re tempted to try to avoid the qualitative side entirely. Couldn’t you build CDs based only on observed correlations in data without making any judgment call? Unfortunately, no. As in the optical illusion, neither of these two perspectives is “right” or “wrong”—CDs can be thought of as qualitative representations of the causal relationships we believe to exist in the world and they can be treated as an organizing tool for your data. The key to reaping the most benefits from your CDs is to go back and forth between the two perspectives and not stick only with one. This will allow you to check your intuitions against the data, while also ensuring that you’re interpreting the data correctly.

## Fundamental Structures of Causal Diagrams

Causal diagrams can take a bewildering variety of shapes. Fortunately, researchers have been working on causality for a while now, and they have brought some order to it:

• There exist only three fundamental structures, and all causal diagrams can be represented as combinations of them: chains, forks, and colliders.

• By looking at CDs as if they were family trees, we can easily describe relationships between variables that are far away from each other in the diagram, for example by saying that one is the “descendant” or the “child” of another.

And really, that’s all there is to it! Once you have familiarized yourself with these fundamental structures and how to name relationships between variables, you’ll be able to fully describe any CD you work with.

### Chains

A chain is a causal diagram with three boxes, representing three variables, and two arrows connecting these boxes, as in Figure 2.11.

What makes this CD a chain is that the two arrows are going “in the same direction”, i.e. the first arrow goes from one box to another, and the second arrow goes from that second box to the last one. This CD is an expansion of the one in figure 2.1. It represents the fact that temperature causes sales of iced coffee, which in turn cause sales of donuts.

Let’s define a few terms that will allow us to characterize the relationships between variables. In this diagram, Temperature is called the parent of Iced coffee sales, and Iced coffee sales is a child of Temperature. But Iced coffee sales is also a parent of Donuts sales, which is its child. When a variable has a parent/child relationship with another variable we call that a direct relationship. When there are intermediary variables between them, we call that an indirect relationship. The actual count of variables that makes a relationship indirect is not generally important, so you don’t have to count the number of boxes to describe the fundamental structure of the relationship between them.

In family terms, we say that a variable is the ancestor of another variable if the first variable is the parent of another, which may be the parent of another, and so on, ending up with our second variable as a child. In our example, Temperature is an ancestor of Donuts sales because it’s a parent of Iced coffee sales, which is itself a parent of Donuts sales. Very logically, this makes Donuts sales a descendant of Temperature.

If this were a complete diagram, another way of looking at it would be that Temperature influences Donuts sales only through its influence on Iced coffee sales. This makes Iced coffee sales the mediator of the influence of Temperature on Donuts sales.

If a mediator value does not change then the variables earlier in a chain won’t influence the variables further along the chain. For example, if C-Mart experiences a shortage of iced coffee, then we can expect that for the duration of that shortage, changes in temperature will not have an effect on the sales of donuts.

Taking it one step further, the influence that Temperature has on Donuts sales is already completely taken into account when we examine the relationship between Iced coffee sales and Donuts sales. If we were to run a regression of DonutsSales on IcedCoffeeSales without adding Temperature as a variable, it would not matter, because the role of Temperature in DonutsSales would already be captured in the model.
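A quick sketch illustrates this point (the chain coefficients of 1,000 and 0.5, the noise levels, and the `ols` helper are illustrative assumptions): once the mediator Iced coffee sales is in the model, Temperature has essentially nothing left to explain.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2400

# Assumed chain: Temperature -> IcedCoffeeSales -> DonutsSales.
temps = rng.normal(20, 8, n)
iced_coffee = 1000 * temps + rng.normal(0, 2000, n)
donuts = 0.5 * iced_coffee + rng.normal(0, 1000, n)

def ols(y, *cols):
    """Least-squares coefficients with an intercept (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

without_temp = ols(donuts, iced_coffee)
with_temp = ols(donuts, iced_coffee, temps)

print(f"coffee coefficient without temperature: {without_temp[1]:.3f}")
print(f"coffee coefficient with temperature:    {with_temp[1]:.3f}")
print(f"temperature coefficient given coffee:   {with_temp[2]:.1f}")  # near zero
```

The coffee coefficient is essentially unchanged by adding Temperature, and Temperature’s own coefficient is statistically indistinguishable from zero, because its influence flows entirely through the mediator.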

#### Collapsing Chains

The causal diagram above translates into the following regression equations:

DonutsSales = βI · IcedCoffeeSales

IcedCoffeeSales = βT · Temperature

We can replace IcedCoffeeSales by its expression in the second equation:

DonutsSales = βI · (βT · Temperature) = (βI βT) · Temperature

But βIβT is just the product of two constant coefficients, so we can treat it as a new coefficient β′ in itself: DonutsSales = β′ · Temperature.6 We have managed to express DonutsSales as a linear function of temperature, which can in turn be translated into a causal diagram (figure 2.12).

Here, we have collapsed a chain, that is, we have removed the variable in the middle and replaced it with an arrow going from the first variable to the last. By doing so, we have effectively simplified our original causal diagram to focus on the relationship that we’re interested in. This can be useful when the last variable in a chain is a business metric we’re interested in and the first one is actionable. In some circumstances we might be interested in the intermediary relations between temperature and iced coffee sales, and between iced coffee sales and donuts sales, for example to manage pricing or promotions. In other circumstances, we might be interested only in the relation between temperature and donuts sales, for example, to plan for inventory.
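We can check the collapsing argument numerically on a simulated chain (the values βT = 1,000 and βI = 0.5 are illustrative assumptions, so the collapsed coefficient should come out to about 0.5 × 1,000 = 500):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2400

# Assumed chain: Temperature -> IcedCoffeeSales -> DonutsSales,
# with beta_T = 1000 and beta_I = 0.5.
temps = rng.normal(20, 8, n)
iced_coffee = 1000 * temps + rng.normal(0, 2000, n)
donuts = 0.5 * iced_coffee + rng.normal(0, 1000, n)

# Collapsing the chain: regress donuts directly on temperature.
X = np.column_stack([np.ones(n), temps])
slope = np.linalg.lstsq(X, donuts, rcond=None)[0][1]

# The direct coefficient is the product of the two link coefficients.
print(f"donuts-on-temperature coefficient: {slope:.0f}")  # approximately 500
```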

#### Expanding Chains

The collapsing operation can obviously be reversed: we can go from our last CD to the previous one by adding the Iced coffee sales variable in the middle. More generally, we say that we are expanding a chain whenever we inject an intermediary variable between two variables currently connected by an arrow. For example, let’s say that we start with the relationship between temperature and donuts sales (figure 2.12 above). This causal relationship translates into the equation DonutsSales = βT · Temperature. Let’s assume that Temperature affects DonutsSales only through Iced Coffee Sales. We can add this variable to our CD (figure 2.13).

Expanding chains can be useful to better understand what’s happening in a given situation. For example, let’s say that temperature increased but sales of donuts did not. There could be two potential reasons for that:

• First, the increase in temperature did not increase the sales of iced coffee, e.g. because the store manager has been more aggressive with the AC. In other words, the first arrow in figure 2.13 disappeared or weakened.

• Alternatively, the increase in temperature did increase the sales of iced coffee, but the increase in the sales of iced coffee did not increase the sales of donuts, e.g. because people are buying the newly offered biscuits instead. In other words, in figure 2.13, the first arrow is unchanged but the second one disappeared or weakened.

Depending on which one is true, you might take very different corrective actions: either turning off the AC or changing the price of biscuits. In many cases, looking at the variable in the middle of a chain, known as the mediator, will allow you to make better decisions.
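To make collapsing and expanding concrete, here is a minimal Python simulation. The coefficients and noise levels are made up for illustration (they are not C-Mart data): we generate the chain Temperature → Iced coffee sales → Donuts sales, and check that regressing donuts sales directly on temperature recovers the product of the two coefficients along the chain.

```python
import random

random.seed(42)

# Illustrative chain with invented coefficients:
#   IcedCoffeeSales = 2.0 * Temperature + noise
#   DonutsSales     = 0.5 * IcedCoffeeSales + noise
n = 10_000
temperature = [random.gauss(25, 5) for _ in range(n)]
iced_coffee = [2.0 * t + random.gauss(0, 1) for t in temperature]
donuts = [0.5 * c + random.gauss(0, 1) for c in iced_coffee]

def ols_slope(x, y):
    """OLS slope of y on x (with intercept): cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

# Collapsing the chain: the direct regression of donuts sales on
# temperature recovers the product of the two betas, 2.0 * 0.5 = 1.0.
print(round(ols_slope(temperature, donuts), 2))
```

The collapsed coefficient is simply the product of the coefficients along the chain, which is why collapsing and expanding are consistent operations.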

### Forks

When a variable causes two or more effects, the relationship creates a fork. We have seen that temperature causes both iced coffee sales and ice-cream sales, so a representation of this fork would be as in figure 2.14.

This CD shows that temperature influences both iced coffee and ice-cream sales, but that they do not have a causal relationship with each other. If it is hot out, demand for both iced coffee and ice-cream increase, but buying one does not make you want to buy the other, nor does it make you less likely to buy the other.

This situation where two variables have a common cause is very frequent but also potentially problematic, because it creates a correlation between these two variables. It makes sense that when it is hot out, we will see an increase in sales of both, and when it is cold, fewer people will want either. A linear regression predicting ice-cream sales from iced coffee sales would be fairly predictive, but here correlation does not equal causation: the coefficient provided by the model would not measure a causal effect, since we know the true causal impact is 0.

Another way to look at this relationship is that if C-Mart experienced a shortage of iced coffee, we would not expect to see a change in the sale of ice-cream. More generally, it would only be a slight exaggeration to say that forks are one of the main roots of evil in the world of data analysis. Whenever we observe a correlation between two variables that doesn’t reflect direct causality between them (i.e. neither is the cause of the other), more often than not it will be because they share a common cause. From that perspective, one of the main benefits of using CDs is that they can show very clearly and intuitively what’s going on in those cases and how to correct for it.
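Here’s a quick way to see a fork at work, as a Python sketch with made-up coefficients (none of these numbers come from the book’s data): temperature drives both sales series, a naive regression finds a strong but entirely spurious coefficient, and that coefficient vanishes once we control for the common cause.

```python
import random

random.seed(0)

# Fork: Temperature causes both sales series; there is no arrow
# between them, so the true causal effect of one on the other is 0.
n = 10_000
temperature = [random.gauss(25, 5) for _ in range(n)]
iced_coffee = [2.0 * t + random.gauss(0, 1) for t in temperature]
ice_cream = [1.5 * t + random.gauss(0, 1) for t in temperature]

def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Naive regression: strongly "predictive", but purely spurious.
naive = ols_slope(iced_coffee, ice_cream)

# Control for the common cause by first regressing temperature out of
# iced coffee sales (the Frisch-Waugh trick): the slope on the
# residuals equals the multiple-regression coefficient, close to 0.
b_t = ols_slope(temperature, iced_coffee)
residuals = [c - b_t * t for c, t in zip(iced_coffee, temperature)]
controlled = ols_slope(residuals, ice_cream)

print(round(naive, 2), round(controlled, 2))
```

Controlling for the fork’s common cause is exactly what we’ll formalize when we discuss confounders.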

Forks are also typical of situations where we look at demographic variables: age, gender and place of residence all influence a variety of other variables without necessarily any causal relationship between these other variables.

A question that sometimes comes up when you have a fork in the middle of a CD is whether you can still collapse the chain around it. For example, let’s say that we’re interested in analyzing the relationship between Summer Month and Iced Coffee sales and we have the CD in figure 2.15.

In this CD, there’s a fork between Summer Month on one side and Ice-cream sales and Temperature on the other, but there’s also a chain Summer Month → Temperature → Iced coffee sales. Can we collapse the chain?

In this case yes, because Ice-cream sales is not a confounder of the relationship between Summer Month and Iced coffee sales, which is the one we’re interested in. We can simplify our CD as in figure 2.16.

We’ll see in chapter 5 the criteria for determining when our relationship of interest is confounded; when variables are not involved in any confounding, as in the CD above, they can safely be ignored and the CD simplified. However, we can do that only because neither sales of ice-cream nor temperature are confounders of the relationship between summer month and sales of iced coffee. If we were interested in the relationship between summer month and sales of ice-cream in figure 2.11, we could neglect sales of iced coffee but not temperature.

### Colliders

Very few things in the world have only one cause. When two or more variables cause the same outcome, the relationship creates a collider. Since C-Mart’s concession stand sells only two flavors of ice-cream, chocolate and vanilla, a causal diagram representing taste and ice-cream purchasing behavior would show that appetite for either flavor would cause past purchases of ice-cream at the stand. This would be displayed as in figure 2.17.

Colliders are often created when you slice or disaggregate a variable to reveal its components, as we’ll now see.

#### Slicing/Disaggregating Variables

Forks and colliders are often created when you slice or disaggregate a variable to reveal its components. In a previous example, we looked at the relationship between Temperature and Donuts sales, where Iced coffee sales was the mediator (figure 2.18).

But maybe we want to split iced coffee sales by type to better understand demand dynamics. This is what I mean by “slicing” a variable. This is allowed, because we can express the total iced coffee sales as the sum of sales by type, say Americano and Latte:

IcedCoffeeSales = IcedAmericanoSales + IcedLatteSales

Our CD would now become figure 2.19, with a fork on the left and a collider on the right.

Each slice of the variable would have its own equation:

IcedAmericanoSales = βT,A.Temperature

IcedLatteSales = βT,L.Temperature

Since the effect of Temperature on Donuts sales is mediated by our Iced coffee sales slices, we can create a unified multiple regression for Donuts sales as follows:

DonutsSales = βIA.IcedAmericanoSales + βIL.IcedLatteSales

This would allow you to understand more finely what’s happening—should you plan for the same increase in sales in both types when temperature increases? Do they both have the same effect on Donuts sales or should you try to favor one of them?
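As a sketch of that multiple regression (with invented coefficients and noise, purely for illustration), we can simulate the sliced diagram and recover a separate effect for each type of iced coffee, even though both slices are driven by temperature:

```python
import random

random.seed(1)

# Invented coefficients: temperature drives both slices, and each
# slice has its own effect on donuts sales (0.3 and 0.6 here).
n = 10_000
temp = [random.gauss(25, 5) for _ in range(n)]
americano = [1.0 * t + random.gauss(0, 1) for t in temp]
latte = [2.0 * t + random.gauss(0, 1) for t in temp]
donuts = [0.3 * a + 0.6 * l + random.gauss(0, 1)
          for a, l in zip(americano, latte)]

# Two-regressor OLS without intercept, via the normal equations,
# matching the equation DonutsSales = bIA.x1 + bIL.x2 above.
s11 = sum(a * a for a in americano)
s12 = sum(a * l for a, l in zip(americano, latte))
s22 = sum(l * l for l in latte)
sy1 = sum(a * d for a, d in zip(americano, donuts))
sy2 = sum(l * d for l, d in zip(latte, donuts))
det = s11 * s22 - s12 * s12
b_ia = (s22 * sy1 - s12 * sy2) / det
b_il = (s11 * sy2 - s12 * sy1) / det

print(round(b_ia, 2), round(b_il, 2))  # recovers roughly 0.3 and 0.6
```

With separate coefficients in hand, you can answer the inventory and promotion questions for each slice individually.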

#### Aggregating Variables

As you may have guessed, slicing variables can be reversed, and more generally we can aggregate variables that have the same causes and effects. This can be used to aggregate and disaggregate data analysis by product, region, line of business, etc. But it can also be used more loosely, to represent important causal factors that are not precisely defined. For example, let’s say that age and gender both impact taste for vanilla ice-cream as well as the propensity to buy ice-cream at the C-Mart concession stand (figure 2.20).

Because Age and gender have the same causal relationships, they can be aggregated into a Demographics variable (figure 2.21).

In this case, we obviously don’t have a single column in our data called “Demographics”; we’re simply using that variable in our CD as a shortcut for a variety of variables that we may or may not want to explore in further detail later on. Let’s say that we want to run an A/B test and we want to understand the causal relationships at hand. As we’ll see later, randomization can allow us to control for demographic factors so that we won’t have to include them in our analysis, but we might want to include them in our CD of the situation without randomization. If need be, we can always expand our diagram to accurately represent the demographic variables involved. Remember however that any variable can be split, but only variables that have the same direct and indirect relationships can be aggregated.

### Cycles

In the three fundamental structures that we’ve seen, there was only one arrow between two given boxes. More generally, it was not possible to reach the same variable twice by following the direction of the arrows (e.g. A → B → C → A). A variable could be the effect of one variable and the cause of another, but it could not be at the same time the cause and the effect of the same variable.

In real life however, we often see variables that influence each other causally. This type of CD is called a cycle. Cycles can arise for a variety of reasons; two of the most common in behavioral data analysis are substitution effects and feedback loops. Fortunately, there are some workarounds that will allow you to deal with cycles when you encounter them.

#### Understanding Cycles: Substitution Effects and Feedback Loops

Substitution effects are a cornerstone of economic theory: customers might substitute one product for another, depending on the products’ availability, price, and the customers’ desire for variety. For example, customers coming to the C-Mart concession stand might choose between iced coffee and hot coffee based on temperature, but also special promotions and how often they had coffee that week. Therefore, there is a causal relationship from purchases of iced coffee to purchases of hot coffee, and another causal relationship in the opposite direction (figure 2.22).

One thing to note is that the direction of the arrows shows the direction of causality (what is the cause and what is the effect), not the sign of the effect. In all of the CDs we looked at before, the variables had a positive relationship, where an increase in one caused an increase in the other. In this case, the relationships are negative: an increase in one variable will cause a decrease in the other. The sign of the effect does not matter for causal diagrams, and a regression will be able to sort out the sign of the coefficient correctly as long as you correctly identify the relevant causal relationships.

Another common cycle is a feedback loop, where an actor modifies their behavior in reaction to changes in the environment. For example, a store manager at C-Mart might keep an eye on the length of waiting lines and open new lines if the existing ones get too long, so that customers don’t give up and just leave (figure 2.23).

#### Managing Cycles

Cycles reflect situations that are often complex to study and manage, which is why a whole field of research, called systems thinking, has sprouted for that purpose7. Complex mathematical methods, such as Structural Equation Modeling, have been developed to deal accurately with cycles, but their analysis would take us beyond the scope of this book. I would be remiss however if I didn’t give you any solution, so I’ll mention two rules of thumb that should allow you to not get stuck with cycles.

The first one is to pay close attention to timing. In almost all cases, it takes some time for one variable to influence another, which means you can “break the cycle” and turn it into a noncyclical CD by looking at your data at a more granular level of time. For example, let’s say that it takes 15 minutes for a store manager to react to an increasing waiting time by opening new lines, and it similarly takes 15 minutes for customers to adjust their perception of waiting time. In that case, we can rewrite the CD above as in figure 2.24.

Let’s break this CD down into pieces. On the left, we have an arrow from average waiting time to number of customers waiting:

NbCustomersWaiting(t+15min) = β1.AvgWaitingTime(t)

This means that the number of customers waiting at say 9:15am would be expressed as a function of the average waiting time at 9:00am. Then the number of customers waiting at 9:30am would have the same relation to the average waiting time at 9:15am and so on.

Similarly, on the right, we have an arrow from average waiting time to number of lines open:

NbLinesOpen(t+15min) = β2.AvgWaitingTime(t)

This means that the number of lines open at 9:15am would be expressed as a function of the average waiting time at 9:00am. Then the number of lines open at 9:30am would have the same relation to the average waiting time at 9:15am and so on.

Then in the middle, we have causal arrows from the number of customers waiting and from the number of lines open to the average waiting time. This would translate into the equation

AvgWaitingTime(t) = β3.NbCustomersWaiting(t)+β4.NbLinesOpen(t)

This means that the average waiting time for customers reaching the checkout lines at 9:15am depends on the number of customers already present and the number of checkout lines open at 9:15am. Then the average waiting time for customers reaching the checkout lines at 9:30am depends on the number of customers already present and the number of checkout lines open at 9:30am, and so on.

By breaking down variables into time increments, we have been able to create a CD where there is no cycle in the strict sense. We can estimate the three linear regression equations above without introducing any circular logic.
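As a sketch of that de-cycled diagram, here is a small Python simulation (the dynamics and coefficients are invented for illustration): we generate the three series in 15-minute steps and then recover β1 by regressing the number of customers waiting on the previous period’s average waiting time, with no circular logic involved.

```python
import random

random.seed(7)

# Invented dynamics, in 15-minute steps:
#   NbCustomersWaiting(t+1) = 0.5 * AvgWaitingTime(t) + noise
#   NbLinesOpen(t+1)        = 0.3 * AvgWaitingTime(t) + noise
#   AvgWaitingTime(t)       = 0.8 * NbCustomersWaiting(t)
#                             - 0.6 * NbLinesOpen(t) + noise
steps = 5_000
waiting, lines, avg_wait = [0.0], [0.0], []
for t in range(steps):
    aw = 0.8 * waiting[t] - 0.6 * lines[t] + random.gauss(0, 1)
    avg_wait.append(aw)
    waiting.append(0.5 * aw + random.gauss(0, 1))
    lines.append(0.3 * aw + random.gauss(0, 1))

def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Estimate beta1: regress NbCustomersWaiting(t+1) on AvgWaitingTime(t).
# Because the lag breaks the cycle, plain OLS recovers roughly 0.5;
# beta2 and the middle equation can be estimated the same way.
beta1 = ols_slope(avg_wait, waiting[1:])
print(round(beta1, 2))
```

The key point is that each regression only uses variables from an earlier time step as predictors, so no equation feeds into itself.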

The second rule of thumb to deal with cycles is to simplify your CD and keep only the arrows along the causal path you’re most interested in. Feedback effects (where a variable influences the variable that just influenced it) are generally smaller, and often much smaller, than the first effect and can be ignored as a first approximation.

In our example of iced and hot coffee, you might be worried that the increase in sales of iced coffee when it is hot will decrease the sales of hot coffee; this is a reasonable concern that you should investigate. However, it’s unlikely that the decrease in sales of hot coffee would in turn trigger a further increase in sales of iced coffee, and you can ignore that feedback effect in your CD (figure 2.25).

In figure 2.25, we have deleted the arrow from Purchases of hot coffee to Purchases of iced coffee and ignored that relationship, as a reasonable approximation.

Once again, this is just a rule of thumb, and certainly not a blanket invitation to disregard cycles and feedback effects. These should be represented fully in your complete CD, to guide future analyses.

### Review of Elements in Causal Diagrams

Chains, forks and colliders represent the only three possible ways for three variables to be related to each other in a CD. They are not mutually exclusive, however, and it’s actually reasonably common to have three variables that exhibit all three structures at the same time, as was the case in our very first example (figure 2.26).

Here, Summer month influences Ice-cream sales as well as temperature, which itself influences Ice-cream sales. The causal relationships at play are reasonably simple and easy to grasp, but this graph also contains all three types of basic relationships:

• A chain: Summer month → Temperature → Ice-cream sales

• A fork, with Summer month causing both Temperature and Ice-cream sales

• A collider, with Ice-cream sales being caused both by Temperature and Summer month

Another thing to note in a situation like this one is that variables can have more than one relationship with each other. For example, Summer month is the parent of Ice-cream sales because there is an arrow going directly from the former to the latter (a direct relationship); but at the same time, Summer month is also an ancestor of Ice-cream sales because of the chain Summer month → Temperature → Ice-cream sales (an indirect relationship). So you can see these are not exclusive!

### Paths

Having seen the various ways variables can interact, we can now introduce one last concept that encompasses all of them: paths. We say that there is a path between two variables if you can go from one to the other by following arrows, regardless of the direction of those arrows, as long as no variable appears twice along the way. Let’s see what that looks like in a CD we have seen before (figure 2.28).

In the previous CD, there are two paths from Summer month to Iced coffee sales:

• One path along the chain Summer month → Temperature → Iced coffee sales,

• A second path through Ice-cream sales, Summer month → Ice-cream sales ← Temperature → Iced coffee sales

This means that a chain is a path, but so is a fork or a collider! Note also that two different paths between the same two variables can share some arrows, as long as there is at least one difference between them, as is the case here: the arrow from Temperature to Iced coffee sales appears in both paths.

However, the following is not a valid path between Temperature and Iced Coffee sales because Temperature appears twice:

• Temperature ← Summer Month → Ice-cream sales ← Temperature → Iced Coffee sales

One consequence of these definitions is that if you pick two different variables in a connected CD, there is always at least one path between them. The definition of paths may seem so broad that it is useless, but as we’ll see in chapter 5, paths will actually play a crucial role in identifying confounders in a CD.
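To make the definition concrete, here is a short Python sketch (variable names abbreviated from figure 2.28, which is an assumption of mine about the diagram’s exact edges) that enumerates paths by treating the arrows as undirected and never visiting the same variable twice:

```python
# Directed edges of the CD in figure 2.28, with abbreviated names.
edges = [
    ("SummerMonth", "Temperature"),
    ("SummerMonth", "IceCreamSales"),
    ("Temperature", "IceCreamSales"),
    ("Temperature", "IcedCoffeeSales"),
]

# Paths ignore arrow direction, so build an undirected adjacency list.
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def simple_paths(node, end, path):
    """All ways to reach `end` from `node` without repeating a variable."""
    path = path + [node]
    if node == end:
        yield path
        return
    for nxt in sorted(adj[node]):
        if nxt not in path:
            yield from simple_paths(nxt, end, path)

paths = list(simple_paths("SummerMonth", "IcedCoffeeSales", []))
for p in paths:
    print(" -> ".join(p))
```

Running this prints exactly the two paths listed above, and rejecting candidates that revisit a variable is what rules out the invalid path through Temperature twice.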

# Chapter Conclusion

Linear and logistic regressions are the workhorses of data analysis, but their results can be biased by the presence of confounders. Unfortunately, as we’ve seen through examples, simply throwing all available variables and the kitchen sink into a regression is not sufficient to resolve confounding. Worse, controlling for the wrong variables can introduce spurious correlations and create new biases.

As a first step toward unbiased regression, I introduced a tool, causal diagrams. CDs may be the best analytical tool you’ve never heard of. They can be used to represent abstract causal relationships in the real world, as well as the correlations present in our data; but they are most powerful as a bridge between the two, allowing us to connect our intuition and expert knowledge to observed correlations, and vice versa.

CDs can get convoluted and complex, but they are based on three simple building blocks: chains, forks and colliders. They can also be collapsed or expanded, sliced or aggregated, according to simple rules that are consistent with linear algebra.

The full power of CDs will become apparent in chapter 5, where we’ll see that they allow us to optimally handle confounders in regression, even with non-experimental data. But CDs are also helpful more broadly, to help us think better about data. In the next chapter, as we get into cleaning and prepping data for analysis, they will allow us to remove biases in our data prior to any analysis. This will give you the opportunity to get more familiar with CDs in a simple setting.

# References

• Pearl, Causality, Cambridge University Press, 2009. Pearl’s earlier book on causality, with detailed graduate-level math.

• Pearl & Mackenzie, The Book of Why: The New Science of Cause and Effect, Basic Books, 2018. The most approachable introduction to causal analysis and causal diagrams I have encountered so far, by one of the prominent researchers in the field.

• Shipley, Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference with R, Cambridge University Press, 2016. You’re not a biologist? Neither am I. That book has still helped me deepen my understanding of causal diagrams, and with the limited number of books on the topic, beggars can’t be choosers.

# Exercises

Building causal diagrams is like swimming or riding a bike: no amount of theoretical preparation can replace trying to do it again and again until it works. However, as you get the hang of it, it gets more and more enjoyable and you’ll find yourself quickly drawing a CD to analyze or explain a situation.

My hope is that these exercises will offer you a gentle learning curve that will minimize the pain along the way.

Exercise 1. The following descriptions relate to a C-Mart located across the street from a university campus. In each case, draw the corresponding causal diagram and give the name of the fundamental structure it represents.

1. Sales of alcohol are higher on certain days of the week, namely Friday and Saturday; whenever sales of alcohol are high on a given day, sales of aspirin are higher the next day.

2. Sales of ramen “120 for the price of 100!” maxi-packs are higher in September; sales of pens and paper are higher in September.

3. Sales of alcohol are higher on certain days of the week, namely Friday and Saturday; sales of alcohol are higher during Spring Break, regardless of the day of the week.

Exercise 2. Complete the sentences for the CD in figure 2.29.

1. Electronic Toy sales is the parent of ___

2. Eggnog sales is the child of ___

3. Battery sales is the descendant of ___

4. December has ___ (a direct/an indirect) relationship with Electronic Toy sales

5. December has ___ (a direct/an indirect) relationship with Battery sales

1 The Book of Why, p. 160. In case you’re wondering, the aforementioned statistician is Donald Rubin.

3 Technically speaking, this is a slightly different situation, because there is a threshold effect instead of two linear (or logistic) relationships, but the underlying principle that including the wrong variable can create artificial correlations remains.

4 Xiaolin Wu & Xi Zhang, “Automated Inference on Criminality Using Face Images”, https://arxiv.org/pdf/1611.04135.pdf.

5 You’ve probably seen that picture already, but just in case: https://www.illusionsindex.org/i/duck-rabbit.

6 In the Early Release, this is incorrectly rendered. There should be a tilde spanning the top of βT.

7 Interested readers are referred to Thinking in Systems: A Primer by Donella Meadows and Diana Wright, as well as The Fifth Discipline: The Art & Practice of The Learning Organization by Peter Senge.
