Chapter 7. Relationships Between Variables
So far we have only looked at one variable at a time. In this chapter we look at relationships between variables. Two variables are related if knowing one gives you information about the other. For example, height and weight are related; people who are taller tend to be heavier. Of course, it is not a perfect relationship: there are short heavy people and tall light ones. But if you are trying to guess someone’s weight, you will be more accurate if you know their height than if you don’t.
The code for this chapter is in scatter.py. For information about downloading and working with this code, see Using the Code.
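To make the idea of "related" variables concrete, here is a minimal sketch that is not part of scatter.py. It assumes synthetic heights and weights (invented for illustration, not BRFSS data) and compares two ways of guessing weight: always guessing the overall mean, and guessing the mean weight of people in the same 5 cm height bin. The bin width, the random seed, and all variable names are assumptions.

```python
import numpy as np

# Synthetic data for illustration only; these are NOT the BRFSS measurements.
rng = np.random.default_rng(17)
heights = rng.normal(170, 10, size=5000)                               # cm
weights = 75 + 0.9 * (heights - 170) + rng.normal(0, 12, size=5000)    # kg

# Guess 1: ignore height and always guess the overall mean weight.
error_without = np.sqrt(np.mean((weights - weights.mean()) ** 2))

# Guess 2: group people into 5 cm height bins and guess the mean
# weight of the bin each person falls into.
bins = np.floor(heights / 5).astype(int)
guesses = np.empty_like(weights)
for b in np.unique(bins):
    in_bin = bins == b
    guesses[in_bin] = weights[in_bin].mean()
error_with = np.sqrt(np.mean((weights - guesses) ** 2))

# Root mean squared error of each guess, in kg.
print(error_without, error_with)
```

If the two variables are related, the second error should be smaller than the first, which is what "knowing one gives you information about the other" means in practice.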
Scatter Plots
The simplest way to check for a relationship between two variables is a scatter plot, but making a good scatter plot is not always easy. As an example, I’ll plot weight versus height for the respondents in the BRFSS (see The lognormal Distribution).
Here’s the code that reads the data file and extracts height and weight:
df = brfss.ReadBrfss(nrows=None)
sample = thinkstats2.SampleRows(df, 5000)
heights, weights = sample.htm3, sample.wtkg2
SampleRows chooses a random subset of the data:
def SampleRows(df, nrows, replace=False):
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample
df is the DataFrame, nrows is the number of rows to choose, and
replace is a boolean indicating whether sampling should be done with replacement; in other words, whether the same row could be chosen more ...
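The effect of the replace argument can be seen in a small sketch. It repeats the SampleRows logic from the text so that it runs on its own, and it assumes a toy ten-row DataFrame invented for illustration; the column name x and the sample sizes are not from the original.

```python
import numpy as np
import pandas as pd

def SampleRows(df, nrows, replace=False):
    """Same logic as the function in the text, repeated so this runs alone."""
    indices = np.random.choice(df.index, nrows, replace=replace)
    return df.loc[indices]

# Toy DataFrame with ten rows, made up for illustration.
df = pd.DataFrame({'x': range(10)})

# Without replacement, each row can be chosen at most once.
without = SampleRows(df, 10)

# With replacement we can ask for more rows than exist, so some
# rows must appear more than once.
with_repl = SampleRows(df, 20, replace=True)

print(without.index.is_unique)
print(with_repl.index.is_unique)
```

With replace=False, asking for more rows than the DataFrame contains raises a ValueError, because there are not enough distinct rows to choose from.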