Chapter 7. Relationships Between Variables
So far we have only looked at one variable at a time. In this chapter we look at relationships between variables. Two variables are related if knowing one gives you information about the other. For example, height and weight are related; people who are taller tend to be heavier. Of course, it is not a perfect relationship: there are short heavy people and tall light ones. But if you are trying to guess someone’s weight, you will be more accurate if you know their height than if you don’t.
The code for this chapter is in scatter.py. For information about downloading and working with this code, see Using the Code.
Scatter Plots
The simplest way to check for a relationship between two variables is a scatter plot, but making a good scatter plot is not always easy. As an example, I’ll plot weight versus height for the respondents in the BRFSS (see The lognormal Distribution).
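Before turning to the real data, here is a minimal sketch of a basic scatter plot. It uses matplotlib directly and synthetic numbers, not the BRFSS; the means, slope, and noise level are assumptions chosen only to produce a plausible-looking cloud of points, and the output filename is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(17)

# Synthetic stand-ins for height (cm) and weight (kg). These are NOT BRFSS
# values; every parameter here is an assumption made for illustration.
heights = rng.normal(168, 10, size=5000)
weights = 78 + 0.9 * (heights - 168) + rng.normal(0, 15, size=5000)

fig, ax = plt.subplots()
# Small, semi-transparent markers keep dense regions from becoming a solid blob.
points = ax.scatter(heights, weights, s=10, alpha=0.2)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
fig.savefig("scatter_sketch.png")  # hypothetical output filename
```

With 5000 points, marker size and transparency matter a lot: opaque default markers would hide how many points overlap in the middle of the cloud.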
Here’s the code that reads the data file and extracts height and weight:
import brfss
import thinkstats2

df = brfss.ReadBrfss(nrows=None)
sample = thinkstats2.SampleRows(df, 5000)
heights, weights = sample.htm3, sample.wtkg2
SampleRows chooses a random subset of the data:
def SampleRows(df, nrows, replace=False):
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample
df is the DataFrame, nrows is the number of rows to choose, and replace is a boolean indicating whether sampling should be done with replacement; in other words, whether the same row could be chosen more ...
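To see what the replace flag changes in practice, here is a small sketch that applies the same function to a toy DataFrame. The five rows are made-up values, not BRFSS records; the column names are borrowed from the example above only for familiarity.

```python
import numpy as np
import pandas as pd

def SampleRows(df, nrows, replace=False):
    """Choose a random subset of rows from a DataFrame."""
    indices = np.random.choice(df.index, nrows, replace=replace)
    return df.loc[indices]

np.random.seed(17)

# Toy data for illustration only; these are not real survey values.
toy = pd.DataFrame({
    "htm3": [150.0, 160.0, 170.0, 180.0, 190.0],
    "wtkg2": [55.0, 62.0, 70.0, 81.0, 95.0],
})

# Without replacement, asking for all 5 rows returns each row exactly once,
# in shuffled order.
without = SampleRows(toy, 5)

# With replacement, we can draw more rows than the frame holds, so at least
# one row must appear more than once.
with_repl = SampleRows(toy, 8, replace=True)

print(sorted(without.index))           # each label 0..4 appears once
print(with_repl.index.has_duplicates)  # repeated labels are guaranteed here
```

Asking for more rows than the DataFrame contains with replace=False makes np.random.choice raise a ValueError, which is a useful guard against accidentally oversampling.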