THE ESSENTIAL STATS 39

OK, the difference is not significant. If it were, we would check whether it is large, that is, whether the difference matters.


When we ran the TwoWayAnovaConf.py code we got:

Observed F-statistic: 0.93

We have 90.0 % confidence that the true F-statistic is between: 0.50 and 8.44

***** Bias Corrected Confidence Interval *****

We have 90.0 % confidence that the true F-statistic is between: 0.20 and 2.04

4.7 Linear Regression

4.7.1 Why and when

Linear regression is closely tied to linear correlation. Here we try to find the line that best fits our data so that we can use it to predict y given a new x. Correlation measures how tightly our data fits that line, and therefore how good we expect our prediction to be.

4.7.2 Calculate with example

The first step is to draw a scatter plot of your data. Traditionally, the independent variable is placed along the x-axis, and the dependent variable is placed along the y-axis. The dependent variable is the one that we expect to change when the independent one changes. Note whether the data looks as if it lies approximately along a straight line. If it has some other pattern, for example, a curve, then a linear regression should not be used.

The most common method used to find a regression line is the least squares method. Remember that the equation of a line is:

y = bx + a

where b equals the slope of the line and a is where the line crosses the y-axis.

The least squares method minimizes the sum of the squared vertical distances between our data points (the observed values) and our line (the predicted values).

The equation of the line we are trying to find is:

y' = bx + a

where y' is the predicted value of y for some x.
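To make "minimizes" concrete, here is a minimal sketch, assuming the usual least squares criterion (the sum of squared vertical distances). The helper name sum_squared_residuals and the toy data are hypothetical and are not part of the book's code.

```python
def sum_squared_residuals(xs, ys, b, a):
    """Sum of squared vertical distances between each observed y
    and the predicted value y' = b*x + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Toy data (hypothetical), chosen to lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

# A line through every point has no vertical error at all.
perfect = sum_squared_residuals(xs, ys, b=2, a=1)

# The same slope shifted up by 1 misses every point by 1.
worse = sum_squared_residuals(xs, ys, b=2, a=2)

print(perfect, worse)
```

Of all possible choices of b and a, the least squares line is the one that makes this sum as small as possible.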

BE CAREFUL

Regression can be misleading when there are outliers or a nonlinear relationship.
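A small illustration of the outlier warning, as a sketch under stated assumptions: the data is made up, the helper name least_squares_slope is hypothetical, and the slope is computed with the standard least squares formula (sum of products divided by sum of squares for x).

```python
def least_squares_slope(xs, ys):
    """Slope of the least squares line: sum of products / sum of squares for x."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sp = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    ss = sum((x - x_mean) ** 2 for x in xs)
    return sp / ss

# Made-up data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
slope_clean = least_squares_slope(xs, ys)

# Add a single outlier at (6, 0) and fit again.
slope_outlier = least_squares_slope(xs + [6], ys + [0])

print(f"without outlier: {slope_clean:.2f}, with outlier: {slope_outlier:.2f}")
```

In this toy case one stray point is enough to pull the slope from 2 down to well below 1, which is why it pays to look at the scatter plot first.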


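We have to calculate b and a. The slope is:

$$b = \frac{SP_{XY}}{SS_X}$$

where $SP_{XY}$ is the sum of products:

$$SP_{XY} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

and $SS_X$ is the sum of squares for X:

$$SS_X = \sum_i (x_i - \bar{x})^2$$

In our example, $SP_{XY}$ = 230.5 and $SS_X$ = 163677, and 230.5 / 163677 = 0.0014, so b = 0.0014.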
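The arithmetic can be checked with a short script. This is a sketch, not the book's own code: the variable names are our own, the x and y values are copied from the example table in this section, and the means are rounded the same way the table rounds them (1353 and 3.5) so that the totals match.

```python
# x_i and y_i copied from the example table.
xs = [1350, 1510, 1420, 1210, 1250, 1300, 1580, 1310, 1290, 1320, 1490, 1200, 1360]
ys = [3.6, 3.8, 3.7, 3.3, 3.9, 3.4, 3.8, 3.7, 3.5, 3.4, 3.8, 3.0, 3.1]

# The table works with rounded means, so we round too (assumption made
# only to reproduce the table's totals).
x_mean = round(sum(xs) / len(xs))       # 1353
y_mean = round(sum(ys) / len(ys), 1)    # 3.5

# Sum of products and sum of squares for X.
sp_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
ss_x = sum((x - x_mean) ** 2 for x in xs)

# Slope of the regression line.
b = sp_xy / ss_x

print(f"SP_XY = {sp_xy:.1f}, SS_X = {ss_x}, b = {b:.4f}")
```

With unrounded means the totals would differ slightly from the table, so the rounding step matters if you want to match the printed numbers exactly.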

The regression line will always pass through the point $(\bar{x}, \bar{y})$, so we can plug this point into our equation to get a, where the line crosses the y-axis.
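Plugging the point of means into y' = bx + a and solving for a can be sketched as follows. The names are our own, the means 1353 and 3.5 are the rounded values used in the example, and the slope is taken as 230.5 / 163677 without rounding, so the last digits may differ a little from a hand calculation that uses b = 0.0014.

```python
x_mean, y_mean = 1353, 3.5    # rounded means from the example
b = 230.5 / 163677            # slope: sum of products / sum of squares for X

# The line passes through (x_mean, y_mean), so y_mean = b * x_mean + a.
a = y_mean - b * x_mean

def predict(x):
    """Predicted value y' for a given x on the fitted line."""
    return b * x + a

print(f"a = {a:.1f}")
```

Either way the intercept comes out at about 1.6, and predict(1353) returns 3.5, confirming that the line passes through the point of means.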

x_i      y_i    x̄      ȳ      x_i - x̄    y_i - ȳ    (x_i - x̄)(y_i - ȳ)    (x_i - x̄)^2
1350     3.6    1353   3.5        -3        0.1              -0.3                  9
1510     3.8    1353   3.5       157        0.3              47.1              24649
1420     3.7    1353   3.5        67        0.2              13.4               4489
1210     3.3    1353   3.5      -143       -0.2              28.6              20449
1250     3.9    1353   3.5      -103        0.4             -41.2              10609
1300     3.4    1353   3.5       -53       -0.1               5.3               2809
1580     3.8    1353   3.5       227        0.3              68.1              51529
1310     3.7    1353   3.5       -43        0.2              -8.6               1849
1290     3.5    1353   3.5       -63        0                 0                 3969
1320     3.4    1353   3.5       -33       -0.1               3.3               1089
1490     3.8    1353   3.5       137        0.3              41.1              18769
1200     3.0    1353   3.5      -153       -0.5              76.5              23409
1360     3.1    1353   3.5         7       -0.4              -2.8                 49
Totals   -      -      -          -         -               230.5             163677
