APPENDIX B

if observed_r > 0:
    print "greater than or equal to",
else:
    print "less than or equal to",
print "%.2f" % observed_r, "is %.2f" % (count / float(num_shuffles))

CorrelationConf.py

#!/usr/bin/python

######################################
# Linear Correlation Confidence Interval
# From: Statistics is Easy! By Dennis Shasha and Manda Wilson
#
# Uses shuffling & bootstrapping to get a 90% confidence interval for the r statistic.
#
# Author: Manda Wilson
#
# Example of FASTA formatted input file:
# >grp_x
# 1350 1510 1420 1210 1250 1300 1580 1310 1290 1320 1490 1200 1360
# >grp_y
# 3.6 3.8 3.7 3.3 3.9 3.4 3.8 3.7 3.5 3.4 3.8 3.0 3.1
#
# Included in the code, but NOT in the pseudocode,
# is the bias-corrected confidence interval.
# See http://www.statisticsiseasy.com/BiasCorrectedConfInter.html
# for more information on bias-corrected confidence intervals.
#
# Pseudocode:
#
# 1. Calculate r for the observed values (in our example it is .58).
#    a. mean_x = sum of x values / number of x values
#    b. mean_y = sum of y values / number of y values
#    c. Calculate the sum of products:
#       i. Initialize total to 0
#       ii. For each pair of x, y values:
#          I. product = (x - mean_x) * (y - mean_y)
#          II. total += product
#    d. Calculate the sum of squares for the x values:
#       i. Initialize total to 0
#       ii. For each x value:
#          I. diff_sq = (x - mean_x)^2
#          II. total += diff_sq
#    e. Calculate the sum of squares for the y values as we did for the x values in step (1d)
#    f. r = sum of products / square root(sum of squares x * sum of squares y)
#
# 2. Do the following 10,000 times:
#    a. Create a bootstrap sample:
#       i. Create two new arrays of the same size as the original sample arrays (one for x and one for y)
#       ii. Fill the new arrays with randomly picked pairs from the original arrays (randomly picked with replacement)
#    b. Compute r on the bootstrap samples, just as we did in step (1) with the original samples.
#
# 3. Sort the r values computed in step (2).
#
# 4. Compute the size of each interval tail. If we want a 90% confidence interval, then 1 - 0.9 yields the
#    portion of the interval in the tails. We divide this by 2 to get the size of each tail, in this case 0.05.
#
# 5. Compute the upper and lower bounds. To get the lower bound we multiply the tail size by the number of
#    bootstraps we ran and round this value up to the nearest integer (to make sure this value maps
#    to an actual bootstrap we ran). In this case we multiply 0.05 by 10,000 and get 500, which rounds up to 500.
#
#    To compute the upper bound we subtract the tail size from 1 and then multiply that by the number of bootstraps
#    we ran. Round the result of the last step down to the nearest integer. Again, this is to ensure this value
#    maps to an actual bootstrap we ran. We round the lower bound up and the upper bound down to reduce the
#    confidence interval size, so that we can still say we have as much confidence in this result.
#
# 6. The bootstrap values at the lower bound and upper bound give us our confidence interval.
#
######################################

import math
import random

######################################
#
# Adjustable variables
#
######################################

input_file = 'input/Correlation.vals'
conf_interval = 0.9

######################################
#
# Subroutines
#
######################################
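The listing breaks off here. As a rough illustration of step 1 of the pseudocode, Pearson's r can be computed as follows. This is a minimal sketch in Python 3 (the book's scripts are Python 2); the helper name corr_r is mine, not the book's, and the data are the sample values from the FASTA example above.

```python
import math

def corr_r(xs, ys):
    # Sketch of step 1 of the pseudocode; corr_r is an illustrative
    # name, not a function from the book's script.
    mean_x = sum(xs) / float(len(xs))                                  # step 1a
    mean_y = sum(ys) / float(len(ys))                                  # step 1b
    sum_products = sum((x - mean_x) * (y - mean_y)
                       for x, y in zip(xs, ys))                        # step 1c
    sum_squares_x = sum((x - mean_x) ** 2 for x in xs)                 # step 1d
    sum_squares_y = sum((y - mean_y) ** 2 for y in ys)                 # step 1e
    return sum_products / math.sqrt(sum_squares_x * sum_squares_y)     # step 1f

# The sample data from the FASTA example above.
grp_x = [1350, 1510, 1420, 1210, 1250, 1300, 1580, 1310, 1290, 1320, 1490, 1200, 1360]
grp_y = [3.6, 3.8, 3.7, 3.3, 3.9, 3.4, 3.8, 3.7, 3.5, 3.4, 3.8, 3.0, 3.1]
print("r = %.2f" % corr_r(grp_x, grp_y))  # prints: r = 0.58
```

This matches the observed r of .58 mentioned in step 1 of the pseudocode.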
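Steps 4 through 6 reduce to a little index arithmetic on the sorted bootstrap values. A minimal sketch (Python 3; conf_interval_bounds is an illustrative name, not from the book's script), assuming the per-tail size is passed in directly (0.05 for a 90% interval):

```python
import math

def conf_interval_bounds(sorted_vals, tail_size):
    num = len(sorted_vals)
    # Step 5: round the lower index up and the upper index down, so that
    # both indices map onto bootstrap values we actually computed.
    lower_index = int(math.ceil(tail_size * num))
    upper_index = int(math.floor((1 - tail_size) * num))
    # Step 6: the bootstrap values at these positions bound the interval.
    return sorted_vals[lower_index], sorted_vals[upper_index]

# With 10,000 sorted bootstrap values and a 90% interval, the bounds
# sit at positions 500 and 9500, as described in step 5.
print(conf_interval_bounds(list(range(10000)), 0.05))  # prints: (500, 9500)
```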
