 APPENDIX B 137
if observed_r > 0:
print "greater than or equal to",
else:
print "less than or equal to",
print "%.2f" %observed_r, "is %.2f" % (count / float(num_shuffles))
CorrelationConf.py
#!/usr/bin/python
######################################
# Linear Correlation Confidence Interval
# From: Statistics is Easy! By Dennis Shasha and Manda Wilson
#
#
# Uses shuffling & bootstrapping to get a 90% confidence interval for the r statistic.
#
# Author: Manda Wilson
#
# Example of FASTA formatted input file:
# >grp_x
# 1350 1510 1420 1210 1250 1300 1580 1310 1290 1320 1490 1200 1360
# >grp_y
# 3.6 3.8 3.7 3.3 3.9 3.4 3.8 3.7 3.5 3.4 3.8 3.0 3.1
#
# Included in the code, but NOT in the pseudocode,
# is the bias-corrected confidence interval.
# See http://www.statisticsiseasy.com/BiasCorrectedConfInter.html
#
# Pseudocode:
#
# 1. Calculate r for the observed values (in our example it is .58).
# a. mean_x = sum of x values / number of x values
# b. mean_y = sum of y values / number of y values
# c. Calculate the sum of products:
# i. Initialize total to 0
# ii. For each pair of x, y values:
# I. product = (x - mean_x) * (y - mean_y)
# II. total += product
# d. Calculate the sum of squares for the x values:
# i. Initialize total to 0
# ii. For each x value:
# I. diff_sq = (x - mean_x)^2
# II. total += diff_sq
# e. Calculate the sum of squares for the y values as we did for the x values in
step (1d)
# f. r = sum of products / square root(sum of squares x * sum of squares y)
#
# 2. Do the following 10,000 times:
# a. 138 STATISTICS IS EASY!
# i. Create two new arrays of the same size as the original sample arrays (one
for x and one for y)
# ii. Fill the new arrays with randomly picked pairs from the original arrays
(randomly picked with replacement)
# b. Compute the regression line on the bootstrap samples, just as we did in step
(1)
# with the original samples.
#
# 3. Sort the differences computed in step (2).
#
# 4. Compute the size of each interval tail. If we want a 90% confidence interval,
then 1 - 0.9 yields the
# portion of the interval in the tails. We divide this by 2 to get the size of each
tail, in this case 0.05.
#
# 5. Compute the upper and lower bounds. To get the lower bound we multiply the tail
size by the number of
# bootstraps we ran and round this value up to the nearest integer (to make sure
this value maps
# to an actual boostrap we ran). In this case we multiple 0.05 by 10,000 and get
500, which rounds up to 500.
#
# To compute the upper bound we subtract the tail size from 1 and then multiply that
by the number of bootstraps
# we ran. Round the result of the last step down to the nearest integer. Again,
this is to ensure this value
# maps to an actual bootstrap we ran. We round the lower bound up and the upper
bound down to reduce the
# confidence interval size, so that we can still say we have as much confidence in
this result.
#
# 6. The bootstrap values at the lower bound and upper bound give us our confidence
interval.
#
######################################
import math
import random
######################################
#