Chapter 22. The T-Test

The t-test (here, the two-sample t-test) is used in clinical applications and genome analysis to test statistical hypotheses. The t-test for independent samples compares the means (μ, also known as the averages) of two samples. To compare two data sets in statistics, we reduce each one to a simpler summary, such as its mean, and then compare the summaries. Since we are comparing random samples, there is room for random error, usually quantified by the standard deviation, σ. The standard deviation of a population of N values is defined as:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

where:

  • σ = the standard deviation
  • X_i = the ith value in the population
  • μ = the mean of the values in the population
  • N = the number of values in the population
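
As a quick check on the definition above, the following is a minimal Python sketch of the population standard deviation. The function name `population_std` and the sample values are illustrative assumptions, not taken from the book. Note that it divides by N, as in the formula, rather than by N - 1 as a sample estimate would.

```python
import math

def population_std(values):
    """Population standard deviation: sqrt(sum((x - mu)^2) / N)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mu = sum(values) / n  # population mean
    squared_deviations = sum((x - mu) ** 2 for x in values)
    return math.sqrt(squared_deviations / n)  # divide by N, not N - 1

# Hypothetical data: mean is 5, squared deviations sum to 32, so sigma is 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = population_std(data)
print(sigma)
```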

Once random error is factored in, therefore, we are effectively comparing μ ± σ for each sample rather than the bare means. According to Sarah Boslaugh’s book Statistics in a Nutshell (O’Reilly), “The purpose of [the t-test] is to determine whether the means of the populations from which the samples were drawn are the same. The subjects in the two samples are assumed to be unrelated and to have been independently selected from their populations.”
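
To make the comparison of two means concrete, here is a minimal Python sketch of one common form of the two-sample t statistic, the pooled-variance (Student's) version, which assumes the two populations have equal variances. This is an illustrative assumption and not necessarily the variant the chapter's MapReduce and Spark solutions implement. The function names and data are hypothetical, and the sketch stops at the t statistic and degrees of freedom; turning these into a p-value requires the t distribution, which is not implemented here.

```python
import math

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased sample variance (divides by n - 1)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pooled_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two independent samples,
    assuming equal population variances."""
    n1, n2 = len(a), len(b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two values")
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * sample_variance(a)
                  + (n2 - 1) * sample_variance(b)) / df
    std_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    if std_error == 0.0:
        raise ValueError("both samples are constant; t is undefined")
    t = (sample_mean(a) - sample_mean(b)) / std_error
    return t, df

# Hypothetical samples: means 3 and 4, both with sample variance 2.5.
t, df = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t, df)
```

A value of t near zero, as in this small example, gives little evidence that the underlying population means differ; larger absolute values give more.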

This chapter provides MapReduce/Hadoop and Spark solutions for the t-test. The MapReduce algorithm presented here is generic and can be applied to large volumes of data.

Performing the T-Test on Biosets

In genome analysis and especially in somatic mutations, the t-test ...
