The t-test (here, the two-sample t-test for independent samples) is used in clinical applications and genome analysis to test statistical hypotheses. The test compares the means (μ, also known as the averages) of two samples. In statistics, to compare two data sets, we reduce each one to a simpler summary, such as its mean, and then compare those summaries. Since we are comparing random samples, there is room for random error, which is usually quantified by the standard deviation, σ. The standard deviation of a population of N values is defined as:
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
where x_i is the i-th value and μ is the population mean.
Once random error is factored in, therefore, we are in effect comparing μ ± σ for each sample rather than the bare means. According to Sarah Boslaugh’s book Statistics in a Nutshell (O’Reilly), “The purpose of [the t-test] is to determine whether the means of the populations from which the samples were drawn are the same. The subjects in the two samples are assumed to be unrelated and to have been independently selected from their populations.”
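To make the comparison of two means concrete, here is a minimal Python sketch of a two-sample t-statistic. The text above does not say which variant of the test is intended, so this sketch assumes the unequal-variance (Welch) form. The function name `welch_t_statistic` and the sample values are hypothetical, chosen only for illustration.

```python
import math
import statistics


def welch_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for two independent samples.

    Assumption: the unequal-variance (Welch) form of the two-sample t-test.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance uses the n - 1 (sample) denominator.
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)

    # Squared standard error contributed by each sample mean.
    se_a, se_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


if __name__ == "__main__":
    group_a = [5.1, 4.9, 5.6, 5.8, 6.0]  # made-up measurements
    group_b = [4.1, 4.4, 4.0, 4.8, 4.3]  # made-up measurements
    t, df = welch_t_statistic(group_a, group_b)
    print(f"t = {t:.3f}, df = {df:.1f}")
```

A large absolute value of t suggests that the difference between the sample means is large relative to the random error in each sample. Turning t into a p-value requires the t-distribution with `df` degrees of freedom, which this sketch leaves out.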
This chapter provides MapReduce/Hadoop and Spark solutions for the t-test. The MapReduce algorithm presented here is generic and can be applied to high volumes of data.
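One reason a t-test suits the MapReduce model is that the quantities it needs (count, mean, and variance) can be built from partial sums that merge in any order. The following is a hedged sketch of that idea in plain Python, not necessarily the algorithm this chapter develops. It assumes the unequal-variance (Welch) form of the statistic, and all function names are hypothetical.

```python
import math
from functools import reduce


def to_partial(value):
    """Map step: turn one observation into (count, sum, sum of squares)."""
    return (1, value, value * value)


def combine(left, right):
    """Reduce step: merge two partial summaries (associative, commutative)."""
    return (left[0] + right[0], left[1] + right[1], left[2] + right[2])


def summarize(values):
    """Return (n, mean, sample variance) from the reduced summary."""
    n, total, total_sq = reduce(combine, map(to_partial, values))
    mean = total / n
    # Sample variance (n - 1 denominator) from the sum of squares.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return n, mean, variance


def t_from_summaries(summary_a, summary_b):
    """Welch t-statistic from two (n, mean, variance) summaries (assumed form)."""
    (n_a, mean_a, var_a), (n_b, mean_b, var_b) = summary_a, summary_b
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
```

Because `combine` is associative and commutative, the same merge could run as a combiner or reducer over partitions of a large data set. The sum-of-squares formula for the variance is simple but can lose precision when values are large relative to their spread, so a production job might prefer a more numerically stable merge.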
In genome analysis and especially in somatic mutations, the t-test ...