Chapter 22. The T-Test

The t-test (in the form covered here, the two-sample t-test) is used in clinical applications and genome analysis to test statistical hypotheses. The t-test for independent samples compares the means (μ, also known as the averages) of two samples. In statistics, to compare two data sets, we reduce each one to a simpler summary, such as its mean, and then compare those summaries. Because we are comparing random samples, there is room for random error, which is usually quantified by the standard deviation, 𝜎. The standard deviation of a population of N values is defined as:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

where:

𝜎 = the standard deviation of the population
X_i = the i-th value in the population
𝜇 = the mean of the values in the population
N = the number of values in the population
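
As a quick check on the definition above, the following minimal Python sketch computes the population standard deviation directly from the formula and compares it with the standard library's `statistics.pstdev`. The function name `population_std` and the sample values are made up for illustration and do not come from the book.

```python
import math
import statistics


def population_std(values):
    """Population standard deviation: sqrt(sum((X_i - mu)^2) / N)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mu = sum(values) / n  # population mean
    squared_deviations = sum((x - mu) ** 2 for x in values)
    return math.sqrt(squared_deviations / n)


# Made-up illustrative data (not from the book).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sigma = population_std(values)
print(sigma)                       # hand-rolled formula
print(statistics.pstdev(values))   # standard-library cross-check
```

Note that this divides by N, which matches the population formula above. An estimate computed from a sample normally divides by N - 1 instead (`statistics.stdev`).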

Once random error is factored in, therefore, we are effectively comparing μ ± σ for each sample rather than the means alone. According to Sarah Boslaugh’s book Statistics in a Nutshell (O’Reilly), “The purpose of [the t-test] is to determine whether the means of the populations from which the samples were drawn are the same. The subjects in the two samples are assumed to be unrelated and to have been independently selected from their populations.”
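
To make the comparison of two means concrete, here is a minimal Python sketch of a two-sample t statistic. It assumes Welch's form, which does not require the two samples to have equal variances; this excerpt does not say which variant the chapter uses, so treat that choice, the function name `welch_t`, and the sample data as illustrative assumptions.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's two-sample t-test on independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")

    # Squared standard error of each sample mean (sample variance uses n - 1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    if se2_a + se2_b == 0:
        raise ValueError("both samples have zero variance")

    # Difference of means, scaled by its estimated standard error.
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)

    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Made-up illustrative samples (not from the book).
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t, df = welch_t(sample_a, sample_b)
print(t, df)
```

The sketch stops at the statistic and the degrees of freedom. Turning them into a p-value requires the CDF of Student's t-distribution, which the Python standard library does not provide.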

This chapter provides MapReduce/Hadoop and Spark solutions for the t-test. The MapReduce algorithm presented here is generic and can be applied to large volumes of data.
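
The chapter's own MapReduce and Spark code is not part of this excerpt. Purely as an illustration of why a t-test fits that style of processing, the Python sketch below reduces each partition of the data to a small summary (count, mean, sum of squared deviations) and then merges the summaries pairwise, so no single worker needs to see all of the values. The helper names `summarize` and `merge` and the partitioned data are hypothetical, and no Hadoop or Spark API is used.

```python
from functools import reduce


def summarize(partition):
    """Map step: reduce one partition to (count, mean, sum of squared deviations)."""
    n = len(partition)
    if n == 0:
        return (0, 0.0, 0.0)
    mu = sum(partition) / n
    m2 = sum((x - mu) ** 2 for x in partition)
    return (n, mu, m2)


def merge(a, b):
    """Reduce step: combine two partition summaries into one."""
    n_a, mu_a, m2_a = a
    n_b, mu_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mu_b - mu_a
    mu = mu_a + delta * n_b / n
    # The cross term accounts for the gap between the two partition means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mu, m2)


# Made-up data, split into partitions as a cluster might hold it.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0]]

n, mu, m2 = reduce(merge, map(summarize, partitions))
sample_variance = m2 / (n - 1)
print(n, mu, sample_variance)
```

Because `merge` is associative, the summaries can be combined in any grouping. Running this once per sample yields the count, mean, and variance that a two-sample t statistic needs.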

Performing the T-Test on Biosets

In genome analysis and especially in somatic mutations, the t-test ...

Data Algorithms by Mahmoud Parsian
