Chapter 21. Allelic Frequency

Allelic frequency analysis is a technique for finding the frequency of alleles in genomic data (especially germline data). An allelic frequency is defined as “the percentage of a population of a species that carries a particular allele on a given chromosome locus.” In this chapter, we’ll develop a MapReduce solution that aggregates all genomic data for each desired key, composed of [chromosome, start-position, stop-position], and then applies Fisher’s Exact Test, a statistical test that determines whether there are nonrandom associations between two groups of variables (here, the two groups can be patient biosets, which will be discussed shortly). We will then analyze and plot the output of the MapReduce program.

The input for the allelic frequency calculation comes from VCF files generated by DNA sequencing pipelines. Typically, each VCF record includes the chromosome, start-position, stop-position, genome-reference, and two alleles (labeled allele1 and allele2, one from the mother and one from the father). This information is sufficient for us to perform an allelic frequency analysis for two sets of data.
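To make the aggregation step concrete, here is a minimal Python sketch of grouping allele observations by the [chromosome, start-position, stop-position] key. It is not the chapter's MapReduce implementation: the record layout, the sample values, and the function names (map_record, reduce_alleles, allele_frequencies) are all assumptions made for illustration, and the map and reduce phases are simulated in memory rather than run on a cluster.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified records (not real VCF lines):
# (chromosome, start, stop, reference, allele1, allele2)
records = [
    ("chr7", 100, 100, "A", "A", "G"),
    ("chr7", 100, 100, "A", "G", "G"),
    ("chr7", 100, 100, "A", "A", "A"),
    ("chr9", 250, 250, "C", "C", "T"),
]


def map_record(record):
    """Map phase: emit one (key, allele) pair for each of the two alleles."""
    chromosome, start, stop, _reference, allele1, allele2 = record
    key = (chromosome, start, stop)
    yield key, allele1
    yield key, allele2


def reduce_alleles(pairs):
    """Reduce phase: count how often each allele occurs under each key."""
    counts = defaultdict(Counter)
    for key, allele in pairs:
        counts[key][allele] += 1
    return counts


def allele_frequencies(counts):
    """Turn per-key allele counts into per-key relative frequencies."""
    frequencies = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        frequencies[key] = {allele: n / total for allele, n in counter.items()}
    return frequencies


pairs = [pair for record in records for pair in map_record(record)]
counts = reduce_alleles(pairs)
freqs = allele_frequencies(counts)
```

In this toy input, the key ("chr7", 100, 100) collects six allele observations from three records, so each observed allele's frequency is its count divided by six.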

The main goal of this chapter is to present a MapReduce solution, comprising three MapReduce jobs, for calculating allelic frequency using Fisher’s Exact Test.
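For readers who have not seen Fisher’s Exact Test, the following Python sketch computes a two-sided p-value for a 2×2 contingency table from the hypergeometric distribution. How the chapter builds its table is not shown here, so the layout is an assumption: rows stand for two biosets and columns for counts of one allele versus all other alleles. The function name fisher_exact_two_sided and the sample counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (illustrative only): rows are two biosets, columns are
    counts of a given allele versus all other alleles.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table at least as extreme as the
    # observed one; the small tolerance guards against float rounding.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))  # 0.4857
```

A small p-value would suggest that the allele counts differ between the two groups more than chance alone would explain; the large value in this toy table gives no such evidence.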

To comprehend the importance and the impact of allelic frequency, you must first understand the meaning of mutations, migrations, and selections. For details on these concepts, see the ...
