Chapter 26. Gene Aggregation

This chapter provides four distinct solutions to gene aggregation (also known as marker frequency in clinical applications), implemented in MapReduce/Hadoop and Spark. The input to gene aggregation is a set of patients’ biosets. As discussed in previous chapters, a bioset, also called a gene signature, encompasses data in the form of experimental sample comparisons (for transcriptomic, epigenetic, and copy-number variation data), as well as genotype signatures (for GWAS and mutational data). In simple terms, a bioset is a list of key-value pairs, where the key is a geneID and the value is a list of associated attributes. In clinical applications, gene aggregation is used to identify transcriptional signatures and patterns in gene expression data, and to see how genes group together and how that grouping affects the overall analysis. Gene aggregation is an evolutionary method and depends on chromosomal folding and higher-order structures.
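As a rough illustration of the key-value view of a bioset, the following Python sketch models one bioset as a list of (geneID, attributes) pairs. The gene IDs, the attribute layout (a reference type followed by a gene value), and the helper name `gene_ids` are hypothetical, chosen only for illustration; the record format used by the chapter's solutions may differ.

```python
from typing import List, Tuple

# One bioset record: (geneID, list of associated attributes).
# The attribute layout [reference_type, gene_value] is an assumption.
BiosetRecord = Tuple[str, List[str]]

# A hypothetical bioset with made-up gene IDs and values.
patient_bioset: List[BiosetRecord] = [
    ("GENE1", ["r1", "1.50"]),
    ("GENE2", ["r2", "-0.75"]),
    ("GENE3", ["r1", "0.20"]),
]


def gene_ids(bioset: List[BiosetRecord]) -> List[str]:
    """Return the keys (geneIDs) of a bioset in their original order."""
    return [gene_id for gene_id, _attributes in bioset]


print(gene_ids(patient_bioset))
```

A list of pairs is used here rather than a dictionary because the text describes a bioset as a list of key-value pairs and does not say whether a geneID can appear more than once.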

Gene aggregation is achieved through three metrics:

  • Reference type refers to the type of patient data:

    • r1 = normal

    • r2 = disease

    • r3 = paired

    • r4 = unknown

  • Gene filter type refers to the type of filter applied to the data. The filter type indicates how gene values will be grouped and analyzed. For example, if the filter type is up, then only gene values that are greater than the filter value threshold will be considered for further analysis. There are three gene filter types:

    • Absolute value (abs)
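To make the reference type codes and the up filter concrete, here is a minimal Python sketch. The `REFERENCE_TYPES` mapping follows the codes listed above. The function names `passes_up_filter` and `filter_up`, the sample values, and the threshold of 1.0 are assumptions made for illustration. Only the up filter is shown, since it is the one whose behavior the text spells out.

```python
from typing import Dict, List

# Reference type codes as listed in the text.
REFERENCE_TYPES: Dict[str, str] = {
    "r1": "normal",
    "r2": "disease",
    "r3": "paired",
    "r4": "unknown",
}


def passes_up_filter(gene_value: float, threshold: float) -> bool:
    """Up filter: keep a gene value only if it is greater than the threshold."""
    return gene_value > threshold


def filter_up(values: List[float], threshold: float) -> List[float]:
    """Return the gene values that pass the up filter, preserving order."""
    return [value for value in values if passes_up_filter(value, threshold)]


# Hypothetical gene values and threshold, for illustration only.
sample_values = [0.4, 1.0, 1.7, -2.3, 2.5]
kept = filter_up(sample_values, threshold=1.0)
print(kept)
```

Note that a value equal to the threshold is dropped, because the text specifies values greater than the threshold.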
