Data Algorithms

by Mahmoud Parsian

July 2015

Intermediate to advanced

778 pages

17h 9m

English

O'Reilly Media, Inc.

Appendix A. Bioset

Biosets (also called gene signatures¹ or assays²) encompass data in the form of experimental sample comparisons (for transcriptomic, epigenetic, and copy-number variation data), as well as genotype signatures (for genome-wide association study [GWAS] and mutational data).

A bioset has an associated data type, which can be gene expression, protein expression, methylation, copy-number variation, miRNA, or somatic mutation. Also, each bioset entry/record has an associated reference type, which can be r1=normal, r2=disease, r3=paired, or r4=unknown. Note that a reference type does not apply to the somatic mutation data type.
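The relationship described above can be sketched as a small data model. This is only an illustrative reading of the paragraph, not code from the book: the class names `DataType`, `ReferenceType`, `BiosetRecord`, and `Bioset`, and the `add()` method, are hypothetical, and the validation rule assumes that "does not apply" means somatic mutation records carry no reference type while all other records must carry one.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataType(Enum):
    """Bioset data types listed in the text."""
    GENE_EXPRESSION = "gene expression"
    PROTEIN_EXPRESSION = "protein expression"
    METHYLATION = "methylation"
    COPY_NUMBER_VARIATION = "copy-number variation"
    MIRNA = "miRNA"
    SOMATIC_MUTATION = "somatic mutation"


class ReferenceType(Enum):
    """Reference types r1..r4 listed in the text."""
    R1 = "normal"
    R2 = "disease"
    R3 = "paired"
    R4 = "unknown"


@dataclass(frozen=True)
class BiosetRecord:
    # Hypothetical record: only the reference type is modeled here,
    # because the text does not describe the other fields.
    reference_type: Optional[ReferenceType] = None


@dataclass
class Bioset:
    data_type: DataType
    records: List[BiosetRecord] = field(default_factory=list)

    def add(self, record: BiosetRecord) -> None:
        """Append a record, enforcing the assumed reference-type rule."""
        is_somatic = self.data_type is DataType.SOMATIC_MUTATION
        if is_somatic and record.reference_type is not None:
            raise ValueError("somatic mutation records take no reference type")
        if not is_somatic and record.reference_type is None:
            raise ValueError("this data type requires a reference type")
        self.records.append(record)


# Usage: a gene expression bioset with one disease-reference record.
expression = Bioset(DataType.GENE_EXPRESSION)
expression.add(BiosetRecord(ReferenceType.R2))

# Usage: a somatic mutation bioset, whose records have no reference type.
mutations = Bioset(DataType.SOMATIC_MUTATION)
mutations.add(BiosetRecord())
```

Keeping the data type on the bioset and the reference type on each record mirrors the wording of the paragraph; a real implementation would add the gene identifiers and measured values that each record holds.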

The number of entries/records per bioset depends on its data type (see Table A-1).

Table A-1. Number of records per bioset data type
Bioset data type	Number of entries/records
Somatic mutation	3,000–20,000
Methylation	30,000
Gene expression	50,000
Copy-number variation	40,000
Germline	4,300,000
Protein expression	30,000
miRNA	30,000

¹ A gene signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or medical condition. The phenotypes that may theoretically be defined by a gene expression signature range from those used to differentiate between subtypes of a disease, to those that predict the survival or prognosis of an individual with a disease, to those that predict activation of a particular pathway. Ideally, gene signatures can be used ...