If you are ready to dive into the MapReduce framework for processing large datasets, this practical book takes you step by step through the algorithms and tools you need to build distributed MapReduce applications with Apache Hadoop or Apache Spark. Each chapter provides a recipe for solving a massive computational problem, such as building a recommendation system. You’ll learn how to implement the appropriate MapReduce solution with code that you can use in your projects.
Dr. Mahmoud Parsian covers basic design patterns, optimization techniques, and data mining and machine learning solutions for problems in bioinformatics, genomics, statistics, and social network analysis. This book also includes an overview of MapReduce, Hadoop, and Spark.
Table of contents
 Foreword

Preface
 What Is MapReduce?
 Hadoop and Spark
 What Is in This Book?
 What Is the Focus of This Book?
 Who Is This Book For?
 Online Resources
 What Software Is Used in This Book?
 Conventions Used in This Book
 Using Code Examples
 Safari® Books Online
 How to Contact Us
 Acknowledgments
 Comments and Questions for This Book
 1. Secondary Sort: Introduction
 2. Secondary Sort: A Detailed Example
 3. Top 10 List
 4. Left Outer Join
 5. Order Inversion
 6. Moving Average
 7. Market Basket Analysis
 8. Common Friends
 9. Recommendation Engines Using MapReduce
 10. Content-Based Recommendation: Movies
 11. Smarter Email Marketing with the Markov Model
 12. K-Means Clustering
 13. k-Nearest Neighbors
 14. Naive Bayes
 15. Sentiment Analysis
 16. Finding, Counting, and Listing All Triangles in Large Graphs
 17. K-mer Counting
 18. DNA Sequencing
 19. Cox Regression
 20. Cochran-Armitage Test for Trend
 21. Allelic Frequency
 22. The T-Test

23. Pearson Correlation
 Pearson Correlation Formula
 Pearson Correlation Example
 Data Set for Pearson Correlation
 POJO Solution for Pearson Correlation
 POJO Solution Test Drive
 MapReduce Solution for Pearson Correlation
 Hadoop Implementation Classes

Spark Solution for Pearson Correlation
 Input
 Output
 Spark Solution
 High-Level Steps
 Step 1: Import required classes and interfaces
 smaller() method
 MutableDouble class
 toMap() method
 toListOfString() method
 readBiosets() method
 Step 2: Handle input parameters
 Step 3: Create a Spark context object
 Step 4: Create list of input files/biomarkers
 Step 5: Broadcast reference as global shared object
 Step 6: Read all biomarkers from HDFS and create the first RDD
 Step 7: Filter biomarkers by reference
 Step 8: Create (GeneID, (PatientID, GeneValue)) pairs
 Step 9: Group by gene
 Step 10: Create Cartesian product of all genes
 Step 11: Filter redundant pairs of genes
 Step 12: Calculate Pearson correlation and p-value
 Pearson Correlation Wrapper Class
 Testing the Pearson Class
 Pearson Correlation Using R
 YARN Script to Run Spark Program
 Spearman Correlation Using Spark
 24. DNA Base Count
 25. RNA Sequencing
 26. Gene Aggregation
 27. Linear Regression

28. MapReduce and Monoids
 Introduction
 Definition of Monoid

Monoidic and Non-Monoidic Examples
 Maximum over a Set of Integers
 Subtraction over a Set of Integers
 Addition over a Set of Integers
 Multiplication over a Set of Integers
 Mean over a Set of Integers
 Non-Commutative Example
 Median over a Set of Integers
 Concatenation over Lists
 Union/Intersection over Integers
 Functional Example
 Matrix Example
 MapReduce Example: Not a Monoid
 MapReduce Example: Monoid
 Spark Example Using Monoids
 Conclusion on Using Monoids
 Functors and Monoids
 29. The Small Files Problem
 30. Huge Cache for MapReduce
 31. The Bloom Filter
 A. Bioset

B. Spark RDDs
 Spark Operations
 Tuple<N>

RDDs
 How to Create RDDs
 Creating RDDs Using Collection Objects
 Collecting Elements of an RDD
 Transforming an Existing RDD into a New RDD
 Creating RDDs by Reading Files
 Grouping by Key
 Mapping Values
 Reducing by Key
 Combining by Key
 Filtering an RDD
 Saving an RDD as an HDFS Text File
 Saving an RDD as an HDFS Sequence File
 Reading an RDD from an HDFS Sequence File
 Counting RDD Items
 Spark RDD Examples in Scala
 PySpark Examples
 How to Package and Run Spark Jobs
 Creating the JAR for Data Algorithms
 Running a Job in a Spark Cluster
 Running a Job in Hadoop’s YARN Environment
 Bibliography
 Index
 Title: Data Algorithms
 Author(s): Mahmoud Parsian
 Release date: July 2015
 Publisher(s): O'Reilly Media, Inc.
 ISBN: 9781491906187
