Chapter 14. Naive Bayes
In data mining and machine learning, there are many classification algorithms. One of the simplest yet most effective is the Naive Bayes classifier (NBC). The main focus of this chapter is to present a distributed MapReduce implementation (using Spark) of the NBC, which is both a supervised learning method and a probabilistic classifier. Naive Bayes is a linear classifier; to understand it, we need some basic and conditional probability. When dealing with purely numeric data, distance-based methods such as K-Means clustering or the k-Nearest Neighbors classifier are often a better fit, but for classifying names, symbols, emails, and text, a probabilistic method such as the NBC may be preferable. In some cases, the NBC is used to classify numeric data as well. In the following section, you will see examples of both symbolic and numeric data.
The NBC is a probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. In a nutshell, an NBC assigns inputs into one of the k classes {C1, C2, ..., Ck} based on some properties (features) of the inputs. NBCs have applications such as email spam filtering and document classification.
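In symbols, Bayes' theorem gives the posterior probability of class C<sub>i</sub> given an input with features x<sub>1</sub>, ..., x<sub>n</sub> (the feature notation here is ours, added for illustration):

```latex
P(C_i \mid x_1, \dots, x_n) = \frac{P(C_i)\, P(x_1, \dots, x_n \mid C_i)}{P(x_1, \dots, x_n)}
```

The "naive" independence assumption factors the likelihood into a product of per-feature terms, and since the denominator is the same for every class, the classifier simply picks the class maximizing:

```latex
P(C_i \mid x_1, \dots, x_n) \;\propto\; P(C_i) \prod_{j=1}^{n} P(x_j \mid C_i)
```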
For example, a spam filter using a Naive Bayes classifier will assign each email to one of two classes: spam or not spam. Since Naive Bayes is a supervised learning method, it has two distinct stages:
- Stage 1: Training (see Figure 14-1)
- This stage ...
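The two-stage workflow (train, then classify) can be sketched in pure Python. This is a minimal, non-distributed multinomial model with made-up training data, shown only to make the probability model concrete; the chapter's actual implementation runs on Spark:

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (label, word_list) pairs.
    Returns priors P(C) and Laplace-smoothed likelihoods P(word | C)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in labeled_docs:
        class_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    total_docs = sum(class_counts.values())
    priors = {c: n / total_docs for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        # Add-one (Laplace) smoothing avoids zero probabilities
        # for words unseen in a given class.
        likelihoods[c] = {w: (word_counts[c][w] + 1) / (total_words + len(vocab))
                          for w in vocab}
    return priors, likelihoods, vocab

def classify(words, priors, likelihoods, vocab):
    """Naive Bayes rule: argmax over C of log P(C) + sum of log P(w | C).
    Logs are used so small probabilities don't underflow."""
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for w in words:
            if w in vocab:  # ignore words never seen in training
                score += math.log(likelihoods[c][w])
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy training data, not from the chapter.
training = [
    ("spam",     ["win", "money", "now"]),
    ("spam",     ["win", "prize", "money"]),
    ("not_spam", ["meeting", "schedule", "monday"]),
    ("not_spam", ["project", "schedule", "review"]),
]
priors, likelihoods, vocab = train(training)
print(classify(["win", "money"], priors, likelihoods, vocab))       # spam
print(classify(["project", "meeting"], priors, likelihoods, vocab))  # not_spam
```

In a distributed setting, the counting in `train` is exactly the kind of aggregation MapReduce handles well, which is why Naive Bayes parallelizes so naturally.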