Chapter 4. Reductions in Spark
This chapter focuses on reduction transformations on RDDs in Spark. In particular, we’ll work with RDDs of (key, value) pairs, a common data abstraction required by many Spark operations. Some initial ETL may be needed to get your data into (key, value) form, but once you have a pair RDD you can perform any desired aggregation over each key’s set of values.
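As a minimal sketch of that ETL step, the following PySpark snippet parses raw text lines into a pair RDD. The input file name, the "key,value" record format, and the to_pair() helper are hypothetical, chosen only for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pair-rdd-etl").getOrCreate()
sc = spark.sparkContext

def to_pair(line):
    # Hypothetical record format: "key,value" (e.g., "A,10").
    key, value = line.split(",")
    return (key, int(value))

# ETL step: raw text lines -> RDD[(K, V)]
pairs = sc.textFile("sample_input.txt").map(to_pair)
```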
Spark supports several powerful reduction transformations and actions. The most important reduction transformations are:
- reduceByKey()
- combineByKey()
- groupByKey()
- aggregateByKey()
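To make the four transformations concrete, here is a sketch that computes a per-key sum four different ways over a small in-memory pair RDD. The sample data is illustrative, not from the text:

```python
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("byKey-sketch").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("A", 1), ("A", 2), ("B", 3), ("B", 4)])

# reduceByKey(): merge values per key with a commutative, associative function
sums1 = pairs.reduceByKey(add)

# groupByKey(): collect all values per key, then reduce them yourself
sums2 = pairs.groupByKey().mapValues(sum)

# aggregateByKey(): a zero value plus within- and cross-partition functions
sums3 = pairs.aggregateByKey(0, add, add)

# combineByKey(): the most general form, taking createCombiner,
# mergeValue, and mergeCombiners functions
sums4 = pairs.combineByKey(lambda v: v, add, add)

# All four yield [('A', 3), ('B', 7)]
print(sorted(sums1.collect()))
```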
All of the *ByKey() transformations accept a source RDD[(K, V)] and create a target RDD[(K, C)] (for some transformations, such as reduceByKey(), V and C are the same). The function of these transformations is to reduce all the values of a given key (for all unique keys) by finding, for example:
- The average of all values
- The sum and count of all values
- The mode and median of all values
- The standard deviation of all values
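For instance, computing the per-key average shows the RDD[(K, V)] to RDD[(K, C)] pattern in the case where C differs from V: here V is a number and C is a (sum, count) pair. A minimal sketch, with illustrative sample data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avg-by-key").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("A", 1.0), ("A", 3.0), ("B", 10.0)])

# V is a float; C is a (sum, count) tuple.
sum_count = pairs.aggregateByKey(
    (0.0, 0),                                      # zero value for C
    lambda c, v: (c[0] + v, c[1] + 1),             # fold a V into a C (within a partition)
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1])  # merge two Cs (across partitions)
)

averages = sum_count.mapValues(lambda c: c[0] / c[1])
print(sorted(averages.collect()))  # [('A', 2.0), ('B', 10.0)]
```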
Reduction Transformation Selection
As with mapper transformations, it’s important to select the right tool for the job. For some reduction operations (such as finding the median), the reducer needs access to all the values at the same time. For others, such as finding the sum or count of all values, it doesn’t. If you want to find the median of values per key, then groupByKey() is a good choice, but this transformation does not perform well when a key has a large number of values ...
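Since the median requires all of a key’s values at once, a minimal sketch of the groupByKey() approach might look like the following (sample data illustrative). Note the caveat above: every value for a key is materialized, which can exhaust memory for high-frequency keys:

```python
import statistics
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("median-by-key").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("A", 1), ("A", 5), ("A", 9), ("B", 2), ("B", 4)])

# groupByKey() gathers every value for a key, which the median needs,
# but which is costly when a key has very many values.
medians = pairs.groupByKey().mapValues(lambda vs: statistics.median(vs))
print(sorted(medians.collect()))  # [('A', 5), ('B', 3.0)]
```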