Chapter 9. Classic Data Design Patterns

This chapter discusses some of the most fundamental and classic data design patterns used in the vast majority of big data solutions. Even though these are simple design patterns, they are useful in solving many common data problems, and I’ve used many of them in examples in this book. In this chapter, I will present PySpark implementations of the following design patterns:

  1. Input-Map-Output

  2. Input-Filter-Output

  3. Input-Map-Reduce-Output

  4. Input-Multiple-Maps-Reduce-Output

  5. Input-Map-Combiner-Reduce-Output

  6. Input-MapPartitions-Reduce-Output

  7. Input-Inverted-Index-Pattern-Output
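
To make these patterns concrete, here is a minimal sketch of the third pattern, Input-Map-Reduce-Output, written as a word count in plain Python so the data flow is easy to follow. The sample records and variable names are illustrative only; comments note the PySpark operations (`textFile()`, `flatMap()`, `map()`, `reduceByKey()`) that play each role in a real Spark job.

```python
from functools import reduce
from itertools import groupby

# Input: a small set of records
# (in PySpark: rdd = spark.sparkContext.textFile(input_path))
records = ["fox jumped", "fox ran", "dog ran"]

# Map: emit (key, value) pairs for each record
# (in PySpark: rdd.flatMap(lambda line: line.split()).map(lambda w: (w, 1)))
pairs = [(word, 1) for line in records for word in line.split()]

# Reduce: group pairs by key and sum the values per key
# (in PySpark: pairs_rdd.reduceByKey(lambda a, b: a + b))
pairs.sort(key=lambda kv: kv[0])
counts = {
    key: reduce(lambda a, b: a + b, (v for _, v in group))
    for key, group in groupby(pairs, key=lambda kv: kv[0])
}

# Output: the aggregated result
# (in PySpark: result_rdd.collect() or result_rdd.saveAsTextFile(output_path))
print(counts)  # {'dog': 1, 'fox': 2, 'jumped': 1, 'ran': 2}
```

In Spark, the shuffle performed by `reduceByKey()` replaces the explicit sort-and-group step shown here, and the reduction runs in parallel across partitions.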

Before we get started, however, I’d like to address the question of what I mean by “design patterns.” In computer science and software engineering, a design pattern is a reusable solution to a commonly occurring problem. It’s a template or best practice for solving that class of problem, not a finished design that can be transformed directly into code. The patterns presented in this chapter will equip you to handle a wide range of data analysis tasks.

Note

The data design patterns discussed in this chapter are basic patterns. You can create your own, depending on your requirements. For additional examples, see “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat.
