Data Algorithms with Spark

Book description

Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms, illustrated with examples in PySpark.

In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms that you can run with the PySpark driver and shell scripts.

With this book, you will:

  • Learn how to select Spark transformations for optimized solutions
  • Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions() (a short sketch follows this list)
  • Understand data partitioning for optimized queries
  • Build and apply a model using PySpark design patterns
  • Apply motif-finding algorithms to graph data
  • Analyze graph data by using the GraphFrames API
  • Apply PySpark algorithms to clinical and genomics data
  • Learn how to apply feature engineering in ML algorithms
  • Understand and use practical data design patterns
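
To give a feel for the transformations listed above, here is a minimal, self-contained PySpark sketch; the sample data, SparkSession setup, and per-key average logic are illustrative assumptions, not excerpts from the book:

    # Hypothetical example (not from the book): reduceByKey(), combineByKey(),
    # and mapPartitions() on a small pair RDD.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
    sc = spark.sparkContext

    # Sample (key, value) pairs, spread over 2 partitions -- illustrative data only.
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)], 2)

    # reduceByKey(): sum the values for each key.
    sums = pairs.reduceByKey(lambda x, y: x + y)
    print(sums.collect())  # e.g., [('a', 9), ('b', 6)]

    # combineByKey(): build (sum, count) per key, then derive the average.
    sum_count = pairs.combineByKey(
        lambda v: (v, 1),                         # createCombiner
        lambda acc, v: (acc[0] + v, acc[1] + 1),  # mergeValue
        lambda a, b: (a[0] + b[0], a[1] + b[1]),  # mergeCombiners
    )
    averages = sum_count.mapValues(lambda t: t[0] / t[1])
    print(averages.collect())  # e.g., [('a', 3.0), ('b', 3.0)]

    # mapPartitions(): process a whole partition at a time
    # (here, count the number of pairs in each partition).
    def count_partition(iterator):
        yield sum(1 for _ in iterator)

    print(pairs.mapPartitions(count_partition).collect())  # e.g., [2, 3]

    spark.stop()

reduceByKey() and combineByKey() aggregate values per key during the shuffle, while mapPartitions() hands a function an entire partition at once; the chapters on mapper transformations and reductions explore these trade-offs in depth.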

Table of contents

  1. Foreword
  2. Preface
    1. Why I Wrote This Book
    2. Who This Book Is For
    3. How This Book Is Organized
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  3. I. Fundamentals
  4. 1. Introduction to Spark and PySpark
    1. Why Spark for Data Analytics
      1. The Spark Ecosystem
      2. Spark Architecture
    2. The Power of PySpark
      1. PySpark Architecture
    3. Spark Data Abstractions
      1. RDD Examples
      2. Spark RDD Operations
      3. DataFrame Examples
    4. Using the PySpark Shell
      1. Launching the PySpark Shell
      2. Creating an RDD from a Collection
      3. Aggregating and Merging Values of Keys
      4. Filtering an RDD’s Elements
      5. Grouping Similar Keys
      6. Aggregating Values for Similar Keys
    5. ETL Example with DataFrames
      1. Extraction
      2. Transformation
      3. Loading
    6. Summary
  5. 2. Transformations in Action
    1. The DNA Base Count Example
      1. The DNA Base Count Problem
      2. FASTA Format
      3. Sample Data
    2. DNA Base Count Solution 1
      1. Step 1: Create an RDD[String] from the Input
      2. Step 2: Define a Mapper Function
      3. Step 3: Find the Frequencies of DNA Letters
      4. Pros and Cons of Solution 1
    3. DNA Base Count Solution 2
      1. Step 1: Create an RDD[String] from the Input
      2. Step 2: Define a Mapper Function
      3. Step 3: Find the Frequencies of DNA Letters
      4. Pros and Cons of Solution 2
    4. DNA Base Count Solution 3
      1. The mapPartitions() Transformation
      2. Step 1: Create an RDD[String] from the Input
      3. Step 2: Define a Function to Handle a Partition
      4. Step 3: Apply the Custom Function to Each Partition
      5. Pros and Cons of Solution 3
    5. Summary
  6. 3. Mapper Transformations
    1. Data Abstractions and Mappers
    2. What Are Transformations?
      1. Lazy Transformations
      2. The map() Transformation
      3. DataFrame Mapper
    3. The flatMap() Transformation
      1. map() Versus flatMap()
      2. Apply flatMap() to a DataFrame
    4. The mapValues() Transformation
    5. The flatMapValues() Transformation
    6. The mapPartitions() Transformation
      1. Handling Empty Partitions
      2. Benefits and Drawbacks
      3. DataFrames and mapPartitions() Transformation
    7. Summary
  7. 4. Reductions in Spark
    1. Creating Pair RDDs
    2. Reduction Transformations
    3. Spark’s Reductions
    4. Simple Warmup Example
      1. Solving with reduceByKey()
      2. Solving with groupByKey()
      3. Solving with aggregateByKey()
      4. Solving with combineByKey()
    5. What Is a Monoid?
      1. Monoid and Non-Monoid Examples
    6. The Movie Problem
      1. Input Dataset to Analyze
      2. The aggregateByKey() Transformation
      3. First Solution Using aggregateByKey()
      4. Second Solution Using aggregateByKey()
      5. Complete PySpark Solution Using groupByKey()
      6. Complete PySpark Solution Using reduceByKey()
      7. Complete PySpark Solution Using combineByKey()
    7. The Shuffle Step in Reductions
      1. Shuffle Step for groupByKey()
      2. Shuffle Step for reduceByKey()
    8. Summary
  8. II. Working with Data
  9. 5. Partitioning Data
    1. Introduction to Partitions
      1. Partitions in Spark
    2. Managing Partitions
      1. Default Partitioning
      2. Explicit Partitioning
    3. Physical Partitioning for SQL Queries
    4. Physical Partitioning of Data in Spark
      1. Partition as Text Format
      2. Partition as Parquet Format
    5. How to Query Partitioned Data
      1. Amazon Athena Example
    6. Summary
  10. 6. Graph Algorithms
    1. Introduction to Graphs
    2. The GraphFrames API
      1. How to Use GraphFrames
      2. GraphFrames Functions and Attributes
    3. GraphFrames Algorithms
      1. Finding Triangles
      2. Motif Finding
    4. Real-World Applications
      1. Gene Analysis
      2. Social Recommendations
      3. Facebook Circles
      4. Connected Components
      5. Analyzing Flight Data
    5. Summary
  11. 7. Interacting with External Data Sources
    1. Relational Databases
      1. Reading from a Database
      2. Writing a DataFrame to a Database
    2. Reading Text Files
    3. Reading and Writing CSV Files
      1. Reading CSV Files
      2. Writing CSV Files
    4. Reading and Writing JSON Files
      1. Reading JSON Files
      2. Writing JSON Files
    5. Reading from and Writing to Amazon S3
      1. Reading from Amazon S3
      2. Writing to Amazon S3
    6. Reading and Writing Hadoop Files
      1. Reading Hadoop Text Files
      2. Writing Hadoop Text Files
      3. Reading and Writing HDFS SequenceFiles
    7. Reading and Writing Parquet Files
      1. Writing Parquet Files
      2. Reading Parquet Files
    8. Reading and Writing Avro Files
      1. Reading Avro Files
      2. Writing Avro Files
    9. Reading from and Writing to MS SQL Server
      1. Writing to MS SQL Server
      2. Reading from MS SQL Server
    10. Reading Image Files
      1. Creating a DataFrame from Images
    11. Summary
  12. 8. Ranking Algorithms
    1. Rank Product
      1. Calculation of the Rank Product
      2. Formalizing Rank Product
      3. Rank Product Example
      4. PySpark Solution
    2. PageRank
      1. PageRank’s Iterative Computation
      2. Custom PageRank in PySpark Using RDDs
      3. Custom PageRank in PySpark Using an Adjacency Matrix
      4. PageRank with GraphFrames
    3. Summary
  13. III. Data Design Patterns
  14. 9. Classic Data Design Patterns
    1. Input-Map-Output
      1. RDD Solution
      2. DataFrame Solution
      3. Flat Mapper Functionality
    2. Input-Filter-Output
      1. RDD Solution
      2. DataFrame Solution
      3. DataFrame Filter
    3. Input-Map-Reduce-Output
      1. RDD Solution
      2. DataFrame Solution
    4. Input-Multiple-Maps-Reduce-Output
      1. RDD Solution
      2. DataFrame Solution
    5. Input-Map-Combiner-Reduce-Output
    6. Input-MapPartitions-Reduce-Output
    7. Inverted Index
      1. Problem Statement
      2. Input
      3. Output
      4. PySpark Solution
    8. Summary
  15. 10. Practical Data Design Patterns
    1. In-Mapper Combining
      1. Basic MapReduce Algorithm
      2. In-Mapper Combining per Record
      3. In-Mapper Combining per Partition
    2. Top-10
      1. Top-N Formalized
      2. PySpark Solution
      3. Finding the Bottom 10
    3. MinMax
      1. Solution 1: Classic MapReduce
      2. Solution 2: Sorting
      3. Solution 3: Spark’s mapPartitions()
    4. The Composite Pattern and Monoids
      1. Monoids
      2. Monoidal and Non-Monoidal Examples
      3. Non-Monoid MapReduce Example
      4. Monoid MapReduce Example
      5. PySpark Implementation of Monoidal Mean
      6. Functors and Monoids
      7. Conclusion on Using Monoids
    5. Binning
    6. Sorting
    7. Summary
  16. 11. Join Design Patterns
    1. Introduction to the Join Operation
    2. Join in MapReduce
      1. Map Phase
      2. Reducer Phase
      3. Implementation in PySpark
    3. Map-Side Join Using RDDs
    4. Map-Side Join Using DataFrames
      1. Step 1: Create Cache for Airports
      2. Step 2: Create Cache for Airlines
      3. Step 3: Create Facts Table
      4. Step 4: Apply Map-Side Join
    5. Efficient Joins Using Bloom Filters
      1. Introduction to Bloom Filters
      2. A Simple Bloom Filter Example
      3. Bloom Filters in Python
      4. Using Bloom Filters in PySpark
    6. Summary
  17. 12. Feature Engineering in PySpark
    1. Introduction to Feature Engineering
    2. Adding New Features
    3. Applying UDFs
    4. Creating Pipelines
    5. Binarizing Data
    6. Imputation
    7. Tokenization
      1. Tokenizer
      2. RegexTokenizer
      3. Tokenization with a Pipeline
    8. Standardization
    9. Normalization
      1. Scaling a Column Using a Pipeline
      2. Using MinMaxScaler on Multiple Columns
      3. Normalization Using Normalizer
    10. String Indexing
      1. Applying StringIndexer to a Single Column
      2. Applying StringIndexer to Several Columns
    11. Vector Assembly
    12. Bucketing
      1. Bucketizer
      2. QuantileDiscretizer
    13. Logarithm Transformation
    14. One-Hot Encoding
    15. TF-IDF
    16. FeatureHasher
    17. SQLTransformer
    18. Summary
  18. Index
  19. About the Author

Product information

  • Title: Data Algorithms with Spark
  • Author(s): Mahmoud Parsian
  • Release date: April 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492082385