Data Algorithms with Spark

Book description

Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms, illustrated with examples in PySpark.

In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms that you can run with the PySpark driver and shell scripts.

With this book, you will:

  • Learn how to select Spark transformations for optimized solutions
  • Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions() (a short sketch follows this list)
  • Understand data partitioning for optimized queries
  • Build and apply a model using PySpark design patterns
  • Apply motif-finding algorithms to graph data
  • Analyze graph data by using the GraphFrames API
  • Apply PySpark algorithms to clinical and genomics data
  • Learn how to apply feature engineering in ML algorithms
  • Understand and use practical data design patterns
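
To give a feel for the transformations listed above, here is a minimal, self-contained PySpark sketch; the sample data, SparkSession setup, and per-key average logic are illustrative assumptions, not excerpts from the book:

    # Hypothetical example (not from the book): reduceByKey(), combineByKey(),
    # and mapPartitions() on a small pair RDD.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
    sc = spark.sparkContext

    # Sample (key, value) pairs, spread over 2 partitions -- illustrative data only.
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)], 2)

    # reduceByKey(): sum the values for each key.
    sums = pairs.reduceByKey(lambda x, y: x + y)
    print(sums.collect())  # e.g., [('a', 9), ('b', 6)]

    # combineByKey(): build (sum, count) per key, then derive the average.
    sum_count = pairs.combineByKey(
        lambda v: (v, 1),                         # createCombiner
        lambda acc, v: (acc[0] + v, acc[1] + 1),  # mergeValue
        lambda a, b: (a[0] + b[0], a[1] + b[1]),  # mergeCombiners
    )
    averages = sum_count.mapValues(lambda t: t[0] / t[1])
    print(averages.collect())  # e.g., [('a', 3.0), ('b', 3.0)]

    # mapPartitions(): process a whole partition at a time
    # (here, count the number of pairs in each partition).
    def count_partition(iterator):
        yield sum(1 for _ in iterator)

    print(pairs.mapPartitions(count_partition).collect())  # e.g., [2, 3]

    spark.stop()

reduceByKey() and combineByKey() aggregate values per key during the shuffle, while mapPartitions() hands a function an entire partition at once; the chapters on mapper transformations and reductions explore these trade-offs in depth.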

Table of contents

  1. Foreword
  2. Preface
    1. Why I Wrote This Book
    2. Who This Book Is For
    3. How This Book Is Organized
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  3. I. Fundamentals
  4. 1. Introduction to Spark and PySpark
    1. Why Spark for Data Analytics
      1. The Spark Ecosystem
      2. Spark Architecture
    2. The Power of PySpark
      1. PySpark Architecture
    3. Spark Data Abstractions
      1. RDD Examples
      2. Spark RDD Operations
      3. DataFrame Examples
    4. Using the PySpark Shell
      1. Launching the PySpark Shell
      2. Creating an RDD from a Collection
      3. Aggregating and Merging Values of Keys
      4. Filtering an RDD’s Elements
      5. Grouping Similar Keys
      6. Aggregating Values for Similar Keys
    5. ETL Example with DataFrames
      1. Extraction
      2. Transformation
      3. Loading
    6. Summary
  5. 2. Transformations in Action
    1. The DNA Base Count Example
      1. The DNA Base Count Problem
      2. FASTA Format
      3. Sample Data
    2. DNA Base Count Solution 1
      1. Step 1: Create an RDD[String] from the Input
      2. Step 2: Define a Mapper Function
      3. Step 3: Find the Frequencies of DNA Letters
      4. Pros and Cons of Solution 1
    3. DNA Base Count Solution 2
      1. Step 1: Create an RDD[String] from the Input
      2. Step 2: Define a Mapper Function
      3. Step 3: Find the Frequencies of DNA Letters
      4. Pros and Cons of Solution 2
    4. DNA Base Count Solution 3
      1. The mapPartitions() Transformation
      2. Step 1: Create an RDD[String] from the Input
      3. Step 2: Define a Function to Handle a Partition
      4. Step 3: Apply the Custom Function to Each Partition
      5. Pros and Cons of Solution 3
    5. Summary
  6. 3. Mapper Transformations
    1. Data Abstractions and Mappers
    2. What Are Transformations?
      1. Lazy Transformations
      2. The map() Transformation
      3. DataFrame Mapper
    3. The flatMap() Transformation
      1. map() Versus flatMap()
      2. Apply flatMap() to a DataFrame
    4. The mapValues() Transformation
    5. The flatMapValues() Transformation
    6. The mapPartitions() Transformation
      1. Handling Empty Partitions
      2. Benefits and Drawbacks
      3. DataFrames and mapPartitions() Transformation
    7. Summary
  7. 4. Reductions in Spark
    1. Creating Pair RDDs
    2. Reduction Transformations
    3. Spark’s Reductions
    4. Simple Warmup Example
      1. Solving with reduceByKey()
      2. Solving with groupByKey()
      3. Solving with aggregateByKey()
      4. Solving with combineByKey()
    5. What Is a Monoid?
      1. Monoid and Non-Monoid Examples
    6. The Movie Problem
      1. Input Dataset to Analyze
      2. The aggregateByKey() Transformation
      3. First Solution Using aggregateByKey()
      4. Second Solution Using aggregateByKey()
      5. Complete PySpark Solution Using groupByKey()
      6. Complete PySpark Solution Using reduceByKey()
      7. Complete PySpark Solution Using combineByKey()
    7. The Shuffle Step in Reductions
      1. Shuffle Step for groupByKey()
      2. Shuffle Step for reduceByKey()
    8. Summary
  8. II. Working with Data
  9. 5. Partitioning Data
    1. Introduction to Partitions
      1. Partitions in Spark
    2. Managing Partitions
      1. Default Partitioning
      2. Explicit Partitioning
    3. Physical Partitioning for SQL Queries
    4. Physical Partitioning of Data in Spark
      1. Partition as Text Format
      2. Partition as Parquet Format
    5. How to Query Partitioned Data
      1. Amazon Athena Example
    6. Summary
  10. 6. Graph Algorithms
    1. Introduction to Graphs
    2. The GraphFrames API
      1. How to Use GraphFrames
      2. GraphFrames Functions and Attributes
    3. GraphFrames Algorithms
      1. Finding Triangles
      2. Motif Finding
    4. Real-World Applications
      1. Gene Analysis
      2. Social Recommendations
      3. Facebook Circles
      4. Connected Components
      5. Analyzing Flight Data
    5. Summary
  11. 7. Interacting with External Data Sources
    1. Relational Databases
      1. Reading from a Database
      2. Writing a DataFrame to a Database
    2. Reading Text Files
    3. Reading and Writing CSV Files
      1. Reading CSV Files
      2. Writing CSV Files
    4. Reading and Writing JSON Files
      1. Reading JSON Files
      2. Writing JSON Files
    5. Reading from and Writing to Amazon S3
      1. Reading from Amazon S3
      2. Writing to Amazon S3
    6. Reading and Writing Hadoop Files
      1. Reading Hadoop Text Files
      2. Writing Hadoop Text Files
      3. Reading and Writing HDFS SequenceFiles
    7. Reading and Writing Parquet Files
      1. Writing Parquet Files
      2. Reading Parquet Files
    8. Reading and Writing Avro Files
      1. Reading Avro Files
      2. Writing Avro Files
    9. Reading from and Writing to MS SQL Server
      1. Writing to MS SQL Server
      2. Reading from MS SQL Server
    10. Reading Image Files
      1. Creating a DataFrame from Images
    11. Summary
  12. 8. Ranking Algorithms
    1. Rank Product
      1. Calculation of the Rank Product
      2. Formalizing Rank Product
      3. Rank Product Example
      4. PySpark Solution
    2. PageRank
      1. PageRank’s Iterative Computation
      2. Custom PageRank in PySpark Using RDDs
      3. Custom PageRank in PySpark Using an Adjacency Matrix
      4. PageRank with GraphFrames
    3. Summary
  13. III. Data Design Patterns
  14. 9. Classic Data Design Patterns
    1. Input-Map-Output
      1. RDD Solution
      2. DataFrame Solution
      3. Flat Mapper Functionality
    2. Input-Filter-Output
      1. RDD Solution
      2. DataFrame Solution
      3. DataFrame Filter
    3. Input-Map-Reduce-Output
      1. RDD Solution
      2. DataFrame Solution
    4. Input-Multiple-Maps-Reduce-Output
      1. RDD Solution
      2. DataFrame Solution
    5. Input-Map-Combiner-Reduce-Output
    6. Input-MapPartitions-Reduce-Output
    7. Inverted Index
      1. Problem Statement
      2. Input
      3. Output
      4. PySpark Solution
    8. Summary
  15. 10. Practical Data Design Patterns
    1. In-Mapper Combining
      1. Basic MapReduce Algorithm
      2. In-Mapper Combining per Record
      3. In-Mapper Combining per Partition
    2. Top-10
      1. Top-N Formalized
      2. PySpark Solution
      3. Finding the Bottom 10
    3. MinMax
      1. Solution 1: Classic MapReduce
      2. Solution 2: Sorting
      3. Solution 3: Spark’s mapPartitions()
    4. The Composite Pattern and Monoids
      1. Monoids
      2. Monoidal and Non-Monoidal Examples
      3. Non-Monoid MapReduce Example
      4. Monoid MapReduce Example
      5. PySpark Implementation of Monoidal Mean
      6. Functors and Monoids
      7. Conclusion on Using Monoids
    5. Binning
    6. Sorting
    7. Summary
  16. 11. Join Design Patterns
    1. Introduction to the Join Operation
    2. Join in MapReduce
      1. Map Phase
      2. Reducer Phase
      3. Implementation in PySpark
    3. Map-Side Join Using RDDs
    4. Map-Side Join Using DataFrames
      1. Step 1: Create Cache for Airports
      2. Step 2: Create Cache for Airlines
      3. Step 3: Create Facts Table
      4. Step 4: Apply Map-Side Join
    5. Efficient Joins Using Bloom Filters
      1. Introduction to Bloom Filters
      2. A Simple Bloom Filter Example
      3. Bloom Filters in Python
      4. Using Bloom Filters in PySpark
    6. Summary
  17. 12. Feature Engineering in PySpark
    1. Introduction to Feature Engineering
    2. Adding New Features
    3. Applying UDFs
    4. Creating Pipelines
    5. Binarizing Data
    6. Imputation
    7. Tokenization
      1. Tokenizer
      2. RegexTokenizer
      3. Tokenization with a Pipeline
    8. Standardization
    9. Normalization
      1. Scaling a Column Using a Pipeline
      2. Using MinMaxScaler on Multiple Columns
      3. Normalization Using Normalizer
    10. String Indexing
      1. Applying StringIndexer to a Single Column
      2. Applying StringIndexer to Several Columns
    11. Vector Assembly
    12. Bucketing
      1. Bucketizer
      2. QuantileDiscretizer
    13. Logarithm Transformation
    14. One-Hot Encoding
    15. TF-IDF
    16. FeatureHasher
    17. SQLTransformer
    18. Summary
  18. Index
  19. About the Author

Product information

  • Title: Data Algorithms with Spark
  • Author(s): Mahmoud Parsian
  • Release date: April 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492082385