Essential PySpark for Scalable Data Analytics

Book description

Get started with distributed computing using PySpark, a single unified framework for end-to-end data analytics at scale

Key Features

  • Discover how to convert huge amounts of raw data into meaningful and actionable insights
  • Use Spark's unified analytics engine for end-to-end analytics, from data preparation to predictive analytics
  • Perform data ingestion, cleansing, and integration for ML, data analytics, and data visualization


Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use scalable data analytics framework.
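To give a feel for what this looks like in practice, here is a minimal, illustrative PySpark sketch (not taken from the book; it assumes a local Spark installation and a hypothetical retail_sales.csv file):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start (or reuse) a Spark session, the entry point to the PySpark DataFrame API
    spark = SparkSession.builder.appName("pyspark-quickstart").getOrCreate()

    # Read a hypothetical CSV file into a distributed DataFrame
    sales = spark.read.csv("retail_sales.csv", header=True, inferSchema=True)

    # A simple aggregation that Spark plans and executes in parallel
    (sales.groupBy("product_category")
          .agg(F.sum("amount").alias("total_sales"))
          .orderBy(F.desc("total_sales"))
          .show())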

Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. You'll then build real-time analytics pipelines that deliver insights faster, discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability to data lakes. The book also covers the Data Lakehouse, an emerging paradigm that combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, including data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries, along with Koalas, a pandas-like API built on top of PySpark.
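To make the Delta Lake and Koalas references above concrete, the following is a small, illustrative sketch rather than code from the book; it assumes the delta-spark and koalas packages are installed and uses hypothetical /tmp paths:

    from pyspark.sql import SparkSession
    import databricks.koalas as ks

    # A Spark session configured with the open source Delta Lake extension
    spark = (SparkSession.builder
             .appName("lakehouse-sketch")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Ingest a hypothetical JSON event feed, de-duplicate it, and persist it as a
    # Delta table, gaining ACID transactions and schema enforcement on the data lake
    events = spark.read.json("/tmp/raw_events")
    (events.dropDuplicates()
           .write.format("delta")
           .mode("overwrite")
           .save("/tmp/curated_events"))

    # Explore the curated Delta table with Koalas' pandas-like API
    kdf = ks.read_delta("/tmp/curated_events")
    print(kdf.head())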

By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.

What you will learn

  • Understand the role of distributed computing in the world of big data
  • Gain an appreciation for Apache Spark as the de facto standard for big data processing
  • Scale out your data analytics process using Apache Spark
  • Build data pipelines using data lakes, and perform data visualization with PySpark and Spark SQL
  • Leverage the cloud to build truly scalable and real-time data analytics applications
  • Explore the applications of data science and scalable machine learning with PySpark
  • Integrate your clean and curated data with BI and SQL analysis tools

Who this book is for

This book is for practicing data engineers, data scientists, data analysts, and data enthusiasts who already work with data analytics and want to explore distributed and scalable data analytics. Basic to intermediate knowledge of the disciplines of data engineering, data science, and SQL analytics is expected. General proficiency in using any programming language, especially Python, and working knowledge of performing data analytics using frameworks such as pandas and SQL will help you to get the most out of this book.

Table of contents

  1. Essential PySpark for Scalable Data Analytics
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Share your thoughts
  6. Section 1: Data Engineering
  7. Chapter 1: Distributed Computing Primer
    1. Technical requirements
    2. Distributed Computing
      1. Introduction to Distributed Computing
      2. Data Parallel Processing
      3. Data Parallel Processing using the MapReduce paradigm
    3. Distributed Computing with Apache Spark
      1. Introduction to Apache Spark
      2. Data Parallel Processing with RDDs
      3. Higher-order functions
      4. Apache Spark cluster architecture
      5. Getting started with Spark
    4. Big data processing with Spark SQL and DataFrames
      1. Transforming data with Spark DataFrames
      2. Using SQL on Spark
      3. What's new in Apache Spark 3.0?
    5. Summary
  8. Chapter 2: Data Ingestion
    1. Technical requirements
    2. Introduction to Enterprise Decision Support Systems
    3. Ingesting data from data sources
      1. Ingesting from relational data sources
      2. Ingesting from file-based data sources
      3. Ingesting from message queues
    4. Ingesting data into data sinks
      1. Ingesting into data warehouses
      2. Ingesting into data lakes
      3. Ingesting into NoSQL and in-memory data stores
    5. Using file formats for data storage in data lakes
      1. Unstructured data storage formats
      2. Semi-structured data storage formats
      3. Structured data storage formats
    6. Building data ingestion pipelines in batch and real time
      1. Data ingestion using batch processing
      2. Data ingestion in real time using structured streaming
    7. Unifying batch and real time using Lambda Architecture
      1. Lambda Architecture
      2. The Batch layer
      3. The Speed layer
      4. The Serving layer
    8. Summary
  9. Chapter 3: Data Cleansing and Integration
    1. Technical requirements
    2. Transforming raw data into enriched meaningful data
      1. Extracting, transforming, and loading data
      2. Extracting, loading, and transforming data
      3. Advantages of choosing ELT over ETL
    3. Building analytical data stores using cloud data lakes
      1. Challenges with cloud data lakes
      2. Overcoming data lake challenges with Delta Lake
    4. Consolidating data using data integration
      1. Data consolidation via ETL and data warehousing
      2. Integrating data using data virtualization techniques
      3. Data integration through data federation
    5. Making raw data analytics-ready using data cleansing
      1. Data selection to eliminate redundancies
      2. De-duplicating data
      3. Standardizing data
      4. Optimizing ELT processing performance with data partitioning
    6. Summary
  10. Chapter 4: Real-Time Data Analytics
    1. Technical requirements
    2. Real-time analytics systems architecture
      1. Streaming data sources
      2. Streaming data sinks
    3. Stream processing engines
      1. Real-time data consumers
    4. Real-time analytics industry use cases
      1. Real-time predictive analytics in manufacturing
      2. Connected vehicles in the automotive sector
      3. Financial fraud detection
      4. IT security threat detection
    5. Simplifying the Lambda Architecture using Delta Lake
    6. Change Data Capture
    7. Handling late-arriving data
      1. Stateful stream processing using windowing and watermarking
    8. Multi-hop pipelines
    9. Summary
  11. Section 2: Data Science
  12. Chapter 5: Scalable Machine Learning with PySpark
    1. Technical requirements
    2. ML overview
      1. Types of ML algorithms
      2. Business use cases of ML
    3. Scaling out machine learning
      1. Techniques for scaling ML
      2. Introduction to Apache Spark's ML library
    4. Data wrangling with Apache Spark and MLlib
      1. Data preprocessing
      2. Data cleansing
      3. Data manipulation
    5. Summary
  13. Chapter 6: Feature Engineering – Extraction, Transformation, and Selection
    1. Technical requirements
    2. The machine learning process
    3. Feature extraction
    4. Feature transformation
      1. Transforming categorical variables
      2. Transforming continuous variables
      3. Transforming the date and time variables
      4. Assembling individual features into a feature vector
      5. Feature scaling
    5. Feature selection
    6. Feature store as a central feature repository
      1. Batch inferencing using the offline feature store
    7. Delta Lake as an offline feature store
      1. Structure and metadata with Delta tables
      2. Schema enforcement and evolution with Delta Lake
      3. Support for simultaneous batch and streaming workloads
      4. Delta Lake time travel
      5. Integration with machine learning operations tools
      6. Online feature store for real-time inferencing
    8. Summary
  14. Chapter 7: Supervised Machine Learning
    1. Technical requirements
    2. Introduction to supervised machine learning
      1. Parametric machine learning
      2. Non-parametric machine learning
    3. Regression
      1. Linear regression
      2. Regression using decision trees
    4. Classification
      1. Logistic regression
      2. Classification using decision trees
      3. Naïve Bayes
      4. Support vector machines
    5. Tree ensembles
      1. Regression using random forests
      2. Classification using random forests
      3. Regression using gradient boosted trees
      4. Classification using GBTs
    6. Real-world supervised learning applications
      1. Regression applications
      2. Classification applications
    7. Summary
  15. Chapter 8: Unsupervised Machine Learning
    1. Technical requirements
    2. Introduction to unsupervised machine learning
    3. Clustering using machine learning
      1. K-means clustering
      2. Hierarchical clustering using bisecting K-means
      3. Topic modeling using latent Dirichlet allocation
      4. Gaussian mixture model
    4. Building association rules using machine learning
      1. Collaborative filtering using alternating least squares
    5. Real-world applications of unsupervised learning
      1. Clustering applications
      2. Association rules and collaborative filtering applications
    6. Summary
  16. Chapter 9: Machine Learning Life Cycle Management
    1. Technical requirements
    2. Introduction to the ML life cycle
      1. Introduction to MLflow
    3. Tracking experiments with MLflow
      1. ML model tuning
    4. Tracking model versions using MLflow Model Registry
    5. Model serving and inferencing
      1. Offline model inferencing
      2. Online model inferencing
    6. Continuous delivery for ML
    7. Summary
  17. Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark
    1. Technical requirements
    2. Scaling out EDA
      1. EDA using pandas
      2. EDA using PySpark
    3. Scaling out model inferencing
    4. Model training using embarrassingly parallel computing
      1. Distributed hyperparameter tuning
      2. Scaling out arbitrary Python code using pandas UDF
    5. Upgrading pandas to PySpark using Koalas
    6. Summary
  18. Section 3: Data Analysis
  19. Chapter 11: Data Visualization with PySpark
    1. Technical requirements
    2. Importance of data visualization
      1. Types of data visualization tools
    3. Techniques for visualizing data using PySpark
      1. PySpark native data visualizations
      2. Using Python data visualizations with PySpark
    4. Considerations for PySpark to pandas conversion
      1. Introduction to pandas
      2. Converting from PySpark into pandas
    5. Summary
  20. Chapter 12: Spark SQL Primer
    1. Technical requirements
    2. Introduction to SQL
      1. DDL
      2. DML
      3. Joins and sub-queries
      4. Row-based versus columnar storage
    3. Introduction to Spark SQL
      1. Catalyst optimizer
      2. Spark SQL data sources
    4. Spark SQL language reference
      1. Spark SQL DDL
      2. Spark DML
    5. Optimizing Spark SQL performance
    6. Summary
  21. Chapter 13: Integrating External Tools with Spark SQL
    1. Technical requirements
    2. Apache Spark as a distributed SQL engine
      1. Introduction to Hive Thrift JDBC/ODBC Server
    3. Spark connectivity to SQL analysis tools
    4. Spark connectivity to BI tools
    5. Connecting Python applications to Spark SQL using Pyodbc
    6. Summary
  22. Chapter 14: The Data Lakehouse
    1. Moving from BI to AI
      1. Challenges with data warehouses
      2. Challenges with data lakes
    2. The data lakehouse paradigm
      1. Key requirements of a data lakehouse
      2. Data lakehouse architecture
      3. Examples of existing lakehouse architectures
      4. Apache Spark-based data lakehouse architecture
    3. Advantages of data lakehouses
    4. Summary
    5. Why subscribe?
  23. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share your thoughts

Product information

  • Title: Essential PySpark for Scalable Data Analytics
  • Author(s): Sreeram Nudurupati
  • Release date: October 2021
  • Publisher(s): Packt Publishing
  • ISBN: 9781800568877