Advanced Analytics with PySpark

Book description

The amount of data being generated today is staggering and still growing. Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, along with other best practices in Spark programming.

Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques (including classification, clustering, collaborative filtering, and anomaly detection) to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing.

If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.

  • Familiarize yourself with Spark's programming model and ecosystem
  • Learn general approaches in data science
  • Examine complete implementations that analyze large public datasets
  • Discover which machine learning tools make sense for particular problems
  • Explore code that can be adapted to many uses

Table of contents

  1. Preface
    1. Why Did We Write This Book Now?
    2. How This Book Is Organized
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Online Learning
    6. How to Contact Us
    7. Acknowledgments
  2. 1. Analyzing Big Data
    1. Working with Big Data
    2. Introducing Apache Spark and PySpark
      1. Components
      2. PySpark
      3. Ecosystem
    3. Spark 3.0
    4. PySpark Addresses Challenges of Data Science
    5. Where to Go from Here
  3. 2. Introduction to Data Analysis with PySpark
    1. Spark Architecture
    2. Installing PySpark
    3. Setting Up Our Data
    4. Analyzing Data with the DataFrame API
    5. Fast Summary Statistics for DataFrames
    6. Pivoting and Reshaping DataFrames
    7. Joining DataFrames and Selecting Features
    8. Scoring and Model Evaluation
    9. Where to Go from Here
  4. 3. Recommending Music and the Audioscrobbler Dataset
    1. Setting Up the Data
    2. Our Requirements for a Recommender System
      1. Alternating Least Squares Algorithm
    3. Preparing the Data
    4. Building a First Model
    5. Spot Checking Recommendations
    6. Evaluating Recommendation Quality
    7. Computing AUC
    8. Hyperparameter Selection
    9. Making Recommendations
    10. Where to Go from Here
  5. 4. Making Predictions with Decision Trees and Decision Forests
    1. Decision Trees and Forests
    2. Preparing the Data
    3. Our First Decision Tree
    4. Decision Tree Hyperparameters
    5. Tuning Decision Trees
    6. Categorical Features Revisited
    7. Random Forests
    8. Making Predictions
    9. Where to Go from Here
  6. 5. Anomaly Detection with K-means Clustering
    1. K-means Clustering
    2. Identifying Anomalous Network Traffic
      1. KDD Cup 1999 Dataset
    3. A First Take on Clustering
    4. Choosing k
    5. Visualization with SparkR
    6. Feature Normalization
    7. Categorical Variables
    8. Using Labels with Entropy
    9. Clustering in Action
    10. Where to Go from Here
  7. 6. Understanding Wikipedia with LDA and Spark NLP
    1. Latent Dirichlet Allocation
      1. LDA in PySpark
    2. Getting the Data
    3. Spark NLP
      1. Setting Up Your Environment
    4. Parsing the Data
    5. Preparing the Data Using Spark NLP
    6. TF-IDF
    7. Computing the TF-IDFs
    8. Creating Our LDA Model
    9. Where to Go from Here
  8. 7. Geospatial and Temporal Data Analysis on Taxi Trip Data
    1. Preparing the Data
      1. Converting Datetime Strings to Timestamps
      2. Handling Invalid Records
    2. Geospatial Analysis
      1. Intro to GeoJSON
      2. GeoPandas
    3. Sessionization in PySpark
      1. Building Sessions: Secondary Sorts in PySpark
    4. Where to Go from Here
  9. 8. Estimating Financial Risk
    1. Terminology
    2. Methods for Calculating VaR
      1. Variance-Covariance
      2. Historical Simulation
      3. Monte Carlo Simulation
    3. Our Model
    4. Getting the Data
    5. Preparing the Data
    6. Determining the Factor Weights
    7. Sampling
      1. The Multivariate Normal Distribution
    8. Running the Trials
    9. Visualizing the Distribution of Returns
    10. Where to Go from Here
  10. 9. Analyzing Genomics Data and the BDG Project
    1. Decoupling Storage from Modeling
    2. Setting Up ADAM
    3. Introduction to Working with Genomics Data Using ADAM
      1. File Format Conversion with the ADAM CLI
      2. Ingesting Genomics Data Using PySpark and ADAM
    4. Predicting Transcription Factor Binding Sites from ENCODE Data
    5. Where to Go from Here
  11. 10. Image Similarity Detection with Deep Learning and PySpark LSH
    1. PyTorch
      1. Installation
    2. Preparing the Data
      1. Resizing Images Using PyTorch
    3. Deep Learning Model for Vector Representation of Images
      1. Image Embeddings
      2. Import Image Embeddings into PySpark
    4. Image Similarity Search Using PySpark LSH
      1. Nearest Neighbor Search
    5. Where to Go from Here
  12. 11. Managing the Machine Learning Lifecycle with MLflow
    1. Machine Learning Lifecycle
    2. MLflow
    3. Experiment Tracking
    4. Managing and Serving ML Models
    5. Creating and Using MLflow Projects
    6. Where to Go from Here
  13. Index
  14. About the Authors

Product information

  • Title: Advanced Analytics with PySpark
  • Author(s): Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
  • Release date: June 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098103651