PySpark Cookbook

Book Description

Combine the power of Apache Spark and Python to build effective big data applications

About This Book
  • Perform effective data processing, machine learning, and analytics using PySpark
  • Overcome challenges in developing and deploying Spark solutions using Python
  • Explore recipes for efficiently combining Python and Apache Spark to process data

Who This Book Is For

The PySpark Cookbook is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.

What You Will Learn
  • Configure a local instance of PySpark in a virtual environment
  • Install and configure Jupyter in local and multi-node environments
  • Create DataFrames from JSON and a dictionary using pyspark.sql
  • Explore regression and clustering models available in the ML module
  • Use DataFrames to transform data used for modeling
  • Connect to PubNub and perform aggregations on streams

In Detail

Apache Spark is an open-source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective, time-saving recipes for putting the power of Python to use in the Spark ecosystem.

You'll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You'll then get familiar with the modules available in PySpark and start using them. You'll also discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You'll then move on to the ML and MLlib modules to tackle machine learning problems in PySpark, and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command.

By the end of this book, you will be able to use the Python API for Apache Spark to solve problems associated with building data-intensive applications.

Style and approach

This book is a rich collection of recipes that will come in handy when you are working with PySpark.

Addressing your common and not-so-common pain points, this is a book that you must have on the shelf.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. PySpark Cookbook
  3. Packt Upsell
    1. Why subscribe?
    2. PacktPub.com
  4. Contributors
    1. About the authors
    2. About the reviewer
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Sections
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    5. Get in touch
      1. Reviews
  6. Installing and Configuring Spark
    1. Introduction
    2. Installing Spark requirements
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Installing Java
        2. Installing Python
        3. Installing R
        4. Installing Scala
        5. Installing Maven
        6. Updating PATH
    3. Installing Spark from sources
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    4. Installing Spark from binaries
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    5. Configuring a local instance of Spark
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    6. Configuring a multi-node instance of Spark
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    7. Installing Jupyter
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    8. Configuring a session in Jupyter
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    9. Working with Cloudera Spark images
      1. Getting ready
      2. How to do it...
      3. How it works...
  7. Abstracting Data with RDDs
    1. Introduction
    2. Creating RDDs
      1. Getting ready
      2. How to do it...
      3. How it works...
        1. Spark context parallelize method
        2. .take(...) method
    3. Reading data from files
      1. Getting ready
      2. How to do it...
      3. How it works...
        1. .textFile(...) method
        2. .map(...) method
        3. Partitions and performance
    4. Overview of RDD transformations
      1. Getting ready
      2. How to do it...
        1. .map(...) transformation
        2. .filter(...) transformation
        3. .flatMap(...) transformation
        4. .distinct() transformation
        5. .sample(...) transformation
        6. .join(...) transformation
        7. .repartition(...) transformation
        8. .zipWithIndex() transformation
        9. .reduceByKey(...) transformation
        10. .sortByKey(...) transformation
        11. .union(...) transformation
        12. .mapPartitionsWithIndex(...) transformation
      3. How it works...
    5. Overview of RDD actions
      1. Getting ready
      2. How to do it...
        1. .take(...) action
        2. .collect() action
        3. .reduce(...) action
        4. .count() action
        5. .saveAsTextFile(...) action
      3. How it works...
    6. Pitfalls of using RDDs
      1. Getting ready
      2. How to do it...
      3. How it works...
  8. Abstracting Data with DataFrames
    1. Introduction
    2. Creating DataFrames
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. From JSON
        2. From CSV
      5. See also
    3. Accessing underlying RDDs
      1. Getting ready
      2. How to do it...
      3. How it works...
    4. Performance optimizations
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    5. Inferring the schema using reflection
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    6. Specifying the schema programmatically
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    7. Creating a temporary table
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    8. Using SQL to interact with DataFrames
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    9. Overview of DataFrame transformations
      1. Getting ready
      2. How to do it...
        1. The .select(...) transformation
        2. The .filter(...) transformation
        3. The .groupBy(...) transformation
        4. The .orderBy(...) transformation
        5. The .withColumn(...) transformation
        6. The .join(...) transformation
        7. The .unionAll(...) transformation
        8. The .distinct(...) transformation
        9. The .repartition(...) transformation
        10. The .fillna(...) transformation
        11. The .dropna(...) transformation
        12. The .dropDuplicates(...) transformation
        13. The .summary() and .describe() transformations
        14. The .freqItems(...) transformation
      3. See also
    10. Overview of DataFrame actions
      1. Getting ready
      2. How to do it...
        1. The .show(...) action
        2. The .collect() action
        3. The .take(...) action
        4. The .toPandas() action
      3. See also
  9. Preparing Data for Modeling
    1. Introduction
    2. Handling duplicates
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Only IDs differ
        2. ID collisions
    3. Handling missing observations
      1. Getting ready
      2. How to do it...
      3. How it works...
        1. Missing observations per row
        2. Missing observations per column
      4. There's more...
      5. See also
    4. Handling outliers
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    5. Exploring descriptive statistics
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
        1. Descriptive statistics for aggregated columns
      5. See also
    6. Computing correlations
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    7. Drawing histograms
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    8. Visualizing interactions between features
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
  10. Machine Learning with MLlib
    1. Loading the data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    2. Exploring the data
      1. Getting ready
      2. How to do it...
      3. How it works...
        1. Numerical features
        2. Categorical features
      4. There's more...
      5. See also
    3. Testing the data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    4. Transforming the data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    5. Standardizing the data
      1. Getting ready
      2. How to do it...
      3. How it works...
    6. Creating an RDD for training
      1. Getting ready
      2. How to do it...
        1. Classification
        2. Regression
      3. How it works...
      4. There's more...
      5. See also
    7. Predicting hours of work for census respondents
      1. Getting ready
      2. How to do it...
      3. How it works...
    8. Forecasting the income levels of census respondents
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    9. Building a clustering model
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    10. Computing performance statistics
      1. Getting ready
      2. How to do it...
      3. How it works...
        1. Regression metrics
        2. Classification metrics
      4. See also
  11. Machine Learning with the ML Module
    1. Introducing Transformers
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    2. Introducing Estimators
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    3. Introducing Pipelines
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    4. Selecting the most predictable features
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    5. Predicting forest coverage types
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    6. Estimating forest elevation
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    7. Clustering forest cover types
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. See also
    8. Tuning hyperparameters
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    9. Extracting features from text
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    10. Discretizing continuous variables
      1. Getting ready
      2. How to do it...
      3. How it works...
    11. Standardizing continuous variables
      1. Getting ready
      2. How to do it...
      3. How it works...
    12. Topic mining
      1. Getting ready
      2. How to do it...
      3. How it works...
  12. Structured Streaming with PySpark
    1. Introduction
      1. Understanding Spark Streaming
    2. Understanding DStreams
      1. Getting ready
      2. How to do it...
        1. Terminal 1 – Netcat window
        2. Terminal 2 – Spark Streaming window
      3. How it works...
      4. There's more...
    3. Understanding global aggregations
      1. Getting ready
      2. How to do it...
        1. Terminal 1 – Netcat window
        2. Terminal 2 – Spark Streaming window
      3. How it works...
    4. Continuous aggregation with structured streaming
      1. Getting ready
      2. How to do it...
        1. Terminal 1 – Netcat window
        2. Terminal 2 – Spark Streaming window
      3. How it works...
  13. GraphFrames – Graph Theory with PySpark
    1. Introduction
    2. Installing GraphFrames
      1. Getting ready
      2. How to do it...
      3. How it works...
    3. Preparing the data
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
    4. Building the graph
      1. How to do it...
      2. How it works...
    5. Running queries against the graph
      1. Getting ready
      2. How to do it...
      3. How it works...
    6. Understanding the graph
      1. Getting ready
      2. How to do it...
      3. How it works...
    7. Using PageRank to determine airport ranking
      1. Getting ready
      2. How to do it...
      3. How it works...
    8. Finding the fewest number of connections
      1. Getting ready
      2. How to do it...
      3. How it works...
      4. There's more...
      5. See also
    9. Visualizing the graph
      1. Getting ready
      2. How to do it...
      3. How it works...