Learning PySpark

Book Description

Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0

About This Book

  • Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0
  • Develop and deploy efficient, scalable real-time Spark solutions
  • Take your understanding of using Spark with Python to the next level with this jump start guide

Who This Book Is For

If you are a Python developer who wants to learn about the Apache Spark 2.0 ecosystem, this book is for you. A firm understanding of Python is expected to get the best out of the book. Familiarity with Spark would be useful, but is not mandatory.

What You Will Learn

  • Learn about Apache Spark and the Spark 2.0 architecture
  • Build and interact with Spark DataFrames using Spark SQL
  • Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames, respectively
  • Read, transform, and understand data and use it to train machine learning models
  • Build machine learning models with MLlib and ML
  • Learn how to submit your applications programmatically using spark-submit
  • Deploy locally built applications to a cluster

In Detail

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark.

You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. You will also get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command.

By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Style and approach

This book takes a comprehensive, step-by-step approach so you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every chapter is standalone and written in an easy-to-understand manner, with a focus on both the hows and the whys of each concept.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Learning PySpark
    1. Table of Contents
    2. Learning PySpark
    3. Credits
    4. Foreword
    5. About the Authors
    6. About the Reviewer
    7. www.PacktPub.com
    8. Customer Feedback
    9. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    10. 1. Understanding Spark
      1. What is Apache Spark?
      2. Spark Jobs and APIs
        1. Execution process
        2. Resilient Distributed Dataset
        3. DataFrames
        4. Datasets
        5. Catalyst Optimizer
        6. Project Tungsten
      3. Spark 2.0 architecture
        1. Unifying Datasets and DataFrames
        2. Introducing SparkSession
        3. Tungsten phase 2
        4. Structured Streaming
        5. Continuous applications
      4. Summary
    11. 2. Resilient Distributed Datasets
      1. Internal workings of an RDD
      2. Creating RDDs
        1. Schema
        2. Reading from files
        3. Lambda expressions
      3. Global versus local scope
      4. Transformations
        1. The .map(...) transformation
        2. The .filter(...) transformation
        3. The .flatMap(...) transformation
        4. The .distinct(...) transformation
        5. The .sample(...) transformation
        6. The .leftOuterJoin(...) transformation
        7. The .repartition(...) transformation
      5. Actions
        1. The .take(...) method
        2. The .collect(...) method
        3. The .reduce(...) method
        4. The .count(...) method
        5. The .saveAsTextFile(...) method
        6. The .foreach(...) method
      6. Summary
    12. 3. DataFrames
      1. Python to RDD communications
      2. Catalyst Optimizer refresh
      3. Speeding up PySpark with DataFrames
      4. Creating DataFrames
        1. Generating our own JSON data
        2. Creating a DataFrame
        3. Creating a temporary table
      5. Simple DataFrame queries
        1. DataFrame API query
        2. SQL query
      6. Interoperating with RDDs
        1. Inferring the schema using reflection
        2. Programmatically specifying the schema
      7. Querying with the DataFrame API
        1. Number of rows
        2. Running filter statements
      8. Querying with SQL
        1. Number of rows
        2. Running filter statements using the where Clauses
      9. DataFrame scenario – on-time flight performance
        1. Preparing the source datasets
        2. Joining flight performance and airports
        3. Visualizing our flight-performance data
      10. Spark Dataset API
      11. Summary
    13. 4. Prepare Data for Modeling
      1. Checking for duplicates, missing observations, and outliers
        1. Duplicates
        2. Missing observations
        3. Outliers
      2. Getting familiar with your data
        1. Descriptive statistics
        2. Correlations
      3. Visualization
        1. Histograms
        2. Interactions between features
      4. Summary
    14. 5. Introducing MLlib
      1. Overview of the package
      2. Loading and transforming the data
      3. Getting to know your data
        1. Descriptive statistics
        2. Correlations
        3. Statistical testing
      4. Creating the final dataset
        1. Creating an RDD of LabeledPoints
        2. Splitting into training and testing
      5. Predicting infant survival
        1. Logistic regression in MLlib
        2. Selecting only the most predictable features
        3. Random forest in MLlib
      6. Summary
    15. 6. Introducing the ML Package
      1. Overview of the package
        1. Transformer
        2. Estimators
          1. Classification
          2. Regression
          3. Clustering
        3. Pipeline
      2. Predicting the chances of infant survival with ML
        1. Loading the data
        2. Creating transformers
        3. Creating an estimator
        4. Creating a pipeline
        5. Fitting the model
        6. Evaluating the performance of the model
        7. Saving the model
      3. Parameter hyper-tuning
        1. Grid search
        2. Train-validation splitting
      4. Other features of PySpark ML in action
        1. Feature extraction
          1. NLP-related feature extractors
          2. Discretizing continuous variables
          3. Standardizing continuous variables
        2. Classification
        3. Clustering
          1. Finding clusters in the births dataset
          2. Topic mining
        4. Regression
      5. Summary
    16. 7. GraphFrames
      1. Introducing GraphFrames
      2. Installing GraphFrames
        1. Creating a library
      3. Preparing your flights dataset
      4. Building the graph
      5. Executing simple queries
        1. Determining the number of airports and trips
        2. Determining the longest delay in this dataset
        3. Determining the number of delayed versus on-time/early flights
        4. What flights departing Seattle are most likely to have significant delays?
        5. What states tend to have significant delays departing from Seattle?
      6. Understanding vertex degrees
      7. Determining the top transfer airports
      8. Understanding motifs
      9. Determining airport ranking using PageRank
      10. Determining the most popular non-stop flights
      11. Using Breadth-First Search
      12. Visualizing flights using D3
      13. Summary
    17. 8. TensorFrames
      1. What is Deep Learning?
        1. The need for neural networks and Deep Learning
        2. What is feature engineering?
        3. Bridging the data and algorithm
      2. What is TensorFlow?
        1. Installing Pip
        2. Installing TensorFlow
        3. Matrix multiplication using constants
        4. Matrix multiplication using placeholders
          1. Running the model
          2. Running another model
        5. Discussion
      3. Introducing TensorFrames
      4. TensorFrames – quick start
        1. Configuration and setup
          1. Launching a Spark cluster
          2. Creating a TensorFrames library
          3. Installing TensorFlow on your cluster
        2. Using TensorFlow to add a constant to an existing column
          1. Executing the Tensor graph
        3. Blockwise reducing operations example
          1. Building a DataFrame of vectors
          2. Analysing the DataFrame
          3. Computing elementwise sum and min of all vectors
      5. Summary
    18. 9. Polyglot Persistence with Blaze
      1. Installing Blaze
      2. Polyglot persistence
      3. Abstracting data
        1. Working with NumPy arrays
        2. Working with pandas' DataFrame
        3. Working with files
        4. Working with databases
          1. Interacting with relational databases
          2. Interacting with the MongoDB database
      4. Data operations
        1. Accessing columns
        2. Symbolic transformations
        3. Operations on columns
        4. Reducing data
        5. Joins
      5. Summary
    19. 10. Structured Streaming
      1. What is Spark Streaming?
      2. Why do we need Spark Streaming?
      3. What is the Spark Streaming application data flow?
      4. Simple streaming application using DStreams
      5. A quick primer on global aggregations
      6. Introducing Structured Streaming
      7. Summary
    20. 11. Packaging Spark Applications
      1. The spark-submit command
        1. Command line parameters
      2. Deploying the app programmatically
        1. Configuring your SparkSession
        2. Creating SparkSession
        3. Modularizing code
          1. Structure of the module
          2. Calculating the distance between two points
          3. Converting distance units
          4. Building an egg
          5. User defined functions in Spark
        4. Submitting a job
        5. Monitoring execution
      3. Databricks Jobs
      4. Summary
    21. Index