Learning PySpark

Name: Learning PySpark
ISBN: 9781786463708

by Tomasz Drabas, Denny Lee

February 2017

Intermediate to advanced

274 pages

5h 58m

English

Packt Publishing

Table of Contents
Learning PySpark
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
What this book covers

What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
  Downloading the example code
  Downloading the color images of this book
  Errata
  Piracy
  Questions
1. Understanding Spark
What is Apache Spark?
Spark Jobs and APIs
  Execution process
  Resilient Distributed Dataset
  DataFrames
  Datasets
  Catalyst Optimizer
  Project Tungsten
Spark 2.0 architecture
  Unifying Datasets and DataFrames
  Introducing SparkSession
  Tungsten phase 2
  Structured Streaming
  Continuous applications
Summary
2. Resilient Distributed Datasets
Internal workings of an RDD
Creating RDDs
  Schema
  Reading from files
  Lambda expressions
Global versus local scope
Transformations
  The .map(...) transformation
  The .filter(...) transformation
  The .flatMap(...) transformation
  The .distinct(...) transformation
  The .sample(...) transformation
  The .leftOuterJoin(...) transformation
  The .repartition(...) transformation
Actions
  The .take(...) method
  The .collect(...) method
  The .reduce(...) method
  The .count(...) method
  The .saveAsTextFile(...) method
  The .foreach(...) method
Summary
3. DataFrames
Python to RDD communications
Catalyst Optimizer refresh
Speeding up PySpark with DataFrames
Creating DataFrames
  Generating our own JSON data
  Creating a DataFrame
  Creating a temporary table
Simple DataFrame queries
  DataFrame API query
  SQL query
Interoperating with RDDs
  Inferring the schema using reflection
  Programmatically specifying the schema
Querying with the DataFrame API
  Number of rows
  Running filter statements
Querying with SQL
  Number of rows
  Running filter statements using the where Clauses
DataFrame scenario – on-time flight performance
  Preparing the source datasets
  Joining flight performance and airports
  Visualizing our flight-performance data
Spark Dataset API
Summary
4. Prepare Data for Modeling
Checking for duplicates, missing observations, and outliers
  Duplicates
  Missing observations
  Outliers
Getting familiar with your data
  Descriptive statistics
  Correlations
Visualization
  Histograms
  Interactions between features
Summary
5. Introducing MLlib
Overview of the package
Loading and transforming the data
Getting to know your data
  Descriptive statistics
  Correlations
  Statistical testing
Creating the final dataset
  Creating an RDD of LabeledPoints
  Splitting into training and testing
Predicting infant survival
  Logistic regression in MLlib
  Selecting only the most predictable features
  Random forest in MLlib
Summary
6. Introducing the ML Package
Overview of the package
  Transformer
  Estimators
  Classification
  Regression
  Clustering
  Pipeline
Predicting the chances of infant survival with ML
  Loading the data
  Creating transformers
  Creating an estimator
  Creating a pipeline
  Fitting the model
  Evaluating the performance of the model
  Saving the model
Parameter hyper-tuning
  Grid search
  Train-validation splitting
Other features of PySpark ML in action
  Feature extraction
  NLP-related feature extractors
  Discretizing continuous variables
  Standardizing continuous variables
  Classification
  Clustering
  Finding clusters in the births dataset
  Topic mining
  Regression
Summary
7. GraphFrames
Introducing GraphFrames
Installing GraphFrames
Creating a library
Preparing your flights dataset
Building the graph
Executing simple queries
  Determining the number of airports and trips
  Determining the longest delay in this dataset
  Determining the number of delayed versus on-time/early flights
  What flights departing Seattle are most likely to have significant delays?
  What states tend to have significant delays departing from Seattle?
Understanding vertex degrees
Determining the top transfer airports
Understanding motifs
Determining airport ranking using PageRank
Determining the most popular non-stop flights
Using Breadth-First Search
Visualizing flights using D3
Summary
8. TensorFrames
What is Deep Learning?
  The need for neural networks and Deep Learning
  What is feature engineering?
  Bridging the data and algorithm
What is TensorFlow?
  Installing Pip
  Installing TensorFlow
  Matrix multiplication using constants
  Matrix multiplication using placeholders
  Running the model
  Running another model
  Discussion
Introducing TensorFrames
TensorFrames – quick start
  Configuration and setup
  Launching a Spark cluster
  Creating a TensorFrames library
  Installing TensorFlow on your cluster
  Using TensorFlow to add a constant to an existing column
  Executing the Tensor graph
  Blockwise reducing operations example
  Building a DataFrame of vectors
  Analysing the DataFrame
  Computing elementwise sum and min of all vectors
Summary
9. Polyglot Persistence with Blaze
Installing Blaze
Polyglot persistence
Abstracting data
  Working with NumPy arrays
  Working with pandas' DataFrame
  Working with files
  Working with databases
  Interacting with relational databases
  Interacting with the MongoDB database
Data operations
  Accessing columns
  Symbolic transformations
  Operations on columns
  Reducing data
  Joins
Summary
10. Structured Streaming
What is Spark Streaming?
Why do we need Spark Streaming?
What is the Spark Streaming application data flow?
Simple streaming application using DStreams
A quick primer on global aggregations
Introducing Structured Streaming
Summary
11. Packaging Spark Applications
The spark-submit command
  Command line parameters
Deploying the app programmatically
  Configuring your SparkSession
  Creating SparkSession
  Modularizing code
  Structure of the module
  Calculating the distance between two points
  Converting distance units
  Building an egg
  User defined functions in Spark
  Submitting a job
  Monitoring execution
Databricks Jobs
Summary
Index

Content preview from Learning PySpark

Speeding up PySpark with DataFrames

DataFrames, together with the Catalyst Optimizer and Project Tungsten, matter because they make PySpark queries substantially faster than non-optimized RDD queries. As shown in the following figure, prior to the introduction of DataFrames, Python queries were often twice as slow as the same Scala queries using RDDs. Typically, this slowdown was due to the communication overhead between Python and the JVM:

Source: Introducing DataFrames in Apache Spark for Large Scale Data Science at http://bit.ly/2blDBI1

With DataFrames, not only was there a significant improvement in Python ...
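
The contrast described in this preview can be sketched with a small example. This is a minimal illustration and not the book's own benchmark: the function names (`to_pair`, `merge`, `rdd_average`, `dataframe_average`, `main`), the sample records, and the `local[*]` master are all assumptions made for the sketch. It computes the same per-key average twice: once with RDD operations, where the Python functions run in Python worker processes and every record is exchanged with the JVM, and once with the DataFrame API, where the aggregation is expressed declaratively so Spark can plan and run it inside the JVM.

```python
def to_pair(record):
    """Map a (key, value) record to (key, (value, 1)) so averages can be reduced."""
    key, value = record
    return key, (value, 1)


def merge(a, b):
    """Combine two (sum, count) pairs. In the RDD version this runs in Python workers."""
    return a[0] + b[0], a[1] + b[1]


def rdd_average(sc, records):
    """Per-key average with RDDs: each function below is shipped to Python workers,
    so the data is serialized between the JVM and Python."""
    return (
        sc.parallelize(records)
        .map(to_pair)
        .reduceByKey(merge)
        .mapValues(lambda s: s[0] / s[1])
        .collectAsMap()
    )


def dataframe_average(spark, records):
    """Per-key average with DataFrames: the query is a declarative plan that Spark
    optimizes and executes in the JVM; only the small result returns to Python."""
    from pyspark.sql import functions as F  # imported here so the helpers above load without Spark

    df = spark.createDataFrame(records, ["key", "value"])
    rows = df.groupBy("key").agg(F.avg("value").alias("avg")).collect()
    return {row["key"]: row["avg"] for row in rows}


def main():
    """Run both versions on hypothetical sample data (requires a local PySpark install)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("rdd-vs-df").getOrCreate()
    records = [("a", 1.0), ("a", 3.0), ("b", 10.0)]  # made-up sample records
    print(rdd_average(spark.sparkContext, records))
    print(dataframe_average(spark, records))
    spark.stop()
```

Calling `main()` on a machine with PySpark installed runs both versions on the same records. On data this small the timings are not meaningful; the sketch only shows where the Python functions sit in each approach, which is where the overhead discussed above comes from.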
