Learning Apache Spark 2

Name: Learning Apache Spark 2
Author: Muhammad Asif Abbasi
ISBN: 9781785885136

by Muhammad Asif Abbasi

March 2017

Beginner to intermediate

356 pages

7h 11m

English

Packt Publishing

Credits
About the Author
About the Reviewers
www.packtpub.com
Why subscribe?
Customer Feedback
Preface
The Past
Why are people so excited about Spark?
What this book covers
What you need for this book

Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Architecture and Installation
Apache Spark architecture overview
Spark-core
Spark SQL
Spark streaming
MLlib
GraphX
Spark deployment
Installing Apache Spark
Writing your first Spark program
Scala shell examples
Python shell examples
Spark architecture
High level overview
Driver program
Cluster Manager
Worker
Executors
Tasks
SparkContext
Spark Session
Apache Spark cluster manager types
Building standalone applications with Apache Spark
Submitting applications
Deployment strategies
Running Spark examples
Building your own programs
Brain teasers
References
Summary
2. Transformations and Actions with Spark RDDs
What is an RDD?
Constructing RDDs
Parallelizing existing collections
Referencing external data source
Operations on RDD
Transformations
Actions
Passing functions to Spark (Scala)
Anonymous functions
Static singleton functions
Passing functions to Spark (Java)
Passing functions to Spark (Python)
Transformations
Map(func)
Filter(func)
flatMap(func)
Sample (withReplacement, fraction, seed)
Set operations in Spark
Distinct()
Intersection()
Union()
Subtract()
Cartesian()
Actions
Reduce(func)
Collect()
Count()
Take(n)
First()
SaveAsXXFile()
foreach(func)
PairRDDs
Creating PairRDDs
PairRDD transformations
reduceByKey(func)
GroupByKey(func)
reduceByKey vs. groupByKey - Performance Implications
CombineByKey(func)
Transformations on two PairRDDs
Actions available on PairRDDs
Shared variables
Broadcast variables
Accumulators
References
Summary
3. ETL with Spark
What is ETL?
Extraction
Loading
Transformation
How is Spark being used?
Commonly Supported File Formats
Text Files
CSV and TSV Files
Writing CSV files
Tab Separated Files
JSON files
Sequence files
Object files
Commonly supported file systems
Working with HDFS
Working with Amazon S3
Structured Data sources and Databases
Working with NoSQL Databases
Working with Cassandra
Obtaining a Cassandra table as an RDD
Saving data to Cassandra
Working with HBase
Bulk Delete example
Map Partition Example
Working with MongoDB
Connection to MongoDB
Writing to MongoDB
Loading data from MongoDB
Working with Apache Solr
Importing the JAR File via Spark-shell
Connecting to Solr via DataFrame API
Connecting to Solr via RDD
References
Summary
4. Spark SQL
What is Spark SQL?
What is DataFrame API?
What is DataSet API?
What's new in Spark 2.0?
Under the hood - catalyst optimizer
Solution 1
Solution 2
The SparkSession
Creating a SparkSession
Creating a DataFrame
Manipulating a DataFrame
Scala DataFrame manipulation - examples
Python DataFrame manipulation - examples
R DataFrame manipulation - examples
Java DataFrame manipulation - examples
Reverting to an RDD from a DataFrame
Converting an RDD to a DataFrame
Other data sources
Parquet files
Working with Hive
Hive configuration
SparkSQL CLI
Working with other databases
References
Summary
5. Spark Streaming
What is Spark Streaming?
DStream
StreamingContext
Steps involved in a streaming app
Architecture of Spark Streaming
Input sources
Core/basic sources
Advanced sources
Custom sources
Transformations
Sliding window operations
Output operations
Caching and persistence
Checkpointing
Setting up checkpointing
Setting up checkpointing with Scala
Setting up checkpointing with Java
Setting up checkpointing with Python
Automatic driver restart
DStream best practices
Fault tolerance
Worker failure impact on receivers
Worker failure impact on RDDs/DStreams
Worker failure impact on output operations
What is Structured Streaming?
Under the hood
Structured Spark Streaming API: Entry point
Output modes
Append mode
Complete mode
Update mode
Output sinks
Failure recovery and checkpointing
References
Summary
6. Machine Learning with Spark
What is machine learning?
Why machine learning?
Types of machine learning
Introduction to Spark MLlib
Why do we need the Pipeline API?
How does it work?
Scala syntax - building a pipeline
Building a pipeline
Predictions on test documents
Python program - predictions on test documents
Feature engineering
Feature extraction algorithms
Feature transformation algorithms
Feature selection algorithms
Classification and regression
Classification
Regression
Clustering
Collaborative filtering
ML-tuning - model selection and hyperparameter tuning
References
Summary
7. GraphX
Graphs in everyday life
What is a graph?
Why are Graphs elegant?
What is GraphX?
Creating your first Graph (RDD API)
Code samples
Basic graph operators (RDD API)
List of graph operators (RDD API)
Caching and uncaching of graphs
Graph algorithms in GraphX
PageRank
Code example -- PageRank algorithm
Connected components
Code example -- connected components
Triangle counting
GraphFrames
Why GraphFrames?
Basic constructs of a GraphFrame
Motif finding
GraphFrames algorithms
Loading and saving of GraphFrames
Comparison between GraphFrames and GraphX
GraphX <=> GraphFrames
Converting from GraphFrame to GraphX
Converting from GraphX to GraphFrames
References
Summary
8. Operating in Clustered Mode
Clusters, nodes and daemons
Key bits about Spark Architecture
Running Spark in standalone mode
Installing Spark standalone on a cluster
Starting a Spark cluster manually
Cluster overview
Workers overview
Running applications and drivers overview
Completed applications and drivers overview
Using the Cluster Launch Scripts to Start a Standalone Cluster
Environment Properties
Connecting Spark-Shell, PySpark, and R-Shell to the cluster
Resource scheduling
Running Spark in YARN
Spark with a Hadoop Distribution (Cloudera)
Interactive Shell
Batch Application
Important YARN Configuration Parameters
Running Spark in Mesos
Before you start
Running in Mesos
Modes of operation in Mesos
Client Mode
Batch Applications
Interactive Applications
Cluster Mode
Steps to use the cluster mode
Mesos run modes
Key Spark on Mesos configuration properties
References
Summary
9. Building a Recommendation System
What is a recommendation system?
Types of recommendations
Manual recommendations
Simple aggregated recommendations based on Popularity
User-specific recommendations
User specific recommendations
Key issues with recommendation systems
Gathering known input data
Predicting unknown from known ratings
Content-based recommendations
Predicting unknown ratings
Pros and cons of content based recommendations
Collaborative filtering
Jaccard similarity
Cosine similarity
Centered cosine (Pearson Correlation)
Latent factor methods
Evaluating prediction method
Recommendation system in Spark
Sample dataset
How does Spark offer recommendation?
Importing relevant libraries
Defining the schema for ratings
Defining the schema for movies
Loading ratings and movies data
Data partitioning
Training an ALS model
Predicting the test dataset
Evaluating model performance
Using implicit preferences
Sanity checking
Model Deployment
References
Summary
10. Customer Churn Prediction
Overview of customer churn
Why is predicting customer churn important?
How do we predict customer churn with Spark?
Data set description
Code example
Defining schema
Loading data
Data exploration
PySpark import code
Exploring international minutes
Exploring night minutes
Exploring day minutes
Exploring eve minutes
Comparing minutes data for churners and non-churners
Comparing charge data for churners and non-churners
Exploring customer service calls
Scala code - constructing a scatter plot
Exploring the churn variable
Data transformation
Building a machine learning pipeline
References
Summary
There's More with Spark
Performance tuning
Data serialization
Memory tuning
Execution and storage
Tasks running in parallel
Operators within the same task
Memory management configuration options
Memory tuning key tips
I/O tuning
Data locality
Sizing up your executors
Calculating memory overhead
Setting aside memory/CPU for YARN application master
I/O throughput
Sample calculations
The skew problem
Security configuration in Spark
Kerberos authentication
Shared secrets
Shared secret on YARN
Shared secret on other cluster managers
Setting up Jupyter Notebook with Spark
What is a Jupyter Notebook?
Setting up a Jupyter Notebook
Securing the notebook server
Preparing a hashed password
Using Jupyter (only with version 5.0 and later)
Manually creating hashed password
Setting up PySpark on Jupyter
Shared variables
Broadcast variables
Accumulators
References
Summary

Content preview from Learning Apache Spark 2

The skew problem

Distributed systems, like teams of people working on an activity, perform best when the work is evenly distributed among all members of the team or the cluster. Both suffer if the work is unevenly distributed, because the system performs only as fast as its slowest component.

In the case of Spark, data is distributed across the cluster. You might have come across cases where a map job runs fairly quickly but your joins or shuffles take an excessive amount of time. In most real-life cases you would have popular keys or null values in your data, which results in some tasks getting more work than others and thus in a skewed system. In the database world, original keys would actually be used to create new keys ...
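The effect of a popular key can be sketched without a cluster. The following is a minimal simulation in plain Python, not Spark code and not necessarily the approach the book goes on to describe: it assumes a hash partitioner (imitated here with `zlib.crc32`), a made-up dataset in which 90% of the records share the key `popular`, and one commonly used mitigation, key salting, where a random suffix is appended to each key so that a hot key spreads over several partitions. The names `NUM_PARTITIONS`, `SALT_BUCKETS`, `salt`, and `two_stage_count` are hypothetical and exist only for this illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # assumed number of shuffle partitions
SALT_BUCKETS = 8     # hypothetical knob: how many ways a hot key is split


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions=NUM_PARTITIONS):
    """Number of records that land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt(key, rng, buckets=SALT_BUCKETS):
    """Derive a new key from the original one by appending a random suffix."""
    return f"{key}#{rng.randrange(buckets)}"


def two_stage_count(salted_keys):
    """Aggregate per salted key first, then strip the salt and merge."""
    partial = Counter(salted_keys)                   # stage 1: per salted key
    final = Counter()
    for salted_key, n in partial.items():            # stage 2: per original key
        final[salted_key.rsplit("#", 1)[0]] += n
    return final


rng = random.Random(42)

# Made-up data: one popular key dominates, the rest are unique.
keys = ["popular"] * 9000 + [f"user{i}" for i in range(1000)]
salted_keys = [salt(k, rng) for k in keys]

plain = partition_sizes(keys)
salted = partition_sizes(salted_keys)

# The largest partition bounds how long the slowest task runs.
print("largest partition without salting:", max(plain))
print("largest partition with salting:   ", max(salted))
print("count for 'popular' after merging:", two_stage_count(salted_keys)["popular"])
```

Without salting, every record for `popular` hashes to the same partition, so one task does most of the work. With salting, the same records are spread over several partitions, at the cost of a second aggregation step that removes the suffix and combines the partial results.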
