
Spark for Data Science

Name: Spark for Data Science
ISBN: 9781785885655

by Bikramaditya Singhal, Srinivas Duvvuri

September 2016

Beginner to intermediate

344 pages

7h 44m

English

Packt Publishing

Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for

Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Big Data and Data Science – An Introduction
Big data overview
Challenges with big data analytics
Computational challenges
Analytical challenges
Evolution of big data analytics
Spark for data analytics
The Spark stack
Spark core
Spark SQL
Spark streaming
MLlib
GraphX
SparkR
Summary
References
2. The Spark Programming Model
The programming paradigm
Supported programming languages
Scala
Java
Python
R
Choosing the right language
The Spark engine
Driver program
The Spark shell
SparkContext
Worker nodes
Executors
Shared variables
Flow of execution
The RDD API
RDD basics
Persistence
RDD operations
Creating RDDs
Transformations on normal RDDs
The filter operation
The distinct operation
The intersection operation
The union operation
The map operation
The flatMap operation
The keys operation
The cartesian operation
Transformations on pair RDDs
The groupByKey operation
The join operation
The reduceByKey operation
The aggregate operation
Actions
The collect() function
The count() function
The take(n) function
The first() function
The takeSample() function
The countByKey() function
Summary
References
3. Introduction to DataFrames
Why DataFrames?
Spark SQL
The Catalyst optimizer
The DataFrame API
DataFrame basics
RDDs versus DataFrames
Similarities
Differences
Creating DataFrames
Creating DataFrames from RDDs
Creating DataFrames from JSON
Creating DataFrames from databases using JDBC
Creating DataFrames from Apache Parquet
Creating DataFrames from other data sources
DataFrame operations
Under the hood
Summary
References
4. Unified Data Access
Data abstractions in Apache Spark
Datasets
Working with Datasets
Creating Datasets from JSON
Datasets API's limitations
Spark SQL
SQL operations
Under the hood
Structured Streaming
The Spark streaming programming model
Under the hood
Comparison with other streaming engines
Continuous applications
Summary
References
5. Data Analysis on Spark
Data analytics life cycle
Data acquisition
Data preparation
Data consolidation
Data cleansing
Missing value treatment
Outlier treatment
Duplicate values treatment
Data transformation
Basics of statistics
Sampling
Simple random sample
Systematic sampling
Stratified sampling
Data distributions
Frequency distributions
Probability distributions
Descriptive statistics
Measures of location
Mean
Median
Mode
Measures of spread
Range
Variance
Standard deviation
Summary statistics
Graphical techniques
Inferential statistics
Discrete probability distributions
Bernoulli distribution
Binomial distribution
Sample problem
Poisson distribution
Sample problem
Continuous probability distributions
Normal distribution
Standard normal distribution
Chi-square distribution
Sample problem
Student's t-distribution
F-distribution
Standard error
Confidence level
Margin of error and confidence interval
Variability in the population
Estimating sample size
Hypothesis testing
Null and alternate hypotheses
Chi-square test
F-test
Problem:
Correlations
Summary
References
6. Machine Learning
Introduction
The evolution
Supervised learning
Unsupervised learning
MLlib and the Pipeline API
MLlib
ML pipeline
Transformer
Estimator
Introduction to machine learning
Parametric methods
Non-parametric methods
Regression methods
Linear regression
Loss function
Optimization
Regularizations on regression
Ridge regression
Lasso regression
Elastic net regression
Classification methods
Logistic regression
Linear Support Vector Machines (SVM)
Linear kernel
Polynomial kernel
Radial Basis Function kernel
Sigmoid kernel
Training an SVM
Decision trees
Impurity measures
Gini Index
Entropy
Variance
Stopping rule
Split candidates
Categorical features
Continuous features
Advantages of decision trees
Disadvantages of decision trees
Example
Ensembles
Random forests
Advantages of random forests
Gradient-Boosted Trees
Multilayer perceptron classifier
Clustering techniques
K-means clustering
Disadvantages of k-means
Example
Summary
References
7. Extending Spark with SparkR
SparkR basics
Accessing SparkR from the R environment
RDDs and DataFrames
Getting started
Advantages and limitations
Programming with SparkR
Function name masking
Subsetting data
Column functions
Grouped data
SparkR DataFrames
SQL operations
Set operations
Merging DataFrames
Machine learning
The Naive Bayes model
The Gaussian GLM model
Summary
References
8. Analyzing Unstructured Data
Sources of unstructured data
Processing unstructured data
Count vectorizer
TF-IDF
Stop-word removal
Normalization/scaling
Word2Vec
n-gram modelling
Text classification
Naive Bayes classifier
Text clustering
K-means
Dimensionality reduction
Singular Value Decomposition
Principal Component Analysis
Summary
References
9. Visualizing Big Data
Why visualize data?
A data engineer's perspective
A data scientist's perspective
A business user's perspective
Data visualization tools
IPython notebook
Apache Zeppelin
Third-party tools
Data visualization techniques
Summarizing and visualizing
Subsetting and visualizing
Sampling and visualizing
Modeling and visualizing
Summary
References
Data source citations
10. Putting It All Together
A quick recap
Introducing a case study
The business problem
Data acquisition and data cleansing
Developing the hypothesis
Data exploration
Data preparation
Too many levels in a categorical variable
Numerical variables with too much variation
Missing data
Continuous data
Categorical data
Preparing the data
Model building
Data visualization
Communicating the results to business users
Summary
References
11. Building Data Science Applications
Scope of development
Expectations
Presentation options
Interactive notebooks
References
Web API
References
PMML and PFA
References
Development and testing
References
Data quality management
The Scala advantage
Spark development status
Spark 2.0's features and enhancements
Unifying Datasets and DataFrames
Structured Streaming
Project Tungsten phase 2
What's in store?
The big data trends
Summary
References

Content preview from Spark for Data Science

MLlib and the Pipeline API

Let us first learn some Spark fundamentals so that we can perform machine learning operations on it. This section discusses MLlib and the Pipeline API.

MLlib

MLlib is the machine learning library built on top of Apache Spark, and it houses most of the algorithms that can be implemented at scale. Its seamless integration with other Spark components, such as GraphX, Spark SQL, and Spark Streaming, lets developers assemble complex, scalable, and efficient workflows relatively easily. The library consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.

MLlib works in conjunction with the
