Apache Spark for Data Science Cookbook

Name: Apache Spark for Data Science Cookbook
Author: Padma Priya Chitturi
ISBN: 9781785880100

by Padma Priya Chitturi

December 2016

Beginner to intermediate

392 pages

8h 13m

English

Packt Publishing

Apache Spark for Data Science Cookbook
Credits
About the Author
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for

Sections
Getting ready | How to do it… | How it works… | There's more… | See also
Conventions
Reader feedback
Customer support
Downloading the example code | Errata | Piracy | Questions
1. Big Data Analytics with Spark
Introduction
Initializing SparkContext
Getting ready | How to do it… | How it works… | There's more… | See also
Working with Spark's Python and Scala shells
How to do it… | How it works… | There's more… | See also
Building standalone applications
Getting ready | How to do it… | How it works… | There's more… | See also
Working with the Spark programming model
How to do it… | How it works… | There's more… | See also
Working with pair RDDs
Getting ready | How to do it… | How it works… | There's more… | See also
Persisting RDDs
Getting ready | How to do it… | How it works… | There's more… | See also
Loading and saving data
Getting ready | How to do it… | How it works… | There's more… | See also
Creating broadcast variables and accumulators
Getting ready | How to do it… | How it works… | There's more… | See also
Submitting applications to a cluster
Getting ready | How to do it… | How it works… | There's more… | See also
Working with DataFrames
Getting ready | How to do it… | How it works… | There's more… | See also
Working with Spark Streaming
Getting ready | How to do it… | How it works… | There's more… | See also
2. Tricky Statistics with Spark
Introduction
Working with Pandas
Variable identification
Getting ready | How to do it… | How it works… | There's more… | See also
Sampling data
Getting ready | How to do it… | How it works… | There's more… | See also
Summary and descriptive statistics
Getting ready | How to do it… | How it works… | There's more… | See also
Generating frequency tables
Getting ready | How to do it… | How it works… | There's more… | See also
Installing Pandas on Linux
Getting ready | How to do it… | How it works… | There's more… | See also
Installing Pandas from source
Getting ready | How to do it… | How it works… | There's more… | See also
Using IPython with PySpark
Getting ready | How to do it… | How it works… | There's more… | See also
Creating Pandas DataFrames over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Splitting, slicing, sorting, filtering, and grouping DataFrames over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing co-variance and correlation using Pandas
Getting ready | How to do it… | How it works… | There's more… | See also
Concatenating and merging operations over DataFrames
Getting ready | How to do it… | How it works… | There's more… | See also
Complex operations over DataFrames
Getting ready | How to do it… | How it works… | There's more… | See also
Sparkling Pandas
Getting ready | How to do it… | How it works… | There's more… | See also
3. Data Analysis with Spark
Introduction
Univariate analysis
Getting ready | How to do it… | How it works… | There's more… | See also
Bivariate analysis
Getting ready | How to do it… | How it works… | There's more… | See also
Missing value treatment
Getting ready | How to do it… | How it works… | There's more… | See also
Outlier detection
Getting ready | How to do it… | How it works… | There's more… | See also
Use case - analyzing the MovieLens dataset
Getting ready | How to do it… | How it works… | There's more… | See also
Use case - analyzing the Uber dataset
Getting ready | How to do it… | How it works… | There's more… | See also
4. Clustering, Classification, and Regression
Introduction
Supervised learning
Unsupervised learning
Applying regression analysis for sales data
Variable identification
Getting ready | How to do it… | How it works… | There's more… | See also
Data exploration
Getting ready | How to do it… | How it works… | There's more… | See also
Feature engineering
Getting ready | How to do it… | How it works… | There's more… | See also
Applying linear regression
Getting ready | How to do it… | How it works… | There's more… | See also
Applying logistic regression on bank marketing data
Variable identification
Getting ready | How to do it… | How it works… | There's more… | See also
Data exploration
Getting ready | How to do it… | How it works… | There's more… | See also
Feature engineering
Getting ready | How to do it… | How it works… | There's more… | See also
Applying logistic regression
Getting ready | How to do it… | How it works… | There's more… | See also
Real-time intrusion detection using streaming k-means
Variable identification
Getting ready | How to do it… | How it works… | There's more… | See also
Simulating real-time data
Getting ready | How to do it… | How it works… | There's more… | See also
Applying streaming k-means
Getting ready | How to do it… | How it works… | There's more… | See also
5. Working with Spark MLlib
Introduction
Working with Spark ML pipelines
Implementing Naive Bayes' classification
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing decision trees
Getting ready | How to do it… | How it works… | There's more… | See also
Building a recommendation system
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing logistic regression using Spark ML pipelines
Getting ready | How to do it… | How it works… | There's more… | See also
6. NLP with Spark
Introduction
Installing NLTK on Linux
Getting ready | How to do it… | How it works… | There's more… | See also
Installing Anaconda on Linux
Getting ready | How to do it… | How it works… | There's more… | See also
Anaconda for cluster management
Getting ready | How to do it… | How it works… | There's more… | See also
POS tagging with PySpark on an Anaconda cluster
Getting ready | How to do it… | How it works… | There's more… | See also
NER with IPython over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing openNLP - chunker over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing openNLP - sentence detector over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing stanford NLP - lemmatization over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing sentiment analysis using stanford NLP over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
7. Working with Sparkling Water - H2O
Introduction
Features
Working with H2O on Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing k-means using H2O over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing spam detection with Sparkling Water
Getting ready | How to do it… | How it works… | There's more… | See also
Deep learning with airlines and weather data
Getting ready | How to do it… | How it works… | There's more… | See also
Implementing a crime detection application
Getting ready | How to do it… | How it works… | There's more… | See also
Running SVM with H2O over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
8. Data Visualization with Spark
Introduction
Visualization using Zeppelin
Getting ready | How to do it…
Installing Zeppelin
Customizing Zeppelin's server and websocket port
Visualizing data on HDFS - parameterizing inputs
Running custom functions
Adding external dependencies to Zeppelin
Pointing to an external Spark Cluster
How to do it… | How it works… | There's more… | See also
Creating scatter plots with Bokeh-Scala
Getting ready | How to do it… | How it works… | There's more… | See also
Creating a time series MultiPlot with Bokeh-Scala
Getting ready | How to do it… | How it works… | There's more… | See also
Creating plots with the lightning visualization server
Getting ready | How to do it… | How it works… | There's more… | See also
Visualize machine learning models with Databricks notebook
Getting ready | How to do it… | How it works… | There's more… | See also
9. Deep Learning on Spark
Introduction
Installing CaffeOnSpark
Getting ready | How to do it… | How it works… | There's more… | See also
Working with CaffeOnSpark
Getting ready | How to do it… | How it works… | There's more… | See also
Running a feed-forward neural network with DeepLearning 4j over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Running an RBM with DeepLearning4j over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Running a CNN for learning MNIST with DeepLearning4j over Spark
Getting ready | How to do it… | How it works… | There's more… | See also
Installing TensorFlow
Getting ready | How to do it… | How it works… | There's more… | See also
Working with Spark TensorFlow
Getting ready | How to do it… | How it works… | There's more… | See also
10. Working with SparkR
Introduction
Installing R
Getting ready | How to do it… | How it works… | There's more… | See also
Interactive analysis with the SparkR shell
Getting ready | How to do it… | How it works… | There's more… | See also
Creating a SparkR standalone application from RStudio
Getting ready | How to do it… | How it works… | There's more… | See also
Creating SparkR DataFrames
Getting ready | How to do it… | How it works… | There's more… | See also
SparkR DataFrame operations
Getting ready | How to do it… | How it works… | There's more… | See also
Applying user-defined functions in SparkR
Getting ready | How to do it… | How it works… | There's more… | See also
Running SQL queries from SparkR and caching DataFrames
Getting ready | How to do it… | How it works… | There's more… | See also
Machine learning with SparkR
Getting ready | How to do it… | How it works… | There's more… | See also

Content preview from Apache Spark for Data Science Cookbook

Complex operations over DataFrames

This recipe shows how to perform complex operations, such as computing the difference between values in a column, on Pandas DataFrames as well as Spark DataFrames.

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos). You will also need Python and IPython installed on a Linux machine, for example, Ubuntu 14.04.

How to do it…

Invoke ipython console --profile=pyspark:

    In [1]: from pyspark import SparkConf, SparkContext, SQLContext
    In [2]: import pandas as pd
    In [3]: sqlCtx = SQLContext(sc)

Computing diff on a column in Pandas:

    In [4]: df = sqlCtx.createDataFrame([(1, 4), (1, 5), (2, 6), (2, 6), (3, 0)], ["A", "B"]) ...
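
The preview is cut off at this point, so the rest of the recipe's code is not available here. As a minimal sketch of what computing a diff on a column can look like in Pandas (not necessarily the author's approach), the example below rebuilds the same five rows as a Pandas DataFrame and uses `Series.diff()`. The column names `B_diff` and `B_diff_by_A` are hypothetical, and the per-group variant is an assumption about one reasonable reading of the data.

```python
import pandas as pd

# Same illustrative rows as the Spark DataFrame created in the recipe.
rows = [(1, 4), (1, 5), (2, 6), (2, 6), (3, 0)]
df = pd.DataFrame(rows, columns=["A", "B"])

# Difference between each value of B and the value in the previous row.
# The first row has no predecessor, so its difference is NaN.
df["B_diff"] = df["B"].diff()

# Assumed variant: compute the difference separately within each group of A,
# so the first row of every group is NaN.
df["B_diff_by_A"] = df.groupby("A")["B"].diff()

print(df)
```

On a Spark DataFrame, a comparable result is typically obtained with `pyspark.sql.functions.lag` over a `pyspark.sql.Window` specification, subtracting the lagged column from the current one. Unlike Pandas, Spark has no implicit row order, so the window needs an explicit `orderBy`.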
