Book description
Over 90 insightful recipes to get lightning-fast analytics with Apache Spark
About This Book
 Use Apache Spark for data processing with these hands-on recipes
 Implement end-to-end, large-scale data analysis better than ever before
 Work with powerful libraries such as MLlib, SciPy, NumPy, and Pandas to gain insights from your data
Who This Book Is For
This book is for novice and intermediate-level data science professionals and data analysts who want to solve data science problems with a distributed computing framework. Basic experience with data science implementation tasks is expected. Data science professionals looking to skill up and gain an edge in the field will find this book helpful.
What You Will Learn
 Explore the topics of data mining, text mining, Natural Language Processing, information retrieval, and machine learning.
 Solve real-world analytical problems with large data sets.
 Address data science challenges with analytical tools on a distributed system like Spark (apt for iterative algorithms), which offers in-memory processing and more flexibility for data analysis at scale.
 Get hands-on experience with algorithms like classification, regression, and recommendation on real datasets using the Spark MLlib package.
 Learn about numerical and scientific computing using NumPy and SciPy on Spark.
 Use Predictive Model Markup Language (PMML) in Spark for statistical data mining models.
In Detail
Spark has emerged as the most promising big data analytics engine for data science professionals. The true power and value of Apache Spark lie in its ability to execute data science tasks with speed and accuracy. Spark's selling point is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. It lets you tackle the complexities that come with raw unstructured data sets with ease.
This guide will get you comfortable and confident performing data science tasks with Spark. You will learn about implementations including distributed deep learning, numerical computing, and scalable machine learning. You will be shown effective solutions to problematic concepts in data science using Spark's data science libraries such as MLlib, Pandas, NumPy, SciPy, and more. These simple and efficient recipes will show you how to implement algorithms and optimize your work.
Style and approach
This book contains a comprehensive range of recipes designed to help you learn the fundamentals and tackle the difficulties of data science. This book outlines practical steps to produce powerful insights into Big Data through a recipe-based approach.
Table of contents

Apache Spark for Data Science Cookbook
 Apache Spark for Data Science Cookbook
 Credits
 About the Author
 About the Reviewer
 www.PacktPub.com
 Customer Feedback
 Preface

1. Big Data Analytics with Spark
 Introduction
 Initializing SparkContext
 Working with Spark's Python and Scala shells
 Building standalone applications
 Working with the Spark programming model
 Working with pair RDDs
 Persisting RDDs
 Loading and saving data
 Creating broadcast variables and accumulators
 Submitting applications to a cluster
 Working with DataFrames
 Working with Spark Streaming

2. Tricky Statistics with Spark
 Introduction
 Variable identification
 Sampling data
 Summary and descriptive statistics
 Generating frequency tables
 Installing Pandas on Linux
 Installing Pandas from source
 Using IPython with PySpark
 Creating Pandas DataFrames over Spark
 Splitting, slicing, sorting, filtering, and grouping DataFrames over Spark
 Implementing covariance and correlation using Pandas
 Concatenating and merging operations over DataFrames
 Complex operations over DataFrames
 Sparkling Pandas

3. Data Analysis with Spark

4. Clustering, Classification, and Regression
 Introduction
 Supervised learning
 Unsupervised learning
 Applying regression analysis for sales data
 Variable identification
 Data exploration
 Feature engineering
 Applying linear regression
 Applying logistic regression on bank marketing data
 Variable identification
 Data exploration
 Feature engineering
 Applying logistic regression
 Real-time intrusion detection using streaming k-means
 Variable identification
 Simulating real-time data
 Applying streaming k-means

5. Working with Spark MLlib

6. NLP with Spark
 Introduction
 Installing NLTK on Linux
 Installing Anaconda on Linux
 Anaconda for cluster management
 POS tagging with PySpark on an Anaconda cluster
 NER with IPython over Spark
 Implementing OpenNLP - chunker over Spark
 Implementing OpenNLP - sentence detector over Spark
 Implementing Stanford NLP - lemmatization over Spark
 Implementing sentiment analysis using Stanford NLP over Spark

7. Working with Sparkling Water - H2O

8. Data Visualization with Spark
 Introduction
 Visualization using Zeppelin
 Installing Zeppelin
 Customizing Zeppelin's server and websocket port
 Visualizing data on HDFS - parameterizing inputs
 Running custom functions
 Adding external dependencies to Zeppelin
 Pointing to an external Spark Cluster
 Creating scatter plots with Bokeh-Scala
 Creating a time series MultiPlot with Bokeh-Scala
 Creating plots with the lightning visualization server
 Visualize machine learning models with Databricks notebook

9. Deep Learning on Spark

10. Working with SparkR
 Introduction
 Installing R
 Interactive analysis with the SparkR shell
 Creating a SparkR standalone application from RStudio
 Creating SparkR DataFrames
 SparkR DataFrame operations
 Applying user-defined functions in SparkR
 Running SQL queries from SparkR and caching DataFrames
 Machine learning with SparkR
Product information
 Title: Apache Spark for Data Science Cookbook
 Author(s):
 Release date: December 2016
 Publisher(s): Packt Publishing
 ISBN: 9781785880100