Apache Spark for Data Science Cookbook

Book Description

Over 90 insightful recipes to get lightning-fast analytics with Apache Spark

About This Book

  • Use Apache Spark for data processing with these hands-on recipes
  • Implement end-to-end, large-scale data analysis better than ever before
  • Work with powerful libraries such as MLlib, SciPy, NumPy, and Pandas to gain insights from your data

Who This Book Is For

This book is for novice and intermediate level data science professionals and data analysts who want to solve data science problems with a distributed computing framework. Basic experience with data science implementation tasks is expected. Data science professionals looking to skill up and gain an edge in the field will find this book helpful.

What You Will Learn

  • Explore the topics of data mining, text mining, Natural Language Processing, information retrieval, and machine learning.
  • Solve real-world analytical problems with large data sets.
  • Address data science challenges with analytical tools on a distributed system like Spark (apt for iterative algorithms), which offers in-memory processing and more flexibility for data analysis at scale.
  • Get hands-on experience with algorithms such as classification, regression, and recommendation on real datasets using the Spark MLlib package.
  • Learn about numerical and scientific computing using NumPy and SciPy on Spark.
  • Use Predictive Model Markup Language (PMML) in Spark for statistical data mining models.

In Detail

Spark has emerged as the most promising big data analytics engine for data science professionals. The true power and value of Apache Spark lie in its ability to execute data science tasks with speed and accuracy. Spark’s selling point is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. It lets you tackle, with ease, the complexities that come with raw, unstructured data sets.

This guide will get you comfortable and confident performing data science tasks with Spark. You will learn about implementations including distributed deep learning, numerical computing, and scalable machine learning. You will be shown effective solutions to difficult problems in data science using Spark’s data science libraries such as MLlib, Pandas, NumPy, SciPy, and more. These simple and efficient recipes will show you how to implement algorithms and optimize your work.

Style and approach

This book contains a comprehensive range of recipes designed to help you learn the fundamentals and tackle the difficulties of data science. This book outlines practical steps to produce powerful insights into Big Data through a recipe-based approach.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Apache Spark for Data Science Cookbook
    1. Apache Spark for Data Science Cookbook
    2. Credits
    3. About the Author
    4. About the Reviewer
    5. www.PacktPub.com
      1. Why subscribe?
    6. Customer Feedback
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Sections
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Conventions
      6. Reader feedback
      7. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Big Data Analytics with Spark
      1. Introduction
      2. Initializing SparkContext
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      3. Working with Spark's Python and Scala shells
        1. How to do it…
        2. How it works…
        3. There's more…
        4. See also
      4. Building standalone applications
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Working with the Spark programming model
        1. How to do it…
        2. How it works…
        3. There's more…
        4. See also
      6. Working with pair RDDs
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      7. Persisting RDDs
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      8. Loading and saving data
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      9. Creating broadcast variables and accumulators
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      10. Submitting applications to a cluster
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      11. Working with DataFrames
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      12. Working with Spark Streaming
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
    9. 2. Tricky Statistics with Spark
      1. Introduction
        1. Working with Pandas
      2. Variable identification
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      3. Sampling data
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      4. Summary and descriptive statistics
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Generating frequency tables
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      6. Installing Pandas on Linux
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      7. Installing Pandas from source
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      8. Using IPython with PySpark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      9. Creating Pandas DataFrames over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      10. Splitting, slicing, sorting, filtering, and grouping DataFrames over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      11. Implementing covariance and correlation using Pandas
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      12. Concatenating and merging operations over DataFrames
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      13. Complex operations over DataFrames
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      14. Sparkling Pandas
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
    10. 3. Data Analysis with Spark
      1. Introduction
      2. Univariate analysis
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      3. Bivariate analysis
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      4. Missing value treatment
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Outlier detection
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      6. Use case - analyzing the MovieLens dataset
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      7. Use case - analyzing the Uber dataset
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
    11. 4. Clustering, Classification, and Regression
      1. Introduction
      2. Supervised learning
      3. Unsupervised learning
      4. Applying regression analysis for sales data
      5. Variable identification
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      6. Data exploration
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      7. Feature engineering
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      8. Applying linear regression
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      9. Applying logistic regression on bank marketing data
      10. Variable identification
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      11. Data exploration
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      12. Feature engineering
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      13. Applying logistic regression
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      14. Real-time intrusion detection using streaming k-means
      15. Variable identification
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      16. Simulating real-time data
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      17. Applying streaming k-means
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
    12. 5. Working with Spark MLlib
      1. Introduction
      2. Working with Spark ML pipelines
      3. Implementing Naive Bayes' classification
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      4. Implementing decision trees
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Building a recommendation system
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      6. Implementing logistic regression using Spark ML pipelines
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
    13. 6. NLP with Spark
      1. Introduction
      2. Installing NLTK on Linux
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      3. Installing Anaconda on Linux
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      4. Anaconda for cluster management
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. POS tagging with PySpark on an Anaconda cluster
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      6. NER with IPython over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      7. Implementing OpenNLP - chunker over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      8. Implementing OpenNLP - sentence detector over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      9. Implementing Stanford NLP - lemmatization over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      10. Implementing sentiment analysis using Stanford NLP over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
    14. 7. Working with Sparkling Water - H2O
      1. Introduction
      2. Features
      3. Working with H2O on Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      4. Implementing k-means using H2O over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Implementing spam detection with Sparkling Water
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      6. Deep learning with airlines and weather data
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      7. Implementing a crime detection application
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      8. Running SVM with H2O over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
    15. 8. Data Visualization with Spark
      1. Introduction
      2. Visualization using Zeppelin
        1. Getting ready
        2. How to do it…
      3. Installing Zeppelin
      4. Customizing Zeppelin's server and websocket port
      5. Visualizing data on HDFS - parameterizing inputs
      6. Running custom functions
      7. Adding external dependencies to Zeppelin
      8. Pointing to an external Spark Cluster
        1. How to do it…
        2. How it works…
        3. There's more…
        4. See also
      9. Creating scatter plots with Bokeh-Scala
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      10. Creating a time series MultiPlot with Bokeh-Scala
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      11. Creating plots with the lightning visualization server
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      12. Visualize machine learning models with Databricks notebook
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
    16. 9. Deep Learning on Spark
      1. Introduction
      2. Installing CaffeOnSpark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      3. Working with CaffeOnSpark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      4. Running a feed-forward neural network with DeepLearning4j over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Running an RBM with DeepLearning4j over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      6. Running a CNN for learning MNIST with DeepLearning4j over Spark
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      7. Installing TensorFlow
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      8. Working with Spark TensorFlow
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
    17. 10. Working with SparkR
      1. Introduction
      2. Installing R
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      3. Interactive analysis with the SparkR shell
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      4. Creating a SparkR standalone application from RStudio
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      5. Creating SparkR DataFrames
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      6. SparkR DataFrame operations
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      7. Applying user-defined functions in SparkR
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      8. Running SQL queries from SparkR and caching DataFrames
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also
      9. Machine learning with SparkR
        1. Getting ready
        2. How to do it…
        3. How it works…
        4. There's more…
        5. See also