Machine Learning with Apache Spark Quick Start Guide

Combine advanced analytics, including Machine Learning, Deep Learning Neural Networks, and Natural Language Processing, with modern scalable technologies such as Apache Spark to derive actionable insights from Big Data in real time.

Key Features

  • Make a hands-on start in the fields of Big Data, Distributed Technologies and Machine Learning
  • Learn how to design, develop and interpret the results of common Machine Learning algorithms
  • Uncover hidden patterns in your data in order to derive real actionable insights and business value

Book Description

Every person and every organization in the world manages data, whether they realize it or not. Data is used to describe the world around us and can be used for almost any purpose, from analyzing consumer habits to fighting disease and serious organized crime. Ultimately, we manage data in order to derive value from it, and many organizations around the world have traditionally invested in technology to help process their data faster and more efficiently.

But we now live in an interconnected world driven by mass data creation and consumption, where data is no longer rows and columns restricted to a spreadsheet but an organic and evolving asset in its own right. With this realization come major challenges for organizations: how do we manage the sheer size of data being created every second (think not only spreadsheets and databases, but also social media posts, images, videos, music, blogs, and so on)? And once we can manage all of this data, how do we derive real value from it?

The focus of Machine Learning with Apache Spark is to help us answer these questions in a hands-on manner. We introduce the latest scalable technologies to help us manage and process big data. We then introduce advanced analytical algorithms applied to real-world use cases in order to uncover patterns, derive actionable insights, and learn from this big data.

What you will learn

  • Understand how Spark fits in the context of the big data ecosystem
  • Understand how to deploy and configure a local development environment using Apache Spark
  • Understand how to design supervised and unsupervised learning models
  • Build models for NLP, deep learning, and cognitive services using Spark ML libraries
  • Design real-time machine learning pipelines in Apache Spark
  • Become familiar with advanced techniques for processing a large volume of data by applying machine learning algorithms

Who this book is for

This book is aimed at Business Analysts, Data Analysts and Data Scientists who wish to make a hands-on start in order to take advantage of modern Big Data technologies combined with Advanced Analytics.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Machine Learning with Apache Spark Quick Start Guide
  3. Dedication
  4. About Packt
    1. Why subscribe?
    2. Packt.com
  5. Contributors
    1. About the author
    2. About the reviewer
    3. Packt is searching for authors like you
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Conventions used
    4. Get in touch
      1. Reviews
  7. The Big Data Ecosystem
    1. A brief history of data
      1. Vertical scaling
      2. Master/slave architecture
      3. Sharding
      4. Data processing and analysis
      5. Data becomes big
    2. Big data ecosystem
      1. Horizontal scaling
      2. Distributed systems
        1. Distributed data stores
        2. Distributed filesystems
        3. Distributed databases
        4. NoSQL databases
          1. Document databases
          2. Columnar databases
          3. Key-value databases
          4. Graph databases
          5. CAP theorem
        5. Distributed search engines
        6. Distributed processing
          1. MapReduce
          2. Apache Spark
          3. RDDs, DataFrames, and datasets
          4. RDDs
          5. DataFrames
          6. Datasets
          7. Jobs, stages, and tasks
          8. Job
          9. Stage
          10. Tasks
        7. Distributed messaging
        8. Distributed streaming
        9. Distributed ledgers
      3. Artificial intelligence and machine learning
      4. Cloud computing platforms
      5. Data insights platform
        1. Reference logical architecture
          1. Data sources layer
          2. Ingestion layer
          3. Persistent data storage layer
          4. Data processing layer
          5. Serving data storage layer
          6. Data intelligence layer
          7. Unified access layer
          8. Data insights and reporting layer
          9. Platform governance, management, and administration
        2. Open source implementation
    3. Summary
  8. Setting Up a Local Development Environment
    1. CentOS Linux 7 virtual machine
      1. Java SE Development Kit 8
      2. Scala 2.11
      3. Anaconda 5 with Python 3
        1. Basic conda commands
        2. Additional Python packages
        3. Jupyter Notebook
          1. Starting Jupyter Notebook
          2. Troubleshooting Jupyter Notebook
      4. Apache Spark 2.3
        1. Spark binaries
        2. Local working directories
        3. Spark configuration
          1. Spark properties
          2. Environmental variables
        4. Standalone master server
        5. Spark worker node
        6. PySpark and Jupyter Notebook
      5. Apache Kafka 2.0
        1. Kafka binaries
        2. Local working directories
        3. Kafka configuration
        4. Start the Kafka server
        5. Testing Kafka
    2. Summary
  9. Artificial Intelligence and Machine Learning
    1. Artificial intelligence
    2. Machine learning
      1. Supervised learning
      2. Unsupervised learning
      3. Reinforced learning
    3. Deep learning
      1. Natural neuron
      2. Artificial neuron
        1. Weights
        2. Activation function
          1. Heaviside step function
          2. Sigmoid function
          3. Hyperbolic tangent function
      3. Artificial neural network
        1. Single-layer perceptron
        2. Multi-layer perceptron
    4. NLP
    5. Cognitive computing
    6. Machine learning pipelines in Apache Spark
    7. Summary
  10. Supervised Learning Using Apache Spark
    1. Linear regression
      1. Case study – predicting bike sharing demand
      2. Univariate linear regression
        1. Residuals
        2. Root mean square error
        3. R-squared
        4. Univariate linear regression in Apache Spark
      3. Multivariate linear regression
        1. Correlation
        2. Multivariate linear regression in Apache Spark
    2. Logistic regression
      1. Threshold value
      2. Confusion matrix
      3. Receiver operator characteristic curve
        1. Area under the ROC curve
      4. Case study – predicting breast cancer
    3. Classification and Regression Trees
      1. Case study – predicting political affiliation
      2. Random forests
        1. K-Fold cross validation
    4. Summary
  11. Unsupervised Learning Using Apache Spark
    1. Clustering
      1. Euclidean distance
      2. Hierarchical clustering
      3. K-means clustering
        1. Case study – detecting brain tumors
        2. Feature vectors from images
        3. Image segmentation
        4. K-means cost function
        5. K-means clustering in Apache Spark
    2. Principal component analysis
      1. Case study – movie recommendation system
      2. Covariance matrix
      3. Identity matrix
      4. Eigenvectors and eigenvalues
      5. PCA in Apache Spark
    3. Summary
  12. Natural Language Processing Using Apache Spark
    1. Feature transformers
      1. Document
      2. Corpus
      3. Preprocessing pipeline
        1. Tokenization
        2. Stop words
        3. Stemming
        4. Lemmatization
        5. Normalization
    2. Feature extractors
      1. Bag of words
      2. Term frequency–inverse document frequency
    3. Case study – sentiment analysis
      1. NLP pipeline
      2. NLP in Apache Spark
    4. Summary
  13. Deep Learning Using Apache Spark
    1. Artificial neural networks
      1. Multilayer perceptrons
        1. MLP classifier
        2. Input layer
        3. Hidden layers
        4. Output layer
      2. Case study 1 – OCR
        1. Input data
        2. Training architecture
        3. Detecting patterns in the hidden layer
        4. Classifying in the output layer
        5. MLPs in Apache Spark
      3. Convolutional neural networks
        1. End-to-end neural architecture
        2. Input layer
        3. Convolution layers
        4. Rectified linear units
        5. Pooling layers
        6. Fully connected layer
        7. Output layer
      4. Case study 2 – image recognition
        1. InceptionV3 via TensorFlow
        2. Deep learning pipelines for Apache Spark
        3. Image library
        4. PySpark image recognition application
        5. Spark submit
        6. Image-recognition results
      5. Case study 3 – image prediction
        1. PySpark image-prediction application
        2. Image-prediction results
    2. Summary
  14. Real-Time Machine Learning Using Apache Spark
    1. Distributed streaming platform
    2. Distributed stream processing engines
      1. Streaming using Apache Spark
        1. Spark Streaming (DStreams)
        2. Structured Streaming
    3. Stream processing pipeline
      1. Case study – real-time sentiment analysis
        1. Start Zookeeper and Kafka Servers
        2. Kafka topic
        3. Twitter developer account
        4. Twitter apps and the Twitter API
        5. Application configuration
        6. Kafka Twitter producer application
        7. Preprocessing and feature vectorization pipelines
        8. Kafka Twitter consumer application
    4. Summary
  15. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think