Mastering Apache Spark

Book Description

Gain expertise in processing and storing data by using advanced techniques with Apache Spark

About This Book

  • Explore the integration of Apache Spark with third-party applications such as H2O, Databricks, and Titan
  • Evaluate how Cassandra and HBase can be used for storage
  • An advanced guide that combines instructions and practical examples to extend the most up-to-date Spark functionality

Who This Book Is For

If you are a developer with some Spark experience and want to deepen your working knowledge of it, this book is ideal for you. Basic knowledge of Linux, Hadoop, and Spark is assumed, and a reasonable knowledge of Scala is expected.

What You Will Learn

  • Extend the tools available for processing and storage
  • Examine clustering and classification using MLlib
  • Discover Spark stream processing via Flume and HDFS
  • Create a schema in Spark SQL, and learn how a Spark schema can be populated with data
  • Study Spark-based graph processing using Spark GraphX
  • Combine Spark with H2O and deep learning, and learn why it is useful
  • Evaluate how graph storage works with Apache Spark, Titan, HBase and Cassandra
  • Use Apache Spark in the cloud with Databricks and AWS

In Detail

Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionality, such as graph processing, machine learning, stream processing, and SQL. It operates at unprecedented speeds, is easy to use, and offers a rich set of data transformations.

This book aims to take your existing knowledge of Spark to the next level by teaching you how to expand Spark functionality. The book commences with an overview of the Spark ecosystem. You will learn how to use MLlib to create a fully working neural net for handwriting recognition. You will then discover how stream processing can be tuned for optimal performance and to ensure parallel processing. The book then shows how to incorporate H2O for machine learning, Titan for graph-based storage, and Databricks for cloud-based Spark. Intermediate Scala-based code examples are provided for Apache Spark module processing in CentOS Linux and Databricks cloud environments.

Style and approach

This book is an extensive guide to Apache Spark modules and tools. Using worked examples, it shows how Spark's functionality can be extended for real-time processing and storage.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Mastering Apache Spark
    1. Table of Contents
    2. Mastering Apache Spark
    3. Credits
    4. Foreword
    5. About the Author
    6. About the Reviewers
    7. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    9. 1. Apache Spark
      1. Overview
        1. Spark Machine Learning
        2. Spark Streaming
        3. Spark SQL
        4. Spark graph processing
        5. Extended ecosystem
        6. The future of Spark
      2. Cluster design
      3. Cluster management
        1. Local
        2. Standalone
        3. Apache YARN
        4. Apache Mesos
        5. Amazon EC2
      4. Performance
        1. The cluster structure
        2. The Hadoop file system
        3. Data locality
        4. Memory
        5. Coding
      5. Cloud
      6. Summary
    10. 2. Apache Spark MLlib
      1. The environment configuration
        1. Architecture
        2. The development environment
        3. Installing Spark
      2. Classification with Naïve Bayes
        1. Theory
        2. Naïve Bayes in practice
      3. Clustering with K-Means
        1. Theory
        2. K-Means in practice
      4. ANN – Artificial Neural Networks
        1. Theory
        2. Building the Spark server
        3. ANN in practice
      5. Summary
    11. 3. Apache Spark Streaming
      1. Overview
      2. Errors and recovery
        1. Checkpointing
      3. Streaming sources
        1. TCP stream
        2. File streams
        3. Flume
        4. Kafka
      4. Summary
    12. 4. Apache Spark SQL
      1. The SQL context
      2. Importing and saving data
        1. Processing the Text files
        2. Processing the JSON files
        3. Processing the Parquet files
      3. DataFrames
      4. Using SQL
      5. User-defined functions
      6. Using Hive
        1. Local Hive Metastore server
        2. A Hive-based Metastore server
      7. Summary
    13. 5. Apache Spark GraphX
      1. Overview
      2. GraphX coding
        1. Environment
        2. Creating a graph
        3. Example 1 – counting
        4. Example 2 – filtering
        5. Example 3 – PageRank
        6. Example 4 – triangle counting
        7. Example 5 – connected components
      3. Mazerunner for Neo4j
        1. Installing Docker
        2. The Neo4j browser
        3. The Mazerunner algorithms
          1. The PageRank algorithm
          2. The closeness centrality algorithm
          3. The triangle count algorithm
          4. The connected components algorithm
          5. The strongly connected components algorithm
      4. Summary
    14. 6. Graph-based Storage
      1. Titan
      2. TinkerPop
      3. Installing Titan
      4. Titan with HBase
        1. The HBase cluster
        2. The Gremlin HBase script
        3. Spark on HBase
        4. Accessing HBase with Spark
      5. Titan with Cassandra
        1. Installing Cassandra
        2. The Gremlin Cassandra script
        3. The Spark Cassandra connector
        4. Accessing Cassandra with Spark
      6. Accessing Titan with Spark
        1. Gremlin and Groovy
        2. TinkerPop's Hadoop Gremlin
        3. Alternative Groovy configuration
        4. Using Cassandra
        5. Using HBase
        6. Using the filesystem
      7. Summary
    15. 7. Extending Spark with H2O
      1. Overview
      2. The processing environment
      3. Installing H2O
      4. The build environment
      5. Architecture
      6. Sourcing the data
      7. Data Quality
      8. Performance tuning
      9. Deep learning
        1. Example code – income
        2. The example code – MNIST
      10. H2O Flow
      11. Summary
    16. 8. Spark Databricks
      1. Overview
      2. Installing Databricks
      3. AWS billing
      4. Databricks menus
      5. Account management
      6. Cluster management
      7. Notebooks and folders
      8. Jobs and libraries
      9. Development environments
      10. Databricks tables
        1. Data import
        2. External tables
      11. The DbUtils package
        1. Databricks file system
        2. Dbutils fsutils
        3. The DbUtils cache
        4. The DbUtils mount
      12. Summary
    17. 9. Databricks Visualization
      1. Data visualization
        1. Dashboards
        2. An RDD-based report
        3. A stream-based report
      2. REST interface
        1. Configuration
        2. Cluster management
        3. The execution context
        4. Command execution
        5. Libraries
      3. Moving data
        1. The table data
        2. Folder import
        3. Library import
      4. Further reading
      5. Summary
    18. Index