Fast Data Processing with Spark

Book Description

Spark offers a streamlined way to write distributed programs, and this tutorial gives software developers the know-how to make the most of Spark's features.

  • Use Spark's interactive shell to prototype distributed applications
  • Deploy Spark jobs to a variety of environments, including Mesos, EC2, Chef, YARN, and EMR
  • Use Shark's SQL-like query syntax with Spark

In Detail

Spark is a framework for writing fast, distributed programs. It solves problems similar to those addressed by Hadoop MapReduce, but with a fast in-memory approach and a clean, functional-style API. Spark integrates with Hadoop and includes built-in tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), so it can be used interactively to process and query big data sets quickly.

Fast Data Processing with Spark covers how to write distributed MapReduce-style programs with Spark. The book guides you through every step required to write effective distributed programs, from setting up your cluster and interactively exploring the API to deploying your job to the cluster and tuning it for your purposes.

Fast Data Processing with Spark covers everything from setting up your Spark cluster in a variety of situations (stand-alone, EC2, and so on) to using the interactive shell to write distributed code. From there, we move on to writing and deploying distributed jobs in Java, Scala, and Python.

We then examine how to use the interactive shell to quickly prototype distributed programs and explore the Spark API. We also look at how Shark lets you use Hive's SQL-like query syntax with Spark, and at how to manipulate resilient distributed datasets (RDDs).

Table of Contents

  1. Fast Data Processing with Spark
    1. Table of Contents
    2. Fast Data Processing with Spark
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
        1. Why Subscribe?
        2. Free Access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Disclaimer
        3. Errata
        4. Piracy
        5. Questions
    8. Chapter 1: Installing Spark and Setting Up Your Cluster
      1. Running Spark on a single machine
      2. Running Spark on EC2
        1. Running Spark on EC2 with the scripts
      3. Deploying Spark on Elastic MapReduce
      4. Deploying Spark with Chef (opscode)
      5. Deploying Spark on Mesos
      6. Deploying Spark on YARN
      7. Deploying set of machines over SSH
      8. Links and references
      9. Summary
    9. Chapter 2: Using the Spark Shell
      1. Loading a simple text file
      2. Using the Spark shell to run logistic regression
      3. Interactively loading data from S3
      4. Summary
    10. Chapter 3: Building and Running a Spark Application
      1. Building your Spark project with sbt
      2. Building your Spark job with Maven
      3. Building your Spark job with something else
      4. Summary
    11. Chapter 4: Creating a SparkContext
      1. Scala
      2. Java
      3. Shared Java and Scala APIs
      4. Python
      5. Links and references
      6. Summary
    12. Chapter 5: Loading and Saving Data in Spark
      1. RDDs
      2. Loading data into an RDD
      3. Saving your data
      4. Links and references
      5. Summary
    13. Chapter 6: Manipulating Your RDD
      1. Manipulating your RDD in Scala and Java
        1. Scala RDD functions
        2. Functions for joining PairRDD functions
        3. Other PairRDD functions
        4. DoubleRDD functions
        5. General RDD functions
        6. Java RDD functions
        7. Spark Java function classes
          1. Common Java RDD functions
        8. Methods for combining JavaPairRDD functions
          1. JavaPairRDD functions
      2. Manipulating your RDD in Python
        1. Standard RDD functions
        2. PairRDD functions
      3. Links and references
      4. Summary
    14. Chapter 7: Shark – Using Spark with Hive
      1. Why Hive/Shark?
      2. Installing Shark
      3. Running Shark
      4. Loading data
      5. Using Hive queries in a Spark program
      6. Links and references
      7. Summary
    15. Chapter 8: Testing
      1. Testing in Java and Scala
        1. Refactoring your code for testability
        2. Testing interactions with SparkContext
      2. Testing in Python
      3. Links and references
      4. Summary
    16. Chapter 9: Tips and Tricks
      1. Where to find logs?
      2. Concurrency limitations
      3. Memory usage and garbage collection
      4. Serialization
      5. IDE integration
      6. Using Spark with other languages
      7. A quick note on security
      8. Mailing lists
      9. Links and references
      10. Summary
    17. Index