Apache Spark Quick Start Guide

Book description

A practical guide to solving complex data processing challenges by applying the best optimization techniques in Apache Spark.

Key Features

  • Learn about the core concepts and the latest developments in Apache Spark
  • Master writing efficient big data applications with Spark's built-in modules for SQL, streaming, machine learning, and graph analysis
  • Get introduced to a variety of optimizations based on real-world experience

Book Description

Apache Spark is a flexible framework that allows the processing of both batch and real-time data. Its unified engine has made it quite popular for big data use cases. This book will help you get started with Apache Spark 2.0 and write big data applications for a variety of use cases.

Apache Spark is one of the most popular big data processing frameworks, and although this book is intended as a quick start, it also focuses on explaining the core concepts.

This practical guide provides a quick start to the Spark 2.0 architecture and its components. It teaches you how to set up Spark on your local machine. As we move ahead, you will be introduced to resilient distributed datasets (RDDs) and DataFrame APIs, and their corresponding transformations and actions. Then, we move on to the life cycle of a Spark application and learn about the techniques used to debug slow-running applications. You will also go through Spark's built-in modules for SQL, streaming, machine learning, and graph analysis.
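
The concepts in the preceding paragraph can be previewed with a short, self-contained Scala sketch. It is not taken from the book; it assumes only a local SparkSession, and the application name, the 1-to-100 sample data, and the column name n are arbitrary illustrations. The RDD lines show a lazy transformation (map) followed by an action (reduce), and the DataFrame lines express the same aggregation declaratively.

    import org.apache.spark.sql.SparkSession

    object QuickStartSketch {
      def main(args: Array[String]): Unit = {
        // Local SparkSession for experimentation; "local[*]" uses all available cores
        val spark = SparkSession.builder()
          .appName("quick-start-sketch")
          .master("local[*]")
          .getOrCreate()

        // RDD API: map() is a lazy transformation; reduce() is an action that triggers execution
        val rdd = spark.sparkContext.parallelize(1 to 100)
        val sumOfSquares = rdd.map(x => x * x).reduce(_ + _)
        println(s"Sum of squares: $sumOfSquares")

        // DataFrame API: the same data with a named column and a declarative aggregation
        import spark.implicits._
        val df = (1 to 100).toDF("n")
        df.selectExpr("sum(n * n) AS sum_of_squares").show()

        spark.stop()
      }
    }

The same statements can also be entered interactively in spark-shell, which creates the SparkSession for you.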

Finally, the book will lay out the best practices and optimization techniques that are key for writing efficient Spark applications. By the end of this book, you will have a sound fundamental understanding of the Apache Spark framework and you will be able to write and optimize Spark applications.

What you will learn

  • Learn core concepts such as RDDs, DataFrames, transformations, and more
  • Set up a Spark development environment
  • Choose the right APIs for your applications
  • Understand Spark's architecture and the execution flow of a Spark application
  • Explore built-in modules for SQL, streaming, ML, and graph analysis
  • Optimize your Spark job for better performance

Who this book is for

If you are a big data enthusiast and love processing huge amounts of data, this book is for you. If you are a data engineer looking for the best optimization techniques for your Spark applications, you will also find this book helpful. This book likewise helps data scientists who want to implement their machine learning algorithms in Spark. You need a basic understanding of at least one programming language, such as Scala, Python, or Java.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Apache Spark Quick Start Guide
  3. About Packt
    1. Why subscribe?
    2. Packt.com
  4. Contributors
    1. About the authors
    2. About the reviewer
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  6. Introduction to Apache Spark
    1. What is Spark?
    2. Spark architecture overview
    3. Spark language APIs
      1. Scala
      2. Java
      3. Python
      4. R
      5. SQL
    4. Spark components
      1. Spark Core
      2. Spark SQL
      3. Spark Streaming
      4. Spark machine learning
      5. Spark graph processing
      6. Cluster manager
        1. Standalone scheduler
        2. YARN
        3. Mesos
        4. Kubernetes
    5. Making the most of Hadoop and Spark
    6. Summary
  7. Apache Spark Installation
    1. AWS Elastic Compute Cloud (EC2)
      1. Creating a free account on AWS
      2. Connecting to your Linux instance
    2. Configuring Spark
      1. Prerequisites
      2. Installing Java
      3. Installing Scala
      4. Installing Python
      5. Installing Spark
      6. Using Spark components
        1. Different modes of execution
      7. Spark sandbox
    3. Summary
  8. Spark RDD
    1. What is an RDD?
      1. Resilient metadata
    2. Programming using RDDs
    3. Transformations and actions
      1. Transformation
        1. Narrow transformations
          1. map()
          2. flatMap()
          3. filter()
          4. union()
          5. mapPartitions()
        2. Wide transformations
          1. distinct()
          2. sortBy()
          3. intersection()
          4. subtract()
          5. cartesian()
      2. Action
        1. collect()
        2. count()
        3. take()
        4. top()
        5. takeOrdered()
        6. first()
        7. countByValue()
        8. reduce()
        9. saveAsTextFile()
        10. foreach()
    4. Types of RDDs
      1. Pair RDDs
        1. groupByKey()
        2. reduceByKey()
        3. sortByKey()
        4. join()
    5. Caching and checkpointing
      1. Caching
      2. Checkpointing 
    6. Understanding partitions 
      1. repartition() versus coalesce()
      2. partitionBy()
    7. Drawbacks of using RDDs
    8. Summary
  9. Spark DataFrame and Dataset
    1. DataFrames
      1. Creating DataFrames
      2. Data sources
      3. DataFrame operations and associated functions
      4. Running SQL on DataFrames
        1. Temporary views on DataFrames
        2. Global temporary views on DataFrames
    2. Datasets
      1. Encoders
      2. Internal row
        1. Creating custom encoders
    3. Summary
  10. Spark Architecture and Application Execution Flow
    1. A sample application
      1. DAG constructor
        1. Stage
          1. Tasks
      2. Task scheduler
        1. FIFO
        2. FAIR
    2. Application execution modes
      1. Local mode
      2. Client mode
      3. Cluster mode
    3. Application monitoring
      1. Spark UI
      2. Application logs
      3. External monitoring solution
    4. Summary
  11. Spark SQL
    1. Spark SQL
      1. Spark metastore
        1. Using the Hive metastore in Spark SQL
        2. Hive configuration with Spark
      2. SQL language manual
        1. Database
        2. Table and view
        3. Load data
        4. Creating UDFs
      3. SQL database using JDBC
    2. Summary
  12. Spark Streaming, Machine Learning, and Graph Analysis
    1. Spark Streaming
      1. Use cases
      2. Data sources
      3. Stream processing
        1. Microbatch
        2. DStreams
      4. Streaming architecture
      5. Streaming example
    2. Machine learning
      1. MLlib
      2. ML
    3. Graph processing
      1. GraphX
        1. mapVertices
        2. mapEdges
        3. subgraph
      2. GraphFrames
        1. degrees
        2. subgraphs
      3. Graph algorithms
        1. PageRank
    4. Summary
  13. Spark Optimizations
    1. Cluster-level optimizations
      1. Memory
      2. Disk
      3. CPU cores
      4. Project Tungsten
    2. Application optimizations
      1. Language choice
      2. Structured versus unstructured APIs
      3. File format choice
      4. RDD optimizations
        1. Choosing the right transformations
        2. Serializing and compressing 
        3. Broadcast variables
      5. DataFrame and dataset optimizations
        1. Catalyst optimizer
        2. Storage 
        3. Parallelism 
        4. Join performance
        5. Code generation 
        6. Speculative execution
    3. Summary
  14. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Apache Spark Quick Start Guide
  • Author(s): Shrey Mehrotra, Akash Grade
  • Release date: January 2019
  • Publisher(s): Packt Publishing
  • ISBN: 9781789349108