Apache Spark 2 for Beginners

Book Description

Develop large-scale distributed data processing applications using Spark 2 in Scala and Python

About This Book
  • This book offers an easy introduction to the Spark framework, based on the latest version of Apache Spark 2
  • Perform efficient data processing, machine learning and graph processing using various Spark components
  • A practical guide aimed at beginners to get them up and running with Spark
Who This Book Is For

If you are an application developer, data scientist, or big data solutions architect who wants to use the data processing power of Spark from R, and to consolidate data processing, stream processing, machine learning, and graph processing into one unified, highly interoperable framework with a uniform API in Scala or Python, this book is for you.

What You Will Learn
  • Get to know the fundamentals of Spark 2 and the Spark programming model using Scala and Python
  • Know how to use Spark SQL and DataFrames using Scala and Python
  • Get an introduction to Spark programming using R
  • Perform Spark data processing, charting, and plotting using Python
  • Get acquainted with Spark stream processing using Scala and Python
  • Be introduced to machine learning using Spark MLlib
  • Get started with graph processing using the Spark GraphX library
  • Bring together all that you've learned and develop a complete Spark application
In Detail

Spark is one of the most widely used large-scale data processing engines, and it is known for its speed. Its tools are equally useful to application developers and data scientists.

This book starts with the fundamentals of Spark 2 and covers the core data processing framework and API, installation, and application development setup. The Spark programming model is then introduced through real-world examples, followed by Spark SQL programming with DataFrames. An introduction to SparkR comes next. Later, we cover the charting and plotting features of Python in conjunction with Spark data processing. After that, we take a look at Spark's stream processing, machine learning, and graph processing libraries. The last chapter combines the skills you learned in the preceding chapters to develop a real-world Spark application.

By the end of this book, you will have all the knowledge you need to develop efficient large-scale applications using Apache Spark.

Style and approach

Learn about Spark's infrastructure with this practical tutorial. Using real-world use cases that cover the main features of Spark, we offer an easy introduction to the framework.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Apache Spark 2 for Beginners
    1. Apache Spark 2 for Beginners
    2. Credits
    3. About the Author
    4. About the Reviewer
    5. www.PacktPub.com
      1. eBooks, discount offers, and more
    6. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    7. 1. Spark Fundamentals
      1. An overview of Apache Hadoop
      2. Understanding Apache Spark
      3. Installing Spark on your machines
        1. Python installation
        2. R installation
        3. Spark installation
        4. Development tool installation
        5. Optional software installation
          1. IPython
          2. RStudio
          3. Apache Zeppelin
      4. References
      5. Summary
    8. 2. Spark Programming Model
      1. Functional programming with Spark
      2. Understanding Spark RDD
        1. Spark RDD is immutable
        2. Spark RDD is distributable
        3. Spark RDD lives in memory
        4. Spark RDD is strongly typed
      3. Data transformations and actions with RDDs
      4. Monitoring with Spark
      5. The basics of programming with Spark
        1. MapReduce
        2. Joins
        3. More actions
      6. Creating RDDs from files
      7. Understanding the Spark library stack
      8. Reference
      9. Summary
    9. 3. Spark SQL
      1. Understanding the structure of data
      2. Why Spark SQL?
      3. Anatomy of Spark SQL
      4. DataFrame programming
        1. Programming with SQL
        2. Programming with DataFrame API
      5. Understanding Aggregations in Spark SQL
      6. Understanding multi-datasource joining with SparkSQL
      7. Introducing datasets
      8. Understanding Data Catalogs
      9. References
      10. Summary
    10. 4. Spark Programming with R
      1. The need for SparkR
      2. Basics of the R language
      3. DataFrames in R and Spark
      4. Spark DataFrame programming with R
        1. Programming with SQL
        2. Programming with R DataFrame API
      5. Understanding aggregations in Spark R
      6. Understanding multi-datasource joins with SparkR
      7. References
      8. Summary
    11. 5. Spark Data Analysis with Python
      1. Charting and plotting libraries
      2. Setting up a dataset
      3. Data analysis use cases
      4. Charts and plots
        1. Histogram
        2. Density plot
        3. Bar chart
          1. Stacked bar chart
        4. Pie chart
          1. Donut chart
        5. Box plot
        6. Vertical bar chart
        7. Scatter plot
          1. Enhanced scatter plot
        8. Line graph
      5. References
      6. Summary
    12. 6. Spark Stream Processing
      1. Data stream processing
      2. Micro batch data processing
        1. Programming with DStreams
      3. A log event processor
        1. Getting ready with the Netcat server
        2. Organizing files
        3. Submitting the jobs to the Spark cluster
        4. Monitoring running applications
        5. Implementing the application in Scala
        6. Compiling and running the application
        7. Handling the output
        8. Implementing the application in Python
      4. Windowed data processing
        1. Counting the number of log event messages processed in Scala
        2. Counting the number of log event messages processed in Python
      5. More processing options
      6. Kafka stream processing
        1. Starting Zookeeper and Kafka
        2. Implementing the application in Scala
        3. Implementing the application in Python
      7. Spark Streaming jobs in production
        1. Implementing fault-tolerance in Spark Streaming data processing applications
        2. Structured streaming
      8. References
      9. Summary
    13. 7. Spark Machine Learning
      1. Understanding machine learning
      2. Why Spark for machine learning?
      3. Wine quality prediction
      4. Model persistence
      5. Wine classification
      6. Spam filtering
      7. Feature algorithms
      8. Finding synonyms
      9. References
      10. Summary
    14. 8. Spark Graph Processing
      1. Understanding graphs and their usage
      2. The Spark GraphX library
        1. GraphX overview
        2. Graph partitioning
        3. Graph processing
        4. Graph structure processing
      3. Tennis tournament analysis
      4. Applying the PageRank algorithm
      5. Connected component algorithm
      6. Understanding GraphFrames
      7. Understanding GraphFrames queries
      8. References
      9. Summary
    15. 9. Designing Spark Applications
      1. Lambda Architecture
      2. Microblogging with Lambda Architecture
        1. An overview of SfbMicroBlog
        2. Getting familiar with data
        3. Setting the data dictionary
      3. Implementing Lambda Architecture
        1. Batch layer
        2. Serving layer
        3. Speed layer
          1. Queries
      4. Working with Spark applications
      5. Coding style
      6. Setting up the source code
      7. Understanding data ingestion
      8. Generating purposed views and queries
      9. Understanding custom data processes
      10. References
      11. Summary