Learn about the fastest-growing open source project in the world, and find out how it revolutionizes big data analytics
About This Book
- Exclusive guide that covers how to get up and running with fast data processing using Apache Spark
- Explore and exploit various possibilities with Apache Spark using real-world use cases
- Want to perform efficient data processing in real time? This book will be your one-stop solution.
Who This Book Is For
This guide appeals to big data engineers, analysts, architects, software engineers, and even technical managers who need to perform efficient data processing on Hadoop in real time. Basic familiarity with Java or Scala will be helpful.
Readers are assumed to come from mixed backgrounds, but will typically have a background in engineering or data science, no prior Spark experience, and a desire to understand how Spark can help them on their analytics journey.
What You Will Learn
- Get an overview of big data analytics and its importance for organizations and data professionals
- Delve into Spark to see how it is different from existing processing platforms
- Understand the intricacies of various file formats, and how to process them with Apache Spark
- Learn how to deploy Spark with YARN, Mesos, or a standalone cluster manager
- Learn the concepts of Spark SQL, SchemaRDD, caching, and working with Hive and Parquet file formats
- Understand the architecture of Spark MLlib while discussing some of the off-the-shelf algorithms that come with Spark
- Introduce yourself to the deployment and usage of SparkR
- Walk through the importance of graph computation and the graph processing systems available in the market
- Work through a real-world example by building a recommendation engine with Spark using ALS
- Use a telco data set to predict customer churn using Random Forests
The Spark juggernaut keeps rolling and gains more momentum each day. Spark provides key capabilities in the form of Spark SQL, Spark Streaming, Spark ML, and GraphX, all accessible via Java, Scala, Python, and R. Deploying these capabilities is crucial, whether on a standalone framework or as part of an existing Hadoop installation configured with YARN or Mesos.
The next part of the journey after installation is using the key components: APIs, clustering, machine learning APIs, data pipelines, and parallel programming. It is important to understand why each framework component is key, how widely it is used, how stable it is, and its pertinent use cases.
Once we understand the individual components, we will work through a couple of real-life advanced analytics examples, such as ‘Building a Recommendation system’ and ‘Predicting customer churn’.
The objective of these real-life examples is to give the reader confidence in using Spark for real-world problems.
Style and approach
With the help of practical examples and real-world use cases, this guide will take you from scratch to building efficient data applications using Apache Spark.
You will learn all about this excellent data processing engine in a step-by-step manner, taking one aspect of it at a time.
This highly practical guide covers how to work with data pipelines, DataFrames, clustering, Spark SQL, parallel programming, and other such topics with the help of real-world use cases.
Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.
Table of Contents
Learning Apache Spark 2
- Learning Apache Spark 2
- About the Author
- About the Reviewers
- Customer Feedback
1. Architecture and Installation
- Apache Spark architecture overview
- Installing Apache Spark
- Writing your first Spark program
- Spark architecture
- Apache Spark cluster manager types
- Running Spark examples
- Brain teasers
2. Transformations and Actions with Spark RDDs
- What is an RDD?
- Operations on RDD
- Passing functions to Spark (Scala)
- Passing functions to Spark (Java)
- Passing functions to Spark (Python)
- Set operations in Spark
- Shared variables
3. ETL with Spark
- What is ETL?
- How is Spark being used?
- Commonly Supported File Formats
- Commonly supported file systems
Structured Data sources and Databases
Working with NoSQL Databases
- Working with Cassandra
- Working with HBase
- Working with MongoDB
- Working with Apache Solr
- Working with NoSQL Databases
4. Spark SQL
- What is Spark SQL?
- What is DataFrame API?
- What is DataSet API?
- What's new in Spark 2.0?
- The SparkSession
- Creating a DataFrame
- Parquet files
- Working with Hive
- SparkSQL CLI
5. Spark Streaming
- What is Spark Streaming?
- Steps involved in a streaming app
- Architecture of Spark Streaming
- Caching and persistence
- DStream best practices
- Fault tolerance
- What is Structured Streaming?
6. Machine Learning with Spark
- What is machine learning?
- Why machine learning?
- Types of machine learning
- Introduction to Spark MLlib
- Why do we need the Pipeline API?
- How does it work?
- Feature engineering
- Classification and regression
- Collaborative filtering
- ML-tuning - model selection and hyperparameter tuning
7. GraphX
- Graphs in everyday life
- What is a graph?
- Why are Graphs elegant?
- What is GraphX?
- Creating your first Graph (RDD API)
- Basic graph operators (RDD API)
- Caching and uncaching of graphs
- Graph algorithms in GraphX
- Comparison between GraphFrames and GraphX
8. Operating in Clustered Mode
- Clusters, nodes and daemons
- Running Spark in standalone mode
- Using the Cluster Launch Scripts to Start a Standalone Cluster
- Running Spark in YARN
- Running Spark in Mesos
9. Building a Recommendation System
- What is a recommendation system?
- User specific recommendations
Key issues with recommendation systems
- Gathering known input data
- Predicting unknown from known ratings
Recommendation system in Spark
- Sample dataset
- How does Spark offer recommendation?
10. Customer Churn Prediction
- Overview of customer churn
- Why is predicting customer churn important?
How do we predict customer churn with Spark?
- Data set description
- Code example
- Defining schema
- Loading data
- Data exploration
- Comparing minutes data for churners and non-churners
- Comparing charge data for churners and non-churners
- Exploring customer service calls
11. There's More with Spark
- Performance tuning
- I/O tuning
- Sizing up your executors
- The skew problem
- Security configuration in Spark
- Setting up Jupyter Notebook with Spark
- Shared variables
- Title: Learning Apache Spark 2
- Release date: March 2017
- Publisher(s): Packt Publishing
- ISBN: 9781785885136