Book description
Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.
Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.
With this book, you’ll explore:
- How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure (see the short sketch after this list)
- The choice between data joins in Core Spark and Spark SQL
- Techniques for getting the most out of standard RDD transformations
- How to work around performance issues in Spark’s key/value pair paradigm
- Writing high-performance Spark code without Scala or the JVM
- How to test for functionality and performance when applying suggested improvements
- Using Spark MLlib and Spark ML machine learning libraries
- Spark’s Streaming components and external community packages
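As a taste of the kind of technique covered, here is a minimal, hypothetical sketch (toy data and invented column names, not an excerpt from the book) that contrasts reduceByKey with groupByKey on a key/value RDD and then performs the same aggregation through the DataFrame API, where Spark SQL's Catalyst optimizer plans the work:

```scala
import org.apache.spark.sql.SparkSession

object AggregationSketch {
  def main(args: Array[String]): Unit = {
    // local[*] is used here only so the sketch runs standalone.
    val spark = SparkSession.builder()
      .appName("AggregationSketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical (id, score) pairs.
    val pairs = spark.sparkContext.parallelize(Seq((1, 2.0), (1, 3.0), (2, 5.0)))

    // Key/value RDD: reduceByKey combines values map-side before the shuffle,
    // unlike groupByKey, which ships every value for a key across the network.
    val sums = pairs.reduceByKey(_ + _)
    sums.collect().foreach(println)

    // The same aggregation through the DataFrame API, leaving the physical
    // aggregation strategy to the Catalyst optimizer.
    import spark.implicits._
    val df = pairs.toDF("id", "score")
    df.groupBy("id").sum("score").show()

    spark.stop()
  }
}
```

The book goes considerably deeper on both paths: when each aggregation operation is appropriate, and how the DataFrame and Dataset interfaces let Spark SQL optimize work that hand-written RDD code cannot.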
Table of contents
- Preface
- 1. Introduction to High Performance Spark
- 2. How Spark Works
- 3. DataFrames, Datasets, and Spark SQL
  - Getting Started with the SparkSession (or HiveContext or SQLContext)
  - Spark SQL Dependencies
  - Basics of Schemas
  - DataFrame API
  - Data Representation in DataFrames and Datasets
  - Data Loading and Saving Functions
  - Datasets
  - Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
  - Query Optimizer
  - Debugging Spark SQL Queries
  - JDBC/ODBC Server
  - Conclusion
- 4. Joins (SQL and Core)
- 5. Effective Transformations
- 6. Working with Key/Value Data
  - The Goldilocks Example
  - Actions on Key/Value Pairs
  - What’s So Dangerous About the groupByKey Function
  - Choosing an Aggregation Operation
  - Multiple RDD Operations
  - Partitioners and Key/Value Data
  - Dictionary of OrderedRDDOperations
  - Secondary Sort and repartitionAndSortWithinPartitions
  - Straggler Detection and Unbalanced Data
  - Conclusion
- 7. Going Beyond Scala
- 8. Testing and Validation
- 9. Spark MLlib and ML
  - Choosing Between Spark MLlib and Spark ML
  - Working with MLlib
  - Working with Spark ML
    - Spark ML Organization and Imports
    - Pipeline Stages
    - Explain Params
    - Data Encoding
    - Data Cleaning
    - Spark ML Models
    - Putting It All Together in a Pipeline
    - Training a Pipeline
    - Accessing Individual Stages
    - Data Persistence and Spark ML
    - Extending Spark ML Pipelines with Your Own Algorithms
    - Model and Pipeline Persistence and Serving with Spark ML
    - General Serving Considerations
  - Conclusion
- 10. Spark Components and Packages
- A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist
- Index
Product information
- Title: High Performance Spark
- Author(s): Holden Karau, Rachel Warren
- Release date: May 2017
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491943151