Book description
Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark.
In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms using the PySpark driver and shell script.
With this book, you will:
- Learn how to select Spark transformations for optimized solutions
- Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions()
- Understand data partitioning for optimized queries
- Build and apply a model using PySpark design patterns
- Apply motif-finding algorithms to graph data
- Analyze graph data by using the GraphFrames API
- Apply PySpark algorithms to clinical and genomics data
- Learn how to use and apply feature engineering in ML algorithms
- Understand and use practical and pragmatic data design patterns
Table of contents
- Foreword
- Preface
- I. Fundamentals
- 1. Introduction to Spark and PySpark
- 2. Transformations in Action
- 3. Mapper Transformations
- 4. Reductions in Spark
- II. Working with Data
- 5. Partitioning Data
- 6. Graph Algorithms
- 7. Interacting with External Data Sources
- Relational Databases
- Reading Text Files
- Reading and Writing CSV Files
- Reading and Writing JSON Files
- Reading from and Writing to Amazon S3
- Reading and Writing Hadoop Files
- Reading and Writing Parquet Files
- Reading and Writing Avro Files
- Reading from and Writing to MS SQL Server
- Reading Image Files
- Summary
- 8. Ranking Algorithms
- III. Data Design Patterns
- 9. Classic Data Design Patterns
- 10. Practical Data Design Patterns
- 11. Join Design Patterns
- 12. Feature Engineering in PySpark
- Index
- About the Author
Product information
- Title: Data Algorithms with Spark
- Author(s): Mahmoud Parsian
- Release date: April 2022
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781492082385