Table of Contents
Preface
Section 1: Data Engineering
Chapter 1: Distributed Computing Primer
Technical requirements
Distributed Computing
Introduction to Distributed Computing
Data Parallel Processing
Data Parallel Processing using the MapReduce paradigm
Distributed Computing with Apache Spark
Introduction to Apache Spark
Data Parallel Processing with RDDs
Higher-order functions
Apache Spark cluster architecture
Getting started with Spark
Big data processing with Spark SQL and DataFrames
Transforming data with Spark DataFrames
Using SQL on Spark
What's new in Apache Spark 3.0?
Summary
Chapter 2: Data Ingestion
Technical requirements
Introduction to Enterprise Decision Support Systems
Ingesting data from data sources
Ingesting ...