Essential PySpark for Scalable Data Analytics

Preface

Section 1: Data Engineering

Chapter 1: Distributed Computing Primer

Technical requirements

Distributed Computing

Introduction to Distributed Computing

Data Parallel Processing

Data Parallel Processing using the MapReduce paradigm

Distributed Computing with Apache Spark

Introduction to Apache Spark

Data Parallel Processing with RDDs

Higher-order functions

Apache Spark cluster architecture

Getting started with Spark

Big data processing with Spark SQL and DataFrames

Transforming data with Spark DataFrames

Using SQL on Spark

What's new in Apache Spark 3.0?

Summary

Chapter 2: Data Ingestion

Technical requirements

Introduction to Enterprise Decision Support Systems

Ingesting data from data sources

Ingesting ...

Essential PySpark for Scalable Data Analytics by Sreeram Nudurupati
