Chapter 1
Data Ingestion and Data Extraction with Apache Spark

Apache Spark is a powerful distributed computing framework for large-scale data processing. One of the most common tasks when working with data is loading it from a variety of sources and writing it back out in different formats. In this hands-on chapter, you will learn how to load and write data files with Apache Spark using Python.
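
As a first, minimal sketch of what that looks like in practice, the following PySpark snippet reads a CSV file and writes it back out as Parquet. The input and output paths here are placeholders for your own data, not files that ship with the book:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session
    spark = SparkSession.builder.appName("ingestion-example").getOrCreate()

    # Read a CSV file with a header row, letting Spark infer column types
    # (the input path is a placeholder, not a file shipped with this book)
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/tmp/customers.csv")
    )

    df.printSchema()

    # Write the same data back out as Parquet, replacing any earlier output
    df.write.mode("overwrite").parquet("/tmp/customers_parquet")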

In this chapter, we’re going to cover the following recipes (a short PySpark sketch of a couple of these patterns follows the list):

  • Reading CSV data with Apache Spark
  • Reading JSON data with Apache Spark
  • Reading Parquet data with Apache Spark
  • Parsing XML data with Apache Spark
  • Working with nested data structures in Apache Spark
  • Processing text data in Apache Spark
  • Writing data with Apache Spark
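
To give a feel for a couple of these recipes before we begin, here is a small sketch that reads a JSON file containing nested data and flattens it, assuming records with a nested customer struct and an items array; the path and field names are invented for illustration and are not taken from the book's datasets:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("nested-json-example").getOrCreate()

    # Read newline-delimited JSON; Spark infers a nested schema automatically
    # (the path and field names below are invented for illustration)
    orders = spark.read.json("/tmp/orders.json")

    # Reach into a nested struct with dot notation and flatten an array
    # column into one row per element
    flat = (
        orders
        .select(
            col("order_id"),
            col("customer.name").alias("customer_name"),
            explode(col("items")).alias("item"),
        )
        .select("order_id", "customer_name", "item.sku", "item.quantity")
    )

    flat.show()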

By the end of this chapter, you will be able to read CSV, JSON, Parquet, XML, and text data into Spark DataFrames, work with nested data structures, and write the results back out in the format of your choice.
