Skip to Main Content
Data Algorithms with Spark
book

Data Algorithms with Spark

by Mahmoud Parsian
April 2022
Intermediate to advanced content levelIntermediate to advanced
435 pages
9h 44m
English
O'Reilly Media, Inc.
Book available
Content preview from Data Algorithms with Spark

Chapter 1. Introduction to Spark and PySpark

Spark is a powerful analytics engine for large-scale data processing that aims at speed, ease of use, and extensibility for big data applications. It’s a proven and widely adopted technology used by many companies that handle big data every day. Though Spark’s “native” language is Scala (most of Spark is developed in Scala), it also provides high-level APIs in Java, Python, and R.

In this book we’ll be using Python via PySpark, an API that exposes the Spark programming model to Python. With Python being the most accessible programming language and Spark’s powerful and expressive API, PySpark’s simplicity makes it the best choice for us. PySpark is an interface for Spark in the Python programming language that provides the following two important features:

  • It allows us to write Spark applications using Python APIs.

  • It provides the PySpark shell for interactively analyzing data in a distributed environment.

The purpose of this chapter is to introduce PySpark as the main component of the Spark ecosystem and show you that it can be effectively used for big data tasks such as ETL operations, indexing billions of documents, ingesting millions of genomes, machine learning, graph data analysis, DNA data analysis, and much more. I’ll start by reviewing the Spark and PySpark architectures, and provide examples to show the expressive power of PySpark. I will present an overview of Spark’s core functions (transformations and actions) and ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Data Algorithms

Data Algorithms

Mahmoud Parsian
Algorithms and Data Structures for Massive Datasets

Algorithms and Data Structures for Massive Datasets

Dzejla Medjedovic, Emin Tahirovic, Ines Schweigert

Publisher Resources

ISBN: 9781492082378Errata PageSupplemental Content